Flag diacritics

Posted by – May 11, 2009

As some readers will be aware, I’ve been “implementing flag diacritics” at my new job. This post is all about what that means.

We have a reader program that reads in morphological transducers and uses them to analyse words. A morphological transducer is a (representation of a) collection of rules about a language’s inflection. For example, if you give our French morphology transducer the word déclare, it will output:

déclarer+verb+singular+imperative+present+secondPerson
déclarer+verb+singular+indicative+present+firstPerson
déclarer+verb+singular+indicative+present+thirdPerson
déclarer+verb+singular+subjunctive+present+firstPerson
déclarer+verb+singular+subjunctive+present+thirdPerson

From morphology alone we don’t know which of those is right (that’s a matter for another blog post), but those are the only possibilities. For example, “déclarer+verb+singular+indicative+present+secondPerson” doesn’t appear because that would be (tu) déclares.

Flag diacritics are a way to express long-distance morphological rules. For example, let’s say you have a language with productive compounding (one in which lots of words can form compounds with each other, like Finnish) and in which grammatical suffixes vary according to whether another word is going to be compounded onto them. A simple way to express this is to add a marker to one class of suffixes saying “another noun must be compounded onto this” and to leave it off the other class. In the Sámi morphology there’s something like this going on (but with noun-verb compounding) and it’s controlled with the following flag diacritics:

@P.NeedNoun.ON@
@D.NeedNoun.ON@
@C.NeedNoun@

As you might have guessed, flag diacritics are always delimited by @-signs. @P.NeedNoun.ON@ means “set the NeedNoun feature to have the value ON”, @D.NeedNoun.ON@ means “if the NeedNoun feature has the value ON, this combination is disallowed” and @C.NeedNoun@ means “clear the value of the NeedNoun feature”. Before support for this was added, the Sámi transducer gave ten analyses for the word láhkaásahus, two of which were:
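To make the notation concrete, here is a minimal Python sketch that picks flag diacritics out of an analysis string and splits each into its operator, feature and optional value. It is not the reader program's actual code: the names `FLAG_RE`, `Flag` and `find_flags` are made up for illustration, and the pattern assumes only the two shapes seen above (`@OP.Feature@` and `@OP.Feature.Value@`).

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of a flag diacritic: @OP.Feature@ or @OP.Feature.Value@
FLAG_RE = re.compile(r"@([A-Z])\.([^.@]+)(?:\.([^.@]+))?@")


class Flag(NamedTuple):
    op: str                # e.g. "P", "D" or "C"
    feature: str           # e.g. "NeedNoun"
    value: Optional[str]   # e.g. "ON"; None when the flag carries no value


def find_flags(analysis: str) -> List[Flag]:
    """Return the flag diacritics in `analysis`, in left-to-right order."""
    return [Flag(*m.groups()) for m in FLAG_RE.finditer(analysis)]


if __name__ == "__main__":
    for flag in find_flags("ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#"):
        print(flag)
```

Keeping the flags in left-to-right order matters, because each one is interpreted against whatever the earlier ones have set.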

láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#
láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#
(# means “word boundary”)

The first one shouldn’t appear at all because first we set NeedNoun to ON (because we’re trying to interpret the ásahus part as a compounding verb which should be followed by a noun) and then disallow it (because we’ve reached the end of the word so we’re not going to be compounding any more nouns). The second, however, is ok: first we clear NeedNoun (which changes nothing since it hadn’t been set in the first place), then @D.NeedNoun.ON@ says “NeedNoun must not be set to ON”, which it isn’t. Also we of course shouldn’t be outputting the flag diacritics themselves. The desired output out of those two is therefore

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
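
The walk-through above can be sketched in a few lines of Python. This is only an illustration that post-processes finished analysis strings, under the assumption that the three operators described in this post (P, D with a value, and C) are the only ones present. It is not how the reader program itself does it, and `filter_analysis` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Only the three flag forms described in this post are recognised (an assumption).
FLAG_RE = re.compile(r"@([PDC])\.([^.@]+)(?:\.([^.@]+))?@")


def filter_analysis(analysis: str) -> Optional[str]:
    """Return `analysis` with its flags stripped, or None if a D flag rejects it."""
    features: Dict[str, str] = {}
    for match in FLAG_RE.finditer(analysis):
        op, feature, value = match.groups()
        if op == "P" and value is not None:
            features[feature] = value        # set the feature to the value
        elif op == "C":
            features.pop(feature, None)      # clear; harmless if it was never set
        elif op == "D" and value is not None:
            if features.get(feature) == value:
                return None                  # disallowed combination
    return FLAG_RE.sub("", analysis)


if __name__ == "__main__":
    candidates = [
        "láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#",
        "láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#",
    ]
    for candidate in candidates:
        print(filter_analysis(candidate))
```

Run on the two analyses above, the first comes back as `None` (rejected) and the second as the flag-free string shown above.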

Out of the ten possible analyses of láhkaásahus six are disallowed by the flag diacritics, so in this case it’s a pretty important rule. For any Sámi enthusiasts out there, the four currently produced analyses are

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
láhka#ásahus+N+Sg+Nom#
láhka+N+SgCmp#ásahus+N+Sg+Nom#
láhka+N+SgNomCmp#ásahus+N+Sg+Nom#
