Tag: linguistics

Syllable counts

Posted by – November 25, 2009

I learnt on Wikipedia that the intro to Whipping Post is in 11/4 time. 11/4, what the hell? Eventually I figured out how that goes, and also noticed that Finnish is no good for counting beats past 10. Up to 10, most numbers have natural one-syllable abbreviations, but 11 doesn’t: it has 4 syllables, one more than in English. Hmm. Long story short, I made a graph of the syllable counts of the counting numbers up to 100 in 9 languages (thanks to Zet@#aspekti for many of them):


[Graph: syllable counts of the numbers 1 to 100 in 9 languages]

Some observations:

  • French is most compact, except in the 80-100 range where English (which is quite consistent overall) is best
  • Finnish really is verbose
  • Northern Sámi is most boring
  • Estonian is quite interesting
  • Everybody counts in base 10
  • 3-4 syllables is a sweet spot
  • Graphs with too many lines in them are difficult to read
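
For the curious, here is a rough sketch of how the Finnish line of such a graph could be computed. It is not the method used for the graph above. It assumes the standard spelled-out numerals and counts one syllable per run of vowels inside each component word, which happens to hold for these numerals but is not a general Finnish syllabifier. The function names are made up for illustration.

```python
# Finnish vowels; every other letter is treated as a consonant.
VOWELS = set("aeiouyäö")

ONES = ["yksi", "kaksi", "kolme", "neljä", "viisi",
        "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]


def finnish_parts(n):
    """Return the component words of the Finnish numeral for 1 <= n <= 100."""
    if not 1 <= n <= 100:
        raise ValueError("only 1..100 are supported")
    if n == 100:
        return ["sata"]
    tens, ones = divmod(n, 10)
    if tens == 0:
        return [ONES[ones - 1]]
    if tens == 1:
        # 10 is "kymmenen"; 11..19 are "<ones>toista"
        return ["kymmenen"] if ones == 0 else [ONES[ones - 1], "toista"]
    parts = [ONES[tens - 1], "kymmentä"]
    if ones:
        parts.append(ONES[ones - 1])
    return parts


def count_nuclei(word):
    """Count runs of consecutive vowels (assumption: one run = one syllable)."""
    count = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return count


def syllables(n):
    """Syllable count of the Finnish numeral n, summed over its components."""
    # Counting per component keeps "kymmentä" + "yksi" from merging "äy".
    return sum(count_nuclei(part) for part in finnish_parts(n))


if __name__ == "__main__":
    for n in (7, 11, 21, 100):
        print(n, "".join(finnish_parts(n)), syllables(n))
```

With these assumptions, syllables(11) gives 4, which matches the count mentioned at the top of the post.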

Btw, a good solution to the Finnish problem: go hexadecimal. Yks kaks kol nel viis kuu see kaa yy aa bee cee dee ee äf.

edit: oh, and code to languages:

Suomi = Finnish
Français = French
Svenska = Swedish
Eesti = Estonian
Davvisámengiella = Northern Sámi, spoken in Lapland
Magyar = Hungarian
Afsoomaali = Somali
Komi = Komi, a Uralic language

Flag diacritics

Posted by – May 11, 2009

As some readers will be aware, I’ve been “implementing flag diacritics” at my new job. This post is all about what that means.

We have a reader program which reads in morphological transducers and uses them to analyse words. A morphological transducer is a (representation of a) collection of rules about a language’s inflection. For example, if you give our French morphological transducer the word déclare, it will output:

déclarer+verb+singular+imperative+present+secondPerson
déclarer+verb+singular+indicative+present+firstPerson
déclarer+verb+singular+indicative+present+thirdPerson
déclarer+verb+singular+subjunctive+present+firstPerson
déclarer+verb+singular+subjunctive+present+thirdPerson

From morphology alone we don’t know which of those is right (that’s a matter for another blog post), but those are the only possibilities. For example, “déclarer+verb+singular+indicative+present+secondPerson” doesn’t appear because that would be (tu) déclares.

Flag diacritics are a way to express long-distance morphological rules. For example, let’s say you have a language with productive compounding (one in which lots of words can form compounds with each other, like Finnish) and in which grammatical suffixes vary according to whether another word is going to be compounded onto them. A simple way to express this is to add a marker to one class of suffixes saying “another noun must be compounded onto this” and not to have it in the other class. In the Sámi morphology there’s something like this going on (but with noun-verb-compounding) and it’s controlled with the following flag diacritics:

@P.NeedNoun.ON@
@D.NeedNoun.ON@
@C.NeedNoun@

As you might have guessed, flag diacritics are always delimited by @-signs. @P.NeedNoun.ON@ means “set the NeedNoun feature to have the value ON”, @D.NeedNoun.ON@ means “if the NeedNoun feature has the value ON, this combination is disallowed” and @C.NeedNoun@ means “clear the value of the NeedNoun feature”. Before support for this was added, the Sámi transducer gave ten analyses for the word láhkaásahus, two of which were:

láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#
láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#
(# means “word boundary”)

The first one shouldn’t appear at all, because first we set NeedNoun to ON (because we’re trying to interpret the ásahus part as a compounding verb which should be followed by a noun) and then disallow it (because we’ve reached the end of the word, so we’re not going to be compounding any more nouns). The second, however, is ok: first we clear NeedNoun (which changes nothing since it hadn’t been set in the first place), then @D.NeedNoun.ON@ says “NeedNoun must not be set to ON”, which it isn’t. Also, we of course shouldn’t be outputting the flag diacritics themselves. The desired output out of those two is therefore

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#

Out of the ten possible analyses of láhkaásahus six are disallowed by the flag diacritics, so in this case it’s a pretty important rule. For any Sámi enthusiasts out there, the four currently produced analyses are

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
láhka#ásahus+N+Sg+Nom#
láhka+N+SgCmp#ásahus+N+Sg+Nom#
láhka+N+SgNomCmp#ásahus+N+Sg+Nom#
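
The set/disallow/clear logic described in this post can be sketched as a filter over raw analysis strings. This is only an illustration, not our reader's actual code: a real implementation would more likely check the flags while walking the transducer instead of cleaning up finished strings, and the function name apply_flags is made up. Only the three flag forms shown above (@P.F.V@, @D.F.V@, @C.F@) are handled.

```python
import re

# Matches exactly the three forms used in this post:
#   @P.Feature.Value@  set Feature to Value
#   @D.Feature.Value@  disallow if Feature currently equals Value
#   @C.Feature@        clear Feature
FLAG = re.compile(
    r"@(?:(?P<op>[PD])\.(?P<feat>[^.@]+)\.(?P<val>[^.@]+)"
    r"|C\.(?P<clear>[^.@]+))@"
)


def apply_flags(analysis):
    """Return the analysis with flags stripped, or None if a D flag rejects it."""
    features = {}
    for m in FLAG.finditer(analysis):  # flags are evaluated left to right
        if m.group("clear") is not None:
            # Clearing an unset feature changes nothing.
            features.pop(m.group("clear"), None)
        elif m.group("op") == "P":
            features[m.group("feat")] = m.group("val")
        elif features.get(m.group("feat")) == m.group("val"):
            # D flag whose feature has the forbidden value: reject.
            return None
    # The flags themselves must not appear in the output.
    return FLAG.sub("", analysis)


if __name__ == "__main__":
    raw = [
        "láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#",
        "láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#",
    ]
    for analysis in raw:
        print(apply_flags(analysis))
```

Run on the two raw analyses quoted earlier, the first is rejected (None) and the second comes out as láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#.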

Put this and a language parser in your pipe and smoke it

Posted by – November 11, 2008

The well-known examples of ambiguity in natural language mostly stem from polysemy (especially across word categories), viz. Time flies like an arrow. In the “advanced” semantics & pragmatics class now being taught at the linguistics department I’ve become more attuned to the room for nuance in semantic (sometimes called thematic) roles, especially in Finnish.

One aspect of this is that many undergoer-roles are expressed with the accusative case which, morphologically speaking, doesn’t independently exist. By this I mean that Fennicists don’t consider Finnish to have an accusative case; instead, direct objects are marked with the genitive (perfective aspect: söin omenan, “I ate an apple”) and partitive (progressive aspect: söin omenaa, “I was eating an apple”) cases. But from a typological (comparative linguistics, if you will) point of view, these together with the special word forms of personal pronouns (minut, sinut, hänet, meidät, teidät, heidät) constitute an accusative case.

This leads to entertaining ambiguities between possessives and objects. My favourite example is from an actual headline from some years ago:

Mies ampui vaimonsa kännykän haulikolla
Man shoot+imperf wife+gen+poss(of wife by man) mobile-phone+gen shotgun+adessive(”with”)
Man shoots wife’s mobile phone with shotgun

Some of the ambiguity is present in the English translation as polysemy, but notably here both “wife” and “mobile phone” are in the genitive, so either one (or neither!) can be taken as the direct object. If the mobile phone is a possessive form, the shotgun is naturally a special function present in it. I’m sure there would be demand for something like this in the market. Thus we may read:
The man shot his wife’s mobile phone with a shotgun
The man shot his wife with the shotgun function in his mobile phone
The man shot (something) with the shotgun function of his wife’s mobile phone

Additionally:
The man shot (used as ammunition) his wife’s mobile phone with a shotgun
The man shot (used as ammunition) his wife with the shotgun function in his mobile phone

Of course, this example is a bit pathological – most of the time, using syntax and morphology to mark semantic roles is sensible business.