Tag: linguistics

The long now

Posted by – November 6, 2017

I was looking into papers on the relative significance of words used in texts, and ran into an interesting-looking one about automatically generating abstracts for technical papers, only to find it’s from 1958!

(In case you’re wondering, this 60-year-old task remains mostly unsolved.)

SFCM 2011

Posted by – August 28, 2011

I’m back from my first ever scientific conference, SFCM 2011 in Zurich. My top two favourite talks were Lauri Karttunen’s keynote, “Beyond Morphology – Pattern Matching with FST”, and “Non-canonical inflection” by Benoît Sagot and Géraldine Walther. An honourable mention goes to “Morphology to the Rescue Redux: Resolving Borrowings and Code-mixing in Machine Translation” by Esmé Manandise and Claudia Gdaniec. I demoed stuff for our HFST3 paper.

Karttunen presented some obvious-in-retrospect extensions to FST matching, rewriting and tagging and an implementation thereof in an algorithm/utility called pmatch. It’s mostly a combination of recursive transition networks and the insight that with some algorithmic trickery, it’s sufficient to match the end of a subpattern when you want to do left-to-right longest-match matching/tagging. The extensions he described most were

  • EndTag(), which is a command that gets compiled into special instructions for pmatch to wrap a pattern or subpattern in tags without the need to produce a transducer that’s always trying to output the start tag and enter failing transitions of the subpattern network, and
  • Ins(), which in RTN-style refers to a separate network to be pseudo-inserted at the current location.

These are achieved with flag-diacritic-style special symbols, although pmatch itself doesn’t support flag diacritics. Hopefully we’ll have all this functionality in HFST one day, alongside flag-induced hyperminimization – an interesting topic I should write about at some point. Put together, these techniques should significantly remedy the problem of networks becoming combinatorially huge in certain situations.
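For the curious, here is a toy illustration of what left-to-right longest-match tagging means. It is only a sketch over a plain list of strings: it is not pmatch, uses no transducers, and the function name `tag_longest_matches` and the `Loc` tag are made up for this example.

```python
def tag_longest_matches(text, patterns, tag):
    """Wrap each left-to-right longest match of any pattern in <tag>...</tag>.

    `patterns` is a list of non-empty literal strings (an assumption of this
    sketch; a real matcher would use compiled networks instead).
    """
    out = []
    i = 0
    while i < len(text):
        # Longest pattern that matches starting at position i, if any.
        best = max(
            (p for p in patterns if text.startswith(p, i)),
            key=len,
            default="",
        )
        if best:
            out.append(f"<{tag}>{best}</{tag}>")
            i += len(best)  # continue scanning after the match
        else:
            out.append(text[i])  # no match here: copy one character
            i += 1
    return "".join(out)


tagged = tag_longest_matches(
    "New York City is in New York",
    ["New York", "New York City"],
    "Loc",
)
print(tagged)
```

Both patterns match at the start of the string, and the longer one wins there. The shorter one is only used at the end, where the longer one no longer fits.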

Intermission

For the benefit of people who aren’t interested in computational morphology, here’s some travel stuff.

I’m not a big fan of travel, and was reminded why by almost everything going wrong. My flight was cancelled, and I had to queue for ages to be rerouted via Brussels, and almost missed that flight as well. All told, it took me over 10 hours to get from my house to the hotel in Zurich, leaving less time than I’d hoped to prepare for the demonstration session. And everything was sucky and expensive and my feet hurt and it’s just not worth it to ever leave home :(

Also, Blue1 is a terrible airline company and Swiss is nice (you get free chocolate).

Switzerland is about as orderly, clean and organized as you might imagine. A while ago a Japanese post-doc at the math department was leaving Helsinki to go to do math at an American university, and he sent a nice going-away email to people he’d met in Finland. He wrote “Finland is the 2nd most well-organized country among the places I have ever been (unfortunately you could not beat Japan, sorry!)” – I think he must have missed out on Switzerland.


Famous Swiss hospitality

(That said, there were definitely more representatives of ethnic minorities than in, say, Helsinki.)

The Swiss don’t mess around; each and every lamppost had a sticker like this:

Does it work?

I never saw a single extraneous piece of paper on these things.

Also, a little-known fact: Swiss people are in fact made out of polished steel.

I like the place. These guys know how to live.

End of intermission

Benoît and Géraldine had done work on a system for compactly describing certain irregular (“non-canonical”) phenomena in inflection:

  • suppletion (where some forms have an alternate stem or affixes)
  • heteroclisis (where some words have a mixed paradigm from several regular forms)
  • defectiveness (where certain forms are missing from the paradigm)
  • overabundance (where some forms have more than one realisation)
  • deponency (where certain words inherit part of another’s paradigm in the “wrong” context, e.g. singular suffixation for expressing plural in some Croatian nouns)

They had used their approach to describe French irregular verbs, and also implemented several other well-known descriptions by French linguists. They wanted to show that their approach was best or most natural (at least most compact), and did so by estimating the Kolmogorov complexity of these schemes. This is something I’ve often thought about doing (examining linguistic theories by implementing them), so I’m happy that work is happening in this area.

Overall, SFCM was damn well organized, interesting, motivating and fun to attend – many thanks to the organizers, speakers and attendees!

Syllable counts

Posted by – November 25, 2009

I learnt on Wikipedia that the intro to Whipping Post is in 11/4 time. 11/4, what the hell? Eventually I figured out how that goes, and also noticed that Finnish is no good for counts that go above 10. Till then most numbers have natural one-syllable abbreviations, but 11 doesn’t. Its unabbreviated form has 4 syllables, one more than in English. Hmm. Long story short, I made a graph of the syllable counts of the counting numbers up to 100 in 9 languages (thanks to Zet@#aspekti for many of them):



Some observations:

  • French is most compact, except in the 80-100 range where English (which is quite consistent overall) is best
  • Finnish really is verbose
  • Northern Sámi is most boring
  • Estonian is quite interesting
  • Everybody counts in base 10
  • 3-4 syllables is a sweet spot
  • Graphs with too many lines in them are difficult to read

Btw, a good solution to the Finnish problem: go hexadecimal. Yks kaks kol nel viis kuu see kaa yy aa bee cee dee ee äf.

edit: oh, and a key to the language names:

Suomi = Finnish
Français = French
Svenska = Swedish
Eesti = Estonian
Davvisámengiella = Northern Sámi, spoken in Lapland
Magyar = Hungarian
Afsoomaali = Somali
Komi = Komi, a Uralic language

Flag diacritics

Posted by – May 11, 2009

As some readers will be aware, I’ve been “implementing flag diacritics” at my new job. This post is all about what that means.

We have a reader program which reads in morphological transducers and uses them to analyse words. A morphological transducer is a (representation of a) collection of rules about a language’s inflection. For example, if you give our French morphology transducer the word déclare, it will output:

déclarer+verb+singular+imperative+present+secondPerson
déclarer+verb+singular+indicative+present+firstPerson
déclarer+verb+singular+indicative+present+thirdPerson
déclarer+verb+singular+subjunctive+present+firstPerson
déclarer+verb+singular+subjunctive+present+thirdPerson

From morphology alone we don’t know which of those is right (that’s a matter for another blog post), but those are the only possibilities. For example “déclarer+verb+singular+indicative+present+secondPerson” doesn’t appear because that would be (tu) déclares.
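To make the output format concrete, here is a small sketch that splits the analysis strings above into a lemma and a list of tags, and then works out which tags are shared by all five readings. It assumes that `+` separates the fields and that the first field is the lemma, which is just what the output above looks like; `parse_analysis` is a name invented for this example, not part of our reader.

```python
def parse_analysis(line):
    """Split 'lemma+tag1+tag2...' into (lemma, [tag1, tag2, ...])."""
    lemma, *tags = line.split("+")
    return lemma, tags


analyses = [
    "déclarer+verb+singular+imperative+present+secondPerson",
    "déclarer+verb+singular+indicative+present+firstPerson",
    "déclarer+verb+singular+indicative+present+thirdPerson",
    "déclarer+verb+singular+subjunctive+present+firstPerson",
    "déclarer+verb+singular+subjunctive+present+thirdPerson",
]

parsed = [parse_analysis(a) for a in analyses]

# Tags present in every analysis: the part morphology alone is sure about.
common = set.intersection(*(set(tags) for _, tags in parsed))

# Tags that differ between analyses: the part left ambiguous.
varying = [[t for t in tags if t not in common] for _, tags in parsed]

print(sorted(common))
for v in varying:
    print(v)
```

The ambiguity is confined to mood and person; the lemma, word class, number and tense are the same in every reading.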

Flag diacritics are a way to express long-distance morphological rules. For example, let’s say you have a language with productive compounding (one in which lots of words can form compounds with each other, like Finnish) and in which grammatical suffixes vary according to whether another word is going to be compounded onto them. A simple way to express this is to add a marker to one class of suffixes saying “another noun must be compounded onto this” and not to have it in the other class. In the Sámi morphology there’s something like this going on (but with noun-verb-compounding) and it’s controlled with the following flag diacritics:

@P.NeedNoun.ON@
@D.NeedNoun.ON@
@C.NeedNoun@

As you might have guessed, flag diacritics are always delimited by @-signs. @P.NeedNoun.ON@ means “set the NeedNoun feature to have the value ON”, @D.NeedNoun.ON@ means “if the NeedNoun feature has the value ON, this combination is disallowed” and @C.NeedNoun@ means “clear the value of the NeedNoun feature”. Before support for this was added, the Sámi transducer gave ten analyses for the word láhkaásahus, two of which were:

láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#
láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#
(# means “word boundary”)

The first one shouldn’t appear at all because first we set NeedNoun to ON (because we’re trying to interpret the ásahus part as a compounding verb which should be followed by a noun) and then disallow it (because we’ve reached the end of the word so we’re not going to be compounding any more nouns). The second, however, is ok: first we clear NeedNoun (which changes nothing since it hadn’t been set in the first place), then @D.NeedNoun.ON@ says “NeedNoun must not be set to ON”, which it isn’t. Also we of course shouldn’t be outputting the flag diacritics themselves. The desired output out of those two is therefore

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
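The filtering described above can be sketched in a few lines. This is a simplified after-the-fact filter over finished analysis strings, not how the reader actually handles flags, and it only knows the three flag forms shown in this post (P with a value, D with a value, and C); the name `apply_flags` is made up for the example.

```python
import re

# Matches flag diacritics such as @P.NeedNoun.ON@, @D.NeedNoun.ON@ or @C.NeedNoun@.
FLAG = re.compile(r"@([PDC])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flags(analysis):
    """Return the analysis with flags stripped, or None if a D flag rejects it."""
    features = {}  # feature name -> current value
    for op, feature, value in FLAG.findall(analysis):
        if op == "P":
            # Set the feature to the given value.
            features[feature] = value
        elif op == "C":
            # Clear the feature; harmless if it was never set.
            features.pop(feature, None)
        elif op == "D" and features.get(feature) == value:
            # The feature currently has the disallowed value: reject.
            return None
    # The analysis survived; don't output the flags themselves.
    return FLAG.sub("", analysis)


first = "láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#"
second = "láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#"

print(apply_flags(first))
print(apply_flags(second))
```

The first analysis is rejected (the function returns None) and the second comes out with its flags removed.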

Out of the ten possible analyses of láhkaásahus six are disallowed by the flag diacritics, so in this case it’s a pretty important rule. For any Sámi enthusiasts out there, the four currently produced analyses are

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
láhka#ásahus+N+Sg+Nom#
láhka+N+SgCmp#ásahus+N+Sg+Nom#
láhka+N+SgNomCmp#ásahus+N+Sg+Nom#

Put this and a parser in your pipe and smoke it

Posted by – November 11, 2008

The well-known examples of ambiguity in natural language mostly stem from polysemy (especially across word categories), as in “Time flies like an arrow”. In the “advanced” semantics & pragmatics class now being taught at the linguistics department I’ve become more attuned to the room for nuance in semantic (sometimes called thematic) roles, especially in Finnish.

One aspect of this is that many undergoer-roles are expressed with the accusative case which, morphologically speaking, doesn’t independently exist. By this I mean that Fennicists don’t consider Finnish to have an accusative case; instead, direct objects are marked with the genitive (perfect aspect, söin omenan, “I ate an apple”) and partitive (progressive aspect, söin omenaa, “I was eating an apple”) cases. But from a typological (comparative linguistics, if you will) point of view, these together with the special word forms of personal pronouns (minut, sinut, hänet, meidät, teidät, heidät) constitute an accusative case.

This leads to entertaining ambiguities between possessives and objects. My favourite example is from an actual headline from some years ago:

Mies ampui vaimonsa kännykän haulikolla
Man shoot+imperf wife+gen+poss(of wife by man) mobile-phone+gen shotgun+adessive(“with”)
Man shoots wife’s mobile phone with shotgun

Some of the ambiguity is present in the English translation as polysemy, but notably here both “wife” and “mobile phone” are in the genitive, so either one (or neither!) can be taken as the direct object. If the mobile phone is a possessive form, the shotgun is naturally a special function present in it. I’m sure there would be demand for something like this in the market. Thus we may read:
The man shot his wife’s mobile phone with a shotgun
The man shot his wife with the shotgun function in his mobile phone
The man shot (something) with the shotgun function of his wife’s mobile phone

Additionally:
The man shot (used as ammunition) his wife’s mobile phone with a shotgun
The man shot (used as ammunition) his wife with the shotgun function in his mobile phone

Of course, this is a bit pathological – for the most part, using syntax and morphology to mark semantics is a sensible business.