SFCM 2011

Posted by – August 28, 2011

I’m back from my first ever scientific conference, SFCM 2011 in Zurich. My top two favourite talks were Lauri Karttunen’s keynote, “Beyond Morphology – Pattern Matching with FST”, and “Non-canonical inflection” by Benoît Sagot and Géraldine Walther. An honourable mention goes to “Morphology to the Rescue Redux: Resolving Borrowings and Code-mixing in Machine Translation” by Esmé Manandise and Claudia Gdaniec. I demoed stuff for our HFST3 paper.

Karttunen presented some obvious-in-retrospect extensions to FST matching, rewriting and tagging and an implementation thereof in an algorithm/utility called pmatch. It’s mostly a combination of recursive transition networks and the insight that with some algorithmic trickery, it’s sufficient to match the end of a subpattern when you want to do left-to-right longest-match matching/tagging. The extensions he described most were

  • EndTag(), which is a command that gets compiled into special instructions for pmatch to wrap a pattern or subpattern in tags without the need to produce a transducer that’s always trying to output the start tag and enter failing transitions of the subpattern network, and
  • Ins(), which in RTN-style refers to a separate network to be pseudo-inserted at the current location.

These are achieved with flag-diacritic-style special symbols, although pmatch itself doesn’t support flag diacritics. Hopefully we’ll have all this functionality in HFST one day, alongside flag-induced hyperminimization – an interesting topic I should write about at some point. Put together, these techniques should significantly remedy the problems of networks becoming combinatorially huge in certain situations.
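
To make the longest-match idea concrete, here is a minimal Python sketch of left-to-right longest-match tagging. It is not pmatch and makes no claim about how pmatch works internally: the patterns are a plain set of strings rather than transducers, and the function name tag_longest and the tag name Loc are made up for illustration. The one thing it loosely shares with the EndTag() idea is that nothing is written for a candidate match until its end has been found.

```python
def tag_longest(text, patterns, tag):
    """Wrap leftmost-longest occurrences of any pattern in <tag>...</tag>.

    Hypothetical helper for illustration only; `patterns` is a collection
    of literal strings, not a compiled network.
    """
    # Trying longer patterns first guarantees the longest match at each position.
    by_length = sorted(patterns, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        match = next((p for p in by_length if p and text.startswith(p, i)), None)
        if match is None:
            # No pattern starts here: copy one character and move on.
            out.append(text[i])
            i += 1
        else:
            # The tags are emitted only now that the end of the match is known.
            out.append(f"<{tag}>{match}</{tag}>")
            i += len(match)
    return "".join(out)


print(tag_longest("from New York to New York City",
                  {"New York", "New York City"}, "Loc"))
# prints: from <Loc>New York</Loc> to <Loc>New York City</Loc>
```

The second occurrence is tagged as “New York City” rather than “New York” followed by untagged text, which is what longest-match means here.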

Intermission

For the benefit of people who aren’t interested in computational morphology, here’s some travel stuff.

I’m not a big fan of travel, and was reminded why by almost everything going wrong. My flight was cancelled, and I had to queue for ages to be rerouted via Brussels, and almost missed that flight as well. All told, it took me over 10 hours to get from my house to the hotel in Zurich, leaving less time than I’d hoped to prepare for the demonstration session. And everything was sucky and expensive and my feet hurt and it’s just not worth it to ever leave home :(

Also, Blue1 is a terrible airline company and Swiss is nice (you get free chocolate).

Switzerland is about as orderly, clean and organized as you might imagine. A while ago a Japanese post-doc at the math department was leaving Helsinki to go to do math at an American university, and he sent a nice going-away email to people he’d met in Finland. He wrote “Finland is the 2nd most well-organized country among the places I have ever been (unfortunately you could not beat Japan, sorry!)” – I think he must have missed out on Switzerland.


Famous Swiss hospitality

(That said, there were definitely more representatives of ethnic minorities than in, say, Helsinki.)

The Swiss don’t mess around; each and every lamppost had a sticker like this:

Does it work?

I never saw a single extraneous piece of paper on these things.

Also, a little-known fact: Swiss people are in fact made out of polished steel.

I like the place. These guys know how to live.

End of intermission

Benoît and Géraldine had done work on a system for compactly describing certain irregular (“non-canonical”) phenomena in inflection:

  • suppletion (where some forms have an alternate stem or affixes)
  • heteroclisis (where some words have a mixed paradigm from several regular forms)
  • defectiveness (where certain forms are missing from the paradigm)
  • overabundance (where some forms have more than one realisation)
  • deponency (where certain words inherit part of another’s paradigm in the “wrong” context, e.g. singular suffixation for expressing plural in some Croatian nouns)

They had used their approach to describe French irregular verbs, and also implemented several other well-known descriptions by French linguists. They wanted to show that their approach was best or most natural (at least most compact), and did so by estimating the Kolmogorov complexity of these schemes. This is something I’ve often thought about doing (examining linguistic theories by implementing them), so I’m happy that work is happening in this area.

Overall, SFCM was damn well organized, interesting, motivating and fun to attend – many thanks to the organizers, speakers and attendees!

Hell, world!

Posted by – September 3, 2009

I am supposed to learn enough libtool and autotools to package our current library and utilities in a “do it right” way. Some of my favourite things about this task so far:

  • From the libtool manual:

    But of course, that would be too simple, so many systems require that you run the ranlib command on the resulting library (to give it better karma, or something)

  • There’s a libtool demo of a trivial Hello World program & library packaged with autotools. configure.ac and Makefile.am are about 250 lines put together.
  • The program part of the demo is called “hello”, but the library is called “hell”

find-file-other-frames

Posted by – August 6, 2009

This post is for emacs users.

Every day I start work by opening a bunch of files in emacs frames. Sometimes they’re all .java, sometimes .cc and .h. Sometimes they’re all the files in some directory. C-x 5 f this.cc, C-x 5 f that.cc, C-x 5 f the-other.cc etc. I could use wildcards, but then all the files would be loaded into the same frame. There must be a better way! The following remaps the bindings for find-file-other-frame to a function called find-file-other-frames, which loads one file into the current frame and the rest (found by using wildcards) into new frames of their own. If you only want to find one file, it’s opened into a new frame as usual.

;; a find-file-other-frame that for multiple files opens a new frame for each
;; one except the first
(defun find-file-other-frames (filename &optional wildcards)
  "Edit file FILENAME, in another frame.
  Like `find-file-other-frame', but in the case of multiple files loads the
  first one into the current frame and creates a new frame for each of
  the remaining ones."
  (interactive (find-file-read-args "Find file(s) in other frame(s): " nil))
  (let ((value (find-file-noselect filename nil nil wildcards)))
    (if (listp value)
      (progn
        (setq value (nreverse value))
        (cons (switch-to-buffer (car value))
          (mapcar 'switch-to-buffer-other-frame (cdr value))))
      (switch-to-buffer-other-frame value))))
 
(define-key ;; replace the keybindings
  (current-global-map) [remap find-file-other-frame] 'find-file-other-frames)

edit: a perhaps better way to do the last two lines:

(substitute-key-definition
  'find-file-other-frame 'find-file-other-frames (current-global-map))

Flag diacritics

Posted by – May 11, 2009

As some readers will be aware, I’ve been “implementing flag diacritics” at my new job. This post is all about what that means.

We have a reader program which reads in morphological transducers and uses them to analyse words. A morphological transducer is a (representation of a) collection of rules about a language’s inflection. For example, if you give our French morphology transducer the word déclare, it will output:

déclarer+verb+singular+imperative+present+secondPerson
déclarer+verb+singular+indicative+present+firstPerson
déclarer+verb+singular+indicative+present+thirdPerson
déclarer+verb+singular+subjunctive+present+firstPerson
déclarer+verb+singular+subjunctive+present+thirdPerson

From morphology alone we don’t know which of those is right (that’s a matter for another blog post), but those are the only possibilities. For example, “déclarer+verb+singular+indicative+present+secondPerson” doesn’t appear because that would be (tu) déclares.

Flag diacritics are a way to express long-distance morphological rules. For example, let’s say you have a language with productive compounding (one in which lots of words can form compounds with each other, like Finnish) and in which grammatical suffixes vary according to whether another word is going to be compounded onto them. A simple way to express this is to add a marker to one class of suffixes saying “another noun must be compounded onto this” and not to have it in the other class. In the Sámi morphology there’s something like this going on (but with noun-verb-compounding) and it’s controlled with the following flag diacritics:

@P.NeedNoun.ON@
@D.NeedNoun.ON@
@C.NeedNoun@

As you might have guessed, flag diacritics are always delimited by @-signs. @P.NeedNoun.ON@ means “set the NeedNoun feature to have the value ON”, @D.NeedNoun.ON@ means “if the NeedNoun feature has the value ON, this combination is disallowed” and @C.NeedNoun@ means “clear the value of the NeedNoun feature”. Before support for this was added, the Sámi transducer gave ten analyses for the word láhkaásahus, two of which were:

láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#
láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#
(# means “word boundary”)

The first one shouldn’t appear at all because first we set NeedNoun to ON (because we’re trying to interpret the ásahus part as a compounding verb which should be followed by a noun) and then disallow it (because we’ve reached the end of the word so we’re not going to be compounding any more nouns). The second, however, is ok: first we clear NeedNoun (which changes nothing since it hadn’t been set in the first place), then @D.NeedNoun.ON@ says “NeedNoun must not be set to ON”, which it isn’t. Also we of course shouldn’t be outputting the flag diacritics themselves. The desired output out of those two is therefore

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#

Out of the ten possible analyses of láhkaásahus six are disallowed by the flag diacritics, so in this case it’s a pretty important rule. For any Sámi enthusiasts out there, the four currently produced analyses are

láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
láhka#ásahus+N+Sg+Nom#
láhka+N+SgCmp#ásahus+N+Sg+Nom#
láhka+N+SgNomCmp#ásahus+N+Sg+Nom#
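
The P, D and C rules described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic applied to finished analysis strings, not the reader program itself (a real implementation would presumably check flags while traversing the transducer rather than afterwards). The function name apply_flags is made up, and only the three flag forms mentioned in this post are handled.

```python
import re

# Matches @P.Feature.Value@, @D.Feature.Value@ and @C.Feature@.
FLAG = re.compile(r"@([PDC])\.([^@]+)@")


def apply_flags(analysis):
    """Return the analysis with flags stripped, or None if a D flag rejects it.

    Hypothetical helper: flags are processed left to right, as they occur.
    """
    features = {}  # feature name -> current value
    for op, body in FLAG.findall(analysis):
        feature, _, value = body.partition(".")
        if op == "P":
            # Set the feature to the given value.
            features[feature] = value
        elif op == "C":
            # Clear the feature; clearing an unset feature changes nothing.
            features.pop(feature, None)
        elif op == "D" and value and features.get(feature) == value:
            # Disallow: the feature currently has the forbidden value.
            return None
    # The flags themselves should not appear in the output.
    return FLAG.sub("", analysis)


verb_reading = "láhkka+N+SgGenCmp#@P.NeedNoun.ON@ásahit+V+TV+Imprt+Prs+Sg3@D.NeedNoun.ON@#"
noun_reading = "láhkka+N+SgGenCmp#ásahus@C.NeedNoun@+N+Sg+Nom@D.NeedNoun.ON@#"

print(apply_flags(verb_reading))  # prints: None
print(apply_flags(noun_reading))  # prints: láhkka+N+SgGenCmp#ásahus+N+Sg+Nom#
```

The two inputs are the example analyses from earlier in this post; the first is rejected and the second comes out with its flags removed.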

Just carrying out my activities

Posted by – December 4, 2007

My dad once wrote a column about how sometimes a concept is difficult to translate not because you can’t think of the right expression but because there is none. Even if you somehow find a good way to describe what the original text says, anyone reading it in the target language will still have no idea what’s going on. These situations often indicate hard-to-pin-down differences in the way languages and cultures are.

I’ve started to run into this myself in my budding working life. I’m trying to “fix” the English in a presentation about a tourist resort and struggling with “programme services”. What is that? In Finnish it’s obviously been “ohjelmapalvelut”, but I suspected that in English “programme services” doesn’t mean anything. I googled the term and sure enough all the hits are either Finnish tourism brochure-type things (the top hit was Espoon Matkailu) or something to do with TV companies. Evidently, translators from Finnish have conspired to decide that this concept which apparently doesn’t exist in other languages is to be “programme services” no matter how little sense it makes. But I can’t possibly live with that, it’s just… wrong. So now I’ve agonised over it for maybe half an hour and come up with reworking the sentence completely to use “activity”, a wonderful word that turns up rather a lot in any commercial translation from Finnish to English.

I just hope there aren’t too many more ohjelmapalvelus coming up.