Lemmatising Finnish Words: 1
Taming Synthetic Morphology
A lemma is the “dictionary head” of inflectional and derivational word forms. For example, variations like ‘speaks’, ‘spoke’, ‘spoken’, ‘speaking’, ‘speaker’, ‘speakers’, etc. occur in natural sentences, but their meanings are found under the entry ‘SPEAK’. Lemmatisation is the process of mapping forms to their lemma.
This is relatively easy to do in English and German. You can usually find a big table listing each lemma with its known associated forms. For English, I used the 14K list manually combed by Nation. For German there’s Morphy.
It’s not so simple for Finnish, due to compounds, synthetic morphology, and phonological processes at morpheme boundaries.
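To make the look-up idea concrete, here is a minimal sketch of a table-based lemmatiser. The table is a toy one built from the ‘speak’ example above, and the function name `lemmatise` is just for illustration; a real table would be loaded from a word list.

```python
# Toy form-to-lemma table; a real one would be loaded from a word list file.
FORM_TO_LEMMA = {
    "speaks": "speak",
    "spoke": "speak",
    "spoken": "speak",
    "speaking": "speak",
    "speaker": "speak",
    "speakers": "speak",
}


def lemmatise(form, table=FORM_TO_LEMMA):
    """Return the lemma for a listed form, or None when the form is unknown."""
    return table.get(form.lower())


print(lemmatise("Spoken"))                # speak
print(lemmatise("kulttuuripääkaupunki"))  # None
```

Anything that is not in the table simply falls through, which is the weak spot of this approach for a language that keeps producing new forms.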
Compounds are fairly common:
e.g. kulttuuri+pää+kaupunki = cultural capital
And there are contrived examples where the whole sentence is one word:
e.g. Juokse+n+teli+sin+ko+han? = I wonder if I should run around a bit?
So no look-up tables. There’s gotta be a more sophisticated approach. I shopped around and found a lot of hefty papers reflecting recent research on this problem. I looked through them the way pre-literate children flip through picture books. But my clue+less+ness did not stop me from fiddling with their programs...
Attempt #1
I installed the Morfessor 2 Python package based on:
Creutz, M., & Lagus, K. (2002). Unsupervised discovery of morphemes. In Proceedings of the ACL-02 workshop on Morphological and phonological learning-Volume 6 (pp. 21–30). Association for Computational Linguistics.
Then to train the model you gotta have some Finnish corpus data, so I fetched some from: http://www.corpora.heliohost.org/index.html
I used the newspaper corpus cut down to 100K lines to train the model in a process that could take forever (each iteration goes for 10 minutes, and the training stops when the cost stops decreasing).
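For the record, the steps look roughly like this in Python. This is a sketch, not my exact commands: the helper names and any file paths are made up, and the Morfessor calls (`MorfessorIO.read_corpus_file`, `BaselineModel.load_data`, `train_batch`, `write_binary_model_file`) follow the Morfessor 2.0 library documentation as I understand it, so check them against your installed version.

```python
from itertools import islice


def head_lines(src_path, dst_path, limit=100_000):
    """Copy at most `limit` lines from src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, limit):
            dst.write(line)
            written += 1
    return written


def train_baseline(corpus_path, model_path):
    """Train a Morfessor Baseline model on running text and save it to disk."""
    import morfessor  # third-party package: pip install morfessor

    io = morfessor.MorfessorIO()
    train_data = list(io.read_corpus_file(corpus_path))
    model = morfessor.BaselineModel()
    model.load_data(train_data)
    # Batch training repeats epochs until the cost stops improving enough.
    model.train_batch()
    io.write_binary_model_file(model_path, model)
    return model
```

After trimming the corpus with `head_lines` and training with `train_baseline`, calling `model.viterbi_segment(word)` should give back the best segmentation together with its cost.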
The results are somewhat underwhelming. The output is under-segmented: in the first word below, for example, the very basic ending -ssa was not isolated (the segmentation I expected is shown after the arrow).
kynttilä+n+valossa <- kynttilä+n+valo+ssa
pukeutumis+essa
vaate+liikkeistä+kin
luksus+merkkejä
sonnusta+utune+ella
The under-segmentation bears an uncanny resemblance to the unsupervised results shown in their demo. Except my model just spat ‘kansanedustaja’ back out as ‘kansanedustaja’.
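One way to put a number on the under-segmentation is to compare morph boundaries between the model output and a reference segmentation. The sketch below does that for the kynttilä example; the function names are mine, and the reference is simply the hand segmentation shown above, not a gold standard.

```python
def boundaries(segmentation):
    """Return the character offsets where a '+'-delimited segmentation splits the word."""
    offsets, pos = set(), 0
    for morph in segmentation.split("+")[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets


def boundary_scores(predicted, reference):
    """Return (precision, recall) of predicted boundaries against the reference."""
    pred, ref = boundaries(predicted), boundaries(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(ref) if ref else 1.0
    return precision, recall


precision, recall = boundary_scores("kynttilä+n+valossa", "kynttilä+n+valo+ssa")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every boundary the model proposed is correct, but it found only two of the three reference boundaries. High precision with low recall is what under-segmentation looks like.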
Attempt #2
This time I downloaded the Parole frequency data and formatted it as one form per line together with its count, to be used for the “list training mode”.
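A sketch of that formatting step, with some assumptions spelled out: the input is taken to be (form, count) pairs, which is my stand-in for however the frequency data is actually laid out, and the output is one “count form” line per entry, which is what I understand Morfessor’s list reader to expect. The function names are hypothetical, and the Morfessor calls (`read_corpus_list_file`, `viterbi_nbest`) should be checked against your installed version.

```python
def write_list_file(form_counts, path):
    """Write one '<count> <form>' line per entry; return the number of lines written."""
    lines = 0
    with open(path, "w", encoding="utf-8") as out:
        for form, count in form_counts:
            form = form.strip()
            if not form or count < 1:
                continue  # skip empty forms and non-positive counts
            out.write(f"{count} {form}\n")
            lines += 1
    return lines


def nbest_from_list(list_path, word, n=5):
    """Train on a count/form list and return the n best segmentations of `word`."""
    import morfessor  # third-party package: pip install morfessor

    io = morfessor.MorfessorIO()
    model = morfessor.BaselineModel()
    model.load_data(io.read_corpus_list_file(list_path))
    model.train_batch()
    # Assumed to return the candidate segmentations with their costs, best first.
    return model.viterbi_nbest(word, n)
```

Training from a list is quicker because each distinct form is read once with its count, instead of being read again every time it occurs in running text.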
Certainly a lot faster. What about the results?
kynttilän+valossa
ky+nttilän+valossa
☞kynttilä+n+valossa
k+ynttilän+valossa
kynttilänvalossa
pukeutumisessa
☞pukeutumis+essa
pukeutumis+e+ssa
pukeutumis+es+sa
pukeutumis+ess+a
vaate+liikkeistäkin
☞vaate+liikkeistä+kin
vaateliikkeistäkin
vaateliikkeistä+kin
vaate+liikkeistäk+in
☞luksus+merkkejä
luksusmerkkejä
luksus+me+rkkejä
luksus+m+erkkejä
luk+susmerkkejä
s+onnu+sta+utu+ne+ella
☞son+nu+sta+utu+ne+ella
s+on+nu+sta+utu+ne+ella
s+onnu+sta+utun+e+ella
son+nu+sta+utun+e+ella
kansanedustaja
kansan+edustaja
kansan+edusta+ja
☞kansa+n+edustaja
kans+an+edustaja
ajatellessani
☞ajatellessa+ni
ajatellessa+n+i
a+ja+telle+ssa+ni
ajat+el+les+sani
Now glancing through the n-best suggestions, there’s some hope, though it’s far from ideal. Perhaps I should go and investigate Kohonen et al. (2010)…