FYI: Resources for splitting up compound words

Hello list,

During the telecon I mentioned the problem of compound words in Dutch and
other Germanic languages. In these languages, new words can be created on
the fly by concatenating two (or more!) existing words. These compound
words cannot be found in a dictionary; they have to be split up into their
constituents, which can in turn be looked up. However, this splitting can
be difficult because multiple splits are sometimes possible.
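To make the ambiguity concrete, below is a minimal dictionary-based splitter
in Python. This is my own toy sketch, not one of the algorithms from the
papers listed further down; the tiny word list and the example word are
assumptions chosen only for illustration. It finds two candidate splits for
the Dutch word "valkuil" (pitfall): "val" + "kuil" (trap + pit) and
"valk" + "uil" (falcon + owl).

# Toy dictionary-based splitter (a minimal sketch, not one of the
# published algorithms from the papers below). The word list and the
# example word "valkuil" are only for illustration.

WORDS = {"val", "valk", "uil", "kuil"}

def splits(word, min_len=2):
    """Return every way to segment `word` into known dictionary words."""
    results = []
    if word in WORDS:
        results.append([word])
    # Try every cut point that leaves both parts at least min_len long.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in WORDS:
            results.extend([head] + rest for rest in splits(tail, min_len))
    return results

print(splits("valkuil"))
# [['val', 'kuil'], ['valk', 'uil']] -- two plausible splits; only context
# (or statistics, as in the papers below) can pick the right one.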

Because of the compound word problem, I am against requiring that the
meaning of all words can be found in an associated dictionary at level 1.

Many papers have been written on this topic by researchers in natural
language processing (which happens to be my background as well).
Unfortunately, splitting up compound words is not a problem with an easy
algorithmic solution, as John Slatin had hoped it might be.

I found some online papers that discuss compound words, including how to
split them up, which I thought might be useful for the people working on
the language guidelines:

Compound splitting and lexical recombination for improved performance of a
speech recognition system for German parliamentary speeches:
http://citeseer.nj.nec.com/larson00compound.html

Empirical methods for compound splitting:
http://www.isi.edu/~koehn/publications/compound2003.pdf

Compounds in dictionary-based cross-language information retrieval:
http://informationr.net/ir/7-2/paper128.html

Corpus-driven splitting of compound words:
http://citeseer.nj.nec.com/507649.html

Analysis of Korean compound nouns using statistical information:
http://citeseer.nj.nec.com/229559.html

For the CiteSeer documents, you can view the papers by selecting your
preferred format in the top right corner (or follow the "PDF" link for a
PDF version).

Yvette Hoitink
CEO Heritas, Enschede, the Netherlands
E-mail: y.p.hoitink@heritas.nl

Received on Thursday, 19 February 2004 18:19:01 UTC