- From: Yvette P. Hoitink <y.p.hoitink@heritas.nl>
- Date: Fri, 20 Feb 2004 00:18:54 +0100
- To: "'WAI WCAG List'" <w3c-wai-gl@w3.org>
Hello list,

During the telecon I mentioned the problem of compound words in Dutch and other Germanic languages. In these languages, words can be created on the fly by concatenating two (or more!) existing words. These new compound words cannot be found in a dictionary; they have to be split into their constituents, which can in turn be looked up. However, this splitting can be difficult because sometimes multiple splits are possible.

Because of the compound word problem, I am against requiring at level 1 that the meaning of all words can be found in an associated dictionary.

Many papers have been written about this topic by scientists in the area of natural language processing (which happens to be my background as well). Splitting compound words is not a problem for which an easy algorithm can be found, as John Slatin hoped.

I found some online papers that discuss compound words, including splitting them, which I thought might be useful for the people working on the language guidelines:

Compound splitting and lexical recombination for improved performance of a speech recognition system for German parliamentary speeches:
http://citeseer.nj.nec.com/larson00compound.html

Empirical methods for compound splitting:
http://www.isi.edu/~koehn/publications/compound2003.pdf

Compounds in dictionary-based cross-language information retrieval:
http://informationr.net/ir/7-2/paper128.html

Corpus-driven splitting of compound words:
http://citeseer.nj.nec.com/507649.html

Analysis of Korean compound nouns using statistical information:
http://citeseer.nj.nec.com/229559.html

For the citeseer documents, you can view the papers by selecting your preferred format in the top right corner (or follow the "PDF" link for a PDF version).

Yvette Hoitink
CEO Heritas, Enschede, the Netherlands
E-mail: y.p.hoitink@heritas.nl
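
To illustrate the ambiguity described in the message, here is a minimal sketch of naive dictionary-based splitting. It is not taken from any of the papers listed; the function name `all_splits` and the toy lexicon are assumptions made for illustration only. It also ignores linking elements (such as the Dutch "-s-" or "-en-") and any ranking of candidates, which is where the real difficulty lies.

```python
def all_splits(word, lexicon):
    """Return every way to write `word` as a sequence of lexicon entries."""
    if not word:
        return [[]]  # one way to split the empty string: no parts
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            # Keep this prefix and split the remainder recursively.
            for rest in all_splits(word[i:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon of Dutch words, for illustration only.
LEXICON = {"val", "kuil", "valk", "uil", "massa", "massage", "bed", "gebed"}

for compound in ("valkuil", "massagebed"):
    print(compound, all_splits(compound, LEXICON))
```

With this lexicon, "valkuil" splits both as val + kuil ("pitfall") and as valk + uil ("falcon" + "owl"), and "massagebed" splits both as massage + bed and as massa + gebed ("mass" + "prayer"). A dictionary lookup alone cannot tell which reading is intended.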
Received on Thursday, 19 February 2004 18:19:01 UTC