FYI: Resources for splitting up compound words

From: Yvette P. Hoitink <y.p.hoitink@heritas.nl>
Date: Fri, 20 Feb 2004 00:18:54 +0100
To: "'WAI WCAG List'" <w3c-wai-gl@w3.org>
Message-Id: <E1AtxRZ-0002vg-D1@smtp2.home.nl>

Hello list,

During the telecon I mentioned the problem of compound words in Dutch and
other Germanic languages. In these languages, words can be created on the
fly by concatenating two (or more!) existing words. These new compounded
words cannot be found in a dictionary; they have to be split up into their
constituents, which can in turn be looked up. However, this splitting up can
be difficult because sometimes multiple splits are possible.
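
To make the ambiguity concrete, here is a minimal Python sketch that lists
every way a word can be divided into dictionary entries. The tiny lexicon and
the function name split_compound are assumptions made up for this
illustration; the sketch also ignores linking elements between constituents
and makes no attempt to decide which split is the intended one.

```python
from functools import lru_cache

# Toy lexicon, assumed for illustration only; a real dictionary is far larger.
LEXICON = {"massa", "massage", "bed", "gebed",
           "bom", "melding", "bommel", "ding"}


def split_compound(word, lexicon=LEXICON):
    """Return every way to write `word` as a concatenation of lexicon entries."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def splits(start):
        # Reaching the end of the word means the parts so far form a full split.
        if start == len(word):
            return [[]]
        results = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                # Combine this constituent with every split of the remainder.
                for rest in splits(end):
                    results.append([part] + rest)
        return results

    return splits(0)


for compound in ("massagebed", "bommelding"):
    print(compound, split_compound(compound))
```

With this lexicon, "massagebed" comes back both as massa + gebed and as
massage + bed, and "bommelding" both as bom + melding and as bommel + ding.
Listing the candidates is the simple part; picking the split the author meant
is where the difficulty lies.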

Because of the compound word problem, I am against requiring that the
meaning of all words can be found in an associated dictionary at level 1.

Many papers have been written about this topic by scientists in the area of
natural language processing (which happens to be my background as well).
Contrary to what John Slatin hoped, splitting up compound words is not a
problem for which an easy algorithm can be found.

I found some online papers that discuss compound words, including splitting
them up, which I thought might be useful for the people working on the
language guidelines:

Compound splitting and lexical recombination for improved performance of a
speech recognition system for German parliamentary speeches:

Empirical methods for compound splitting:

Compounds in dictionary-based cross-language information retrieval:

Corpus-driven splitting of compound words:

Analysis of Korean compound nouns using statistical information:

For the citeseer documents, you can view the papers by selecting your
preferred format in the top right corner (or follow the "PDF" link for a PDF

Yvette Hoitink
CEO Heritas, Enschede, the Netherlands
E-mail: y.p.hoitink@heritas.nl
Received on Thursday, 19 February 2004 18:19:01 UTC
