RE: word breaking CJK languages from Yves Arrouye on 2000-04-12 (www-international@w3.org from April to June 2000)

From: Yves Arrouye <yves@realnames.com>
Date: Wed, 12 Apr 2000 16:08:47 -0700
To: "'Stockett, Jeff'" <stockett@quadralay.com>, "'www-international@w3.org'" <www-international@w3.org>
Message-ID: <7D28C07629C9D211A1CC00500403ADD402135D1B@email.centraal.com>

> Can anyone point me to books/RFCs/websites that explain the proper
> way to break words for building a full text search database when
> parsing HTML/XML in any of the following MBCS encodings:
> 
> UTF-8
> GB2312
> Shift-JIS
> EUC-KR
> Big5

I think word breaking depends on the script, not on the encoding. If you
have Japanese in UTF-8 or in Shift-JIS, it does not matter: you still have
to do the same thing (N-grams or morphological analysis). The same goes for
English words in the ASCII subset of both encodings: they are broken
according to English rules (or by a generic but simpler rule such as "on
white space and punctuation"; yes, I know that is simplistic).
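
To make the point concrete, here is a rough Python sketch, not a
production word breaker. It decodes the bytes to Unicode first, so the
encoding drops out, then picks a rule per script: overlapping bigrams
(N-grams with N=2) for CJK runs, and white space/punctuation splitting for
everything else. The names is_cjk and tokenize are made up for this
example, the code point ranges are a deliberately small subset, and
treating Hangul like Han and kana is my assumption, not a recommendation.

```python
import re


def is_cjk(ch):
    """Hypothetical helper: rough test for Han, kana, and Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF     # CJK Unified Ideographs (basic block only)
        or 0x3040 <= cp <= 0x30FF  # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3  # Hangul syllables
    )


def tokenize(data, encoding):
    """Decode bytes, then break words by script rather than by encoding."""
    text = data.decode(encoding)  # after this line the encoding is irrelevant
    tokens = []
    # \w+ keeps letters and digits together; white space and punctuation
    # act as separators (the "simplistic" generic rule).
    for run in re.findall(r"\w+", text):
        start = 0
        for i in range(1, len(run) + 1):
            # Close the current piece at the end of the run or when the
            # script class (CJK vs. not CJK) changes.
            if i == len(run) or is_cjk(run[i]) != is_cjk(run[start]):
                piece = run[start:i]
                if not is_cjk(piece[0]):
                    tokens.append(piece.lower())
                elif len(piece) == 1:
                    tokens.append(piece)
                else:
                    # Overlapping bigrams stand in for real morphological
                    # analysis, which needs a dictionary.
                    tokens.extend(piece[j:j + 2] for j in range(len(piece) - 1))
                start = i
    return tokens


sample = "日本語 search, engine"
from_utf8 = tokenize(sample.encode("utf-8"), "utf-8")
from_sjis = tokenize(sample.encode("shift_jis"), "shift_jis")
print(from_utf8 == from_sjis)
```

The same text encoded as UTF-8 and as Shift-JIS yields the same token
list, because all the decisions are made on decoded characters.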

Please summarize for the list whatever information you get. Thanks!

YA

PS: the Unicode mailing list from the Unicode Consortium (Internet Keyword:
Unicode Consortium) may be of help too.

Received on Wednesday, 12 April 2000 19:09:00 UTC