RE: word breaking CJK languages

> Can any one point me to books/RFCs/websites that explain the proper
> way to break words for building a full text search database when
> parsing HTML/XML in any of the following MBCS encodings:
> 
> UTF-8
> GB2312
> Shift-JIS
> EUC-KR
> Big5

I think word breaking depends on the script, not on the encoding. If you
have Japanese text in UTF-8 or in Shift-JIS, it does not matter: you still
have to do the same thing (N-grams or morphological analysis). The same goes
for English words in the ASCII subset of both encodings: the words are
broken according to English rules (or a generic but simpler rule such as
"on white space and punctuation"; yes, I know it's simplistic).
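
To make the point concrete, here is a minimal Python sketch. It is only an
illustration under my own assumptions, not a reference implementation: the
names is_cjk, char_class and tokenize are made up for this example, the code
point ranges are a rough approximation of the CJK scripts, and overlapping
bigrams stand in for the N-gram approach (a real system might use
morphological analysis instead). The encoding is used once, to decode the
bytes; everything after that looks only at the characters.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough script test (an approximation, not a full Unicode script lookup)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana blocks
            or 0xAC00 <= cp <= 0xD7A3)   # Hangul syllables


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def tokenize(raw, encoding):
    """Decode bytes, then break words by script, ignoring the encoding."""
    text = raw.decode(encoding)  # the only encoding-dependent step
    tokens = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            # Simplistic rule: split on white space and punctuation.
            tokens.append(run.lower())
        elif cls == "cjk":
            if len(run) == 1:
                tokens.append(run)
            else:
                # Overlapping bigrams (N-grams with N = 2).
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


sample = "日本語 text"
print(tokenize(sample.encode("utf-8"), "utf-8"))
print(tokenize(sample.encode("shift_jis"), "shift_jis"))
```

Both calls print the same token list, because the two byte sequences decode
to the same characters before any word breaking happens.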

Please post a summary of what you find to the list. Thanks!

YA

PS: the Unicode mailing list from the Unicode Consortium (Internet Keyword:
Unicode Consortium) may be of help too.

Received on Wednesday, 12 April 2000 19:09:00 UTC