- From: Yves Arrouye <yves@realnames.com>
- Date: Wed, 12 Apr 2000 16:08:47 -0700
- To: "'Stockett, Jeff'" <stockett@quadralay.com>, "'www-international@w3.org'" <www-international@w3.org>
> Can any one point me to books/RFCs/websites that explain the proper
> way to break words for building a full text search database
> when parsing HTML/XML in any of the following MBCS encodings:
>
> UTF-8
> GB2312
> Shift-JIS
> EUC-KR
> Big5

The word breaking will not depend on the encoding, but on the scripts, I
think. I mean, if you have Japanese in UTF-8 or in Shift-JIS, it does not
matter: you'll still have to do the same thing (N-grams or morphological
analysis). Same for English words in the ASCII subset of both encodings: the
words will be broken according to English rules (or a generic but simpler
rule like "on white space and punctuation"---yes, I know it's simplistic).

Please summarize the info you get to the list. Thanks!

YA

PS: the Unicode mailing list from the Unicode Consortium (Internet Keyword:
Unicode Consortium) may be of help too.
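
The point about scripts versus encodings can be sketched in Python. This is only an illustration under stated assumptions, not a recommended indexer: the helper names (is_cjk, char_class, tokenize) are hypothetical, the script test covers only Han ideographs and kana by code point range, and overlapping bigrams stand in for the N-gram approach. The bytes are decoded to Unicode first, so the same tokens come out whether the input was UTF-8 or Shift-JIS.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough script test (assumption): Han ideographs, Hiragana, Katakana only."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )


def char_class(ch):
    """Classify one character as 'cjk', 'word', or 'sep' (separator)."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"  # white space, punctuation, everything else


def tokenize(text, n=2):
    """Break decoded text into index terms, choosing the rule per script.

    Non-CJK runs are split on white space and punctuation (the simplistic
    rule); CJK runs are cut into overlapping n-grams.
    """
    tokens = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            tokens.append(run.lower())
        elif cls == "cjk":
            if len(run) < n:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


sample = "Unicode 検索エンジン test"

# The same text in two encodings: decode first, then apply the same rules.
from_utf8 = tokenize(sample.encode("utf-8").decode("utf-8"))
from_sjis = tokenize(sample.encode("shift_jis").decode("shift_jis"))

print(from_utf8)
# → ['unicode', '検索', '索エ', 'エン', 'ンジ', 'ジン', 'test']
print(from_utf8 == from_sjis)
# → True
```

Morphological analysis would replace the bigram branch with a dictionary-based segmenter; the encoding-independent structure stays the same.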
Received on Wednesday, 12 April 2000 19:09:00 UTC