RE: word breaking CJK languages

From: Jeff Halperin <jeff@basistech.com>
Date: Thu, 27 Apr 2000 17:07:30 -0400
Message-ID: <21F87348A4D9D311B4200090276AEBD91A31C7@ginza.basistech.com>
To: "'www-international@w3.org'" <www-international@w3.org>
Cc: Amy Muntz <Amy@basistech.com>

If you decide to evaluate products that handle this problem, my company
offers a Chinese Morphological Analyzer and a Japanese Morphological Analyzer.
Product information can be found at http://www.basistech.com/products/.
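
For comparison, a common dictionary-free baseline for CJK full-text indexing
is to decode the bytes to Unicode and index overlapping character bigrams.
The sketch below is only an illustration of that baseline, not a description
of how any morphological analyzer works. The names CODECS, is_cjk, char_class
and index_terms are hypothetical, and the code point ranges are a rough
approximation that ignores extension blocks and halfwidth forms.

```python
from itertools import groupby

# Python codec names for the encodings listed in the question (assumed mapping).
CODECS = {
    "UTF-8": "utf-8",
    "GB2312": "gb2312",
    "Shift-JIS": "shift_jis",
    "EUC-KR": "euc_kr",
    "Big5": "big5",
}


def is_cjk(ch):
    """Rough test for Han ideographs, kana, and Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def index_terms(raw, encoding):
    """Decode raw bytes and return index terms.

    Runs of CJK characters become overlapping bigrams (a lone character is
    kept as-is); runs of other alphanumerics become lowercased words;
    everything else is treated as a separator and dropped.
    """
    text = raw.decode(CODECS[encoding])
    terms = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            terms.append(run.lower())
        elif cls == "cjk":
            if len(run) == 1:
                terms.append(run)
            else:
                terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms
```

Bigrams over-generate terms (many are not real words), which costs index
size and precision but needs no dictionary. Queries have to be split into
bigrams the same way so that they match the indexed terms.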

>Can anyone point me to books/RFCs/websites that explain the proper
>way to break words for building a full text search database when parsing
>HTML/XML in any of the following MBCS encodings:

>UTF-8
>GB2312
>Shift-JIS
>EUC-KR
>Big5

>Thanks,  Jeff


> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
> Jeff Halperin           One Kendall Square      Tel: 617-252-5636
> Basis Technology Corp.  Cambridge, MA 02139     Fax: 617-252-9150
> jeff@basistech.com      U.S.A.                  www.basistech.com 
> 
Received on Thursday, 27 April 2000 17:07:12 GMT