Frank,

> Not really. A 4-gram in byte space will be ~ half a pair of DBCS
> characters and ~ half of one DBCS characters with one trail byte of the
> previous character and a lead byte of the next character. This won't be
> a big deal in the case of Shift-JIS or Big5 since their lead byte and
> trail byte use quite different range. However for GB2312, EUC-KR,
> EUC-TW, EUC-JP, it is big problem since the lead byte and trail byte for
> the most common characters are using the same range (0xa1-0xfe).

The byte-based N-gram algorithm is based on statistical byte patterns,
and it does not need to understand character boundaries at all. For the
purposes of N-gram analysis, Japanese text in EUC-JP and Japanese text in
Shift_JIS are treated as though they were different languages. In other
words, the algorithm detects a combination of language and character
encoding.

-- 
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/

Received on Thursday, 11 November 2004 22:40:30 GMT
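
To illustrate the point made in the message above, here is a minimal
sketch in Python. It is not the algorithm of any particular detector: the
function names (byte_ngrams, overlap, detect), the one-sentence training
text, and the simple shared-count score are all assumptions chosen for
brevity. It only shows that 4-grams taken over raw bytes, with no
knowledge of character boundaries, give the same Japanese sentence
separate profiles in EUC-JP and in Shift_JIS, so each language and
encoding pair can be treated as its own class.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every n-byte window, ignoring character boundaries."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def overlap(a: Counter, b: Counter) -> int:
    """Number of n-gram occurrences the two profiles have in common."""
    return sum((a & b).values())


def detect(sample: bytes, profiles: dict, n: int = 4) -> str:
    """Return the label whose profile shares the most n-grams with sample."""
    grams = byte_ngrams(sample, n)
    return max(profiles, key=lambda label: overlap(grams, profiles[label]))


# Hypothetical training text; a real detector would train on a large corpus.
training = "日本語の文章は文字の境界を知らなくても判別できます。"

# One profile per (language, encoding) combination.
profiles = {
    "ja/EUC-JP": byte_ngrams(training.encode("euc_jp")),
    "ja/Shift_JIS": byte_ngrams(training.encode("shift_jis")),
}

sample = "日本語の文章".encode("shift_jis")
print(detect(sample, profiles))
```

With such a tiny training text the score is only meaningful for input that
resembles it; the sketch is meant to show the shape of the approach, not
its accuracy.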