- From: KUROSAKA Teruhiko <kuro@bhlab.com>
- Date: Thu, 11 Nov 2004 14:39:59 -0800
- To: Frank Yung-Fong Tang <ytang0648@aol.com>
- CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
Frank,

> Not really. A 4-gram in byte space will be ~ half a pair of DBCS
> characters and ~ half of one DBCS characters with one trail byte of the
> previous character and a lead byte of the next character. This won't be
> a big deal in the case of Shift-JIS or Big5 since their lead byte and
> trail byte use quite different range. However for GB2312, EUC-KR,
> EUC-TW, EUC-JP, it is big problem since the lead byte and trail byte for
> the most common characters are using the same range (0xa1-0xfe).

The byte-based N-gram algorithm relies on statistical byte patterns, and it does not need to understand character boundaries at all. For the purposes of N-gram detection, Japanese text in EUC-JP and Japanese text in Shift_JIS are treated as though they were different languages. In other words, the algorithm detects the combination of a language and a character encoding.

--
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/
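To illustrate the point in the message above, here is a minimal Python sketch of byte-based N-gram detection in which each (language, encoding) pair is its own class. It is not the algorithm of any particular detector: the function names (`byte_ngrams`, `build_profiles`, `detect`), the tiny training sentence, and the simple count-overlap score are all assumptions made for the example. A real detector would train on a large corpus and would likely use a more careful distance measure.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every overlapping n-byte window, ignoring character boundaries."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def build_profiles(samples: dict, n: int = 4) -> dict:
    """Build one N-gram profile per (language, encoding) label.

    `samples` maps a (language, encoding) tuple to training text; the text
    is encoded with that encoding before the byte N-grams are counted.
    """
    return {
        label: byte_ngrams(text.encode(label[1]), n)
        for label, text in samples.items()
    }

def detect(data: bytes, profiles: dict, n: int = 4):
    """Return the (language, encoding) label whose profile overlaps most."""
    grams = byte_ngrams(data, n)

    def overlap(profile: Counter) -> int:
        # Illustrative score only: shared N-gram counts (an assumption,
        # not a tuned statistical measure).
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(profiles, key=lambda label: overlap(profiles[label]))

# The same Japanese sentence yields two separate classes, one per encoding.
training_text = "日本語のテキストは文字コードによってバイト列が異なります。"
profiles = build_profiles({
    ("ja", "euc_jp"): training_text,
    ("ja", "shift_jis"): training_text,
})

unknown = "日本語のテキスト".encode("shift_jis")
print(detect(unknown, profiles))
```

Because the profiles are built from raw bytes, the EUC-JP and Shift_JIS versions of the same sentence share no 4-grams in this example, so the classifier separates them without ever decoding a character.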
Received on Thursday, 11 November 2004 22:40:30 UTC