- From: Frank Yung-Fong Tang <ytang0648@aol.com>
- Date: Thu, 11 Nov 2004 18:00:07 -0500
- To: kuro@sonic.net
- cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
KUROSAKA Teruhiko wrote on 11/11/2004, 5:39 PM:

> Frank,
>
> > Not really. A 4-gram in byte space will be ~ half a pair of DBCS
> > characters and ~ half of one DBCS character with one trail byte of the
> > previous character and a lead byte of the next character. This won't be
> > a big deal in the case of Shift-JIS or Big5 since their lead byte and
> > trail byte use quite different ranges. However for GB2312, EUC-KR,
> > EUC-TW, EUC-JP, it is a big problem since the lead byte and trail byte
> > for the most common characters are using the same range (0xa1-0xfe).
>
> The byte-based N-gram algorithm is based on the statistical
> byte patterns, and it does not need to understand the character
> boundaries at all. For the purpose of N-gram, the Japanese
> text in EUC-JP and Japanese text in Shift_JIS are treated
> as though they were different languages. In other words,
> the algorithm detects a language and the character encoding
> combination.

Well, I understand what you said, but I don't buy it. I know that the 4-gram approach treats Shift_JIS and EUC-JP differently, and that they come from different model sets. My point is that because the 4-gram does not know the character boundaries, about 50% of what it counts when it builds the statistics is noise. Some people may believe that does not matter, but I believe it does.

The problem won't surface when you try to distinguish Shift_JIS from EUC-JP, because their encoding structures are very different. But it will show up when you try to distinguish among EUC-JP, GB2312, EUC-KR, and EUC-TW. The same applies to distinguishing among ISO-2022-JP, ISO-2022-CN, and ISO-2022-KR (both bytes fall into the 0x21-0x7e range).

What I believe is that a system which "does not need to understand the character boundaries at all" won't give good results among EUC-JP, GB2312, EUC-KR, and EUC-TW. It is, however, probably useful for detecting all single-byte encodings, and DBCS encodings whose lead byte and trail byte do not share the same range (for example, Shift_JIS and Big5).
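To make the "about 50%" figure concrete, here is a minimal Python sketch. It is only an illustration, not anyone's actual detector: the sample string and the helper names (`char_start_offsets`, `misaligned_fraction`) are my own assumptions. It slides a 4-byte window over encoded text and counts how many windows start in the middle of a double-byte character.

```python
def char_start_offsets(text, encoding):
    """Return the byte offsets at which each character starts.

    Assumes a stateless encoding (e.g. EUC-JP, Shift_JIS), so each
    character can be encoded on its own to learn its byte length.
    """
    starts = set()
    offset = 0
    for ch in text:
        starts.add(offset)
        offset += len(ch.encode(encoding))
    return starts


def misaligned_fraction(text, encoding, n=4):
    """Count byte n-grams that do not start on a character boundary.

    Returns (misaligned, total).
    """
    data = text.encode(encoding)
    starts = char_start_offsets(text, encoding)
    total = len(data) - n + 1
    misaligned = sum(1 for i in range(total) if i not in starts)
    return misaligned, total


# Hypothetical sample: 11 characters, all double-byte in both encodings.
SAMPLE = "日本語の文字コード判別"

for enc in ("euc_jp", "shift_jis"):
    bad, total = misaligned_fraction(SAMPLE, enc)
    print(enc, "misaligned 4-grams:", bad, "of", total)

# In EUC-JP every byte of these characters lies in 0xa1-0xfe, so a
# misaligned window looks just like a run of valid characters.
print(all(0xA1 <= b <= 0xFE for b in SAMPLE.encode("euc_jp")))
```

For pure double-byte text the misaligned share is close to one half in both encodings. The difference is that in Shift_JIS the lead and trail bytes mostly fall in different ranges, so the shifted windows look different from the aligned ones, while in the EUC family they are indistinguishable by range alone.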
If you have a system that ignores character boundaries in this way, I am almost certain that its precision and recall rates for EUC-JP, GB2312, EUC-KR, and EUC-TW are lower than its rates for Shift_JIS, Big5, UTF-8, and all the single-byte encodings.

> --
> KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
> Internationalization Consultant
> http://www.bhlab.com/
>
Received on Thursday, 11 November 2004 23:00:45 UTC