- From: Frank Yung-Fong Tang <ytang0648@aol.com>
- Date: Thu, 11 Nov 2004 18:00:07 -0500
- To: kuro@sonic.net
- cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
KUROSAKA Teruhiko wrote on 11/11/2004, 5:39 PM:
> Frank,
>
> > Not really. A 4-gram in byte space will be ~ half a pair of DBCS
> > characters and ~ half of one DBCS characters with one trail byte of the
> > previous character and a lead byte of the next character. This won't be
> > a big deal in the case of Shift-JIS or Big5 since their lead byte and
> > trail byte use quite different range. However for GB2312, EUC-KR,
> > EUC-TW, EUC-JP, it is big problem since the lead byte and trail byte
> > for the most common characters are using the same range (0xa1-0xfe).
>
> The byte-based N-gram algorithm is based on the statistical
> byte patterns, and it does not need to understand the character
> boundaries at all. For the purpose of N-gram, the Japanese
> text in EUC-JP and Japanese text in Shift_JIS are treated
> as though they were different languages. In other words,
> the algorithm detects a language and the character encoding
> combination.
Well, I understand what you are saying, but I don't buy it. I know that
the 4-gram approach treats Shift_JIS and EUC-JP differently, and I know
that the two come from different model sets. My point is that because
the 4-gram does not know the character boundaries, about 50% of what it
collects when it builds the statistics is noise. Some people may believe
that does not matter, but I believe it does. The problem won't surface
when you try to distinguish Shift_JIS from EUC-JP, because their
encoding structures are very different. But it will show up when you try
to distinguish among EUC-JP, GB2312, EUC-KR, and EUC-TW. The same goes
for distinguishing among ISO-2022-JP, ISO-2022-CN, and ISO-2022-KR
(both bytes fall into the 0x21-0x7e range).
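
To make the "50%" figure concrete, here is a minimal Python sketch that
counts how many byte 4-grams start on a character boundary and how many
start in the middle of a character. It is only an illustration, not the
detector under discussion: the sample string and the function names
(char_start_offsets, count_alignment) are my own assumptions, and it
relies on Python's standard euc_jp and shift_jis codecs.

```python
# Illustrative sketch only. SAMPLE is an assumed test string:
# a short Japanese phrase meaning "Japanese character codes",
# written with escapes so the source file stays ASCII.
SAMPLE = "\u65e5\u672c\u8a9e\u306e\u6587\u5b57\u30b3\u30fc\u30c9"


def char_start_offsets(text, encoding):
    """Return the byte offsets at which each character begins.

    Encoding one character at a time is safe here because EUC-JP and
    Shift_JIS are stateless (this would not hold for ISO-2022-JP).
    """
    offsets = set()
    pos = 0
    for ch in text:
        offsets.add(pos)
        pos += len(ch.encode(encoding))
    return offsets


def count_alignment(text, encoding, n=4):
    """Count byte n-grams that start on / off a character boundary."""
    data = text.encode(encoding)
    starts = char_start_offsets(text, encoding)
    total = max(len(data) - n + 1, 0)
    aligned = sum(1 for i in range(total) if i in starts)
    return aligned, total - aligned


if __name__ == "__main__":
    for enc in ("euc_jp", "shift_jis"):
        aligned, misaligned = count_alignment(SAMPLE, enc)
        print(enc, "aligned:", aligned, "misaligned:", misaligned)
```

For text made only of two-byte characters, nearly half of the 4-grams
start on a trail byte in either encoding. The difference is that in
EUC-JP every byte of the sample is in 0xa1-0xfe, so nothing in the byte
values marks a misaligned 4-gram as misaligned.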
What I believe is that a system which "does not need to understand the
character boundaries at all" won't give good results when it has to
distinguish among EUC-JP, GB2312, EUC-KR, and EUC-TW. It is, however,
probably useful for detecting all single-byte encodings, as well as
DBCS encodings whose lead byte and trail byte do not share the same
range (for example, Shift_JIS and Big5). If you have such a system, I am
almost certain that its precision and recall rates for EUC-JP, GB2312,
EUC-KR, and EUC-TW are lower than its rates for Shift_JIS, Big5, UTF-8,
and all the single-byte encodings.
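
The range overlap I am describing can be checked with a small Python
sketch. It measures, for a few sample strings, what share of lead bytes
and of trail bytes fall in 0xa1-0xfe. The sample strings and the
function name high_range_share are assumptions of mine for
illustration, and EUC-TW is left out only because Python's standard
library has no codec for it.

```python
# Illustrative sketch only. The sample strings are assumptions chosen
# so that every character is two bytes in the listed encoding.
JA = "\u65e5\u672c\u8a9e\u306e\u6587\u5b57"   # Japanese sample
SAMPLES = {
    "euc_jp": JA,
    "shift_jis": JA,
    "gb2312": "\u4e2d\u6587\u5b57\u7b26",      # Simplified Chinese sample
    "euc_kr": "\ud55c\uad6d\uc5b4",            # Korean sample
    "big5": "\u4e2d\u6587\u5b57",              # Traditional Chinese sample
}


def high_range_share(text, encoding, lo=0xA1, hi=0xFE):
    """Return (lead_share, trail_share): the fraction of lead bytes and
    of trail bytes of two-byte characters that lie in [lo, hi]."""
    leads, trails = [], []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2:          # skip anything not double-byte
            leads.append(encoded[0])
            trails.append(encoded[1])

    def share(values):
        if not values:
            return 0.0
        return sum(1 for v in values if lo <= v <= hi) / len(values)

    return share(leads), share(trails)


if __name__ == "__main__":
    for enc, text in SAMPLES.items():
        lead_share, trail_share = high_range_share(text, enc)
        print(enc, "lead:", lead_share, "trail:", trail_share)
```

When both shares are 1.0, as for the EUC samples, a byte taken out of
context cannot be identified as a lead or a trail byte from its value
alone, which is the situation I expect to hurt a boundary-unaware model.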
> --
> KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
> Internationalization Consultant
> http://www.bhlab.com/
>
Received on Thursday, 11 November 2004 23:00:45 UTC