Re: What is a language detection algorithm? from Frank Yung-Fong Tang on 2004-11-11 (www-international@w3.org from October to December 2004)

From: Frank Yung-Fong Tang <ytang0648@aol.com>
Date: Thu, 11 Nov 2004 18:00:07 -0500
To: kuro@sonic.net
cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
Message-ID: <4193EEF7.10805@aol.com>

KUROSAKA Teruhiko wrote on 11/11/2004, 5:39 PM:

 > Frank,
 >
 > > Not really. A 4-gram in byte space will be ~ half a pair of DBCS
 > > characters and ~ half of one DBCS character with one trail byte of the
 > > previous character and a lead byte of the next character. This won't be
 > > a big deal in the case of Shift-JIS or Big5 since their lead byte and
 > > trail byte use quite different ranges. However for GB2312, EUC-KR,
 > > EUC-TW, EUC-JP, it is a big problem since the lead byte and trail byte
 > > for the most common characters use the same range (0xa1-0xfe).
 >
 > The byte-based N-gram algorithm is based on the statistical
 > byte patterns, and it does not need to understand the character
 > boundaries at all.  For the purpose of N-gram, the Japanese
 > text in EUC-JP and Japanese text in Shift_JIS are treated
 > as though they were different languages.  In other words,
 > the algorithm detects a language and the character encoding
 > combination.

Well, I understand what you are saying, but I don't buy it. I know that
the 4-gram approach treats Shift_JIS and EUC-JP differently, and that
the two come from different model sets. My point is that because a
byte-based 4-gram does not know the character boundaries, about 50% of
the 4-grams it counts when it builds the statistics are noise. Some
people may believe that does not matter, but I believe it does. The
problem won't surface when you try to distinguish Shift_JIS from EUC-JP,
because their encoding structures are very different. But when you try
to distinguish among EUC-JP, GB2312, EUC-KR, and EUC-TW, the problem
will show up. The same goes for distinguishing among ISO-2022-JP,
ISO-2022-CN, and ISO-2022-KR, where both bytes fall into the 0x21-0x7e
range.
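
To make the 50% figure concrete, here is a minimal sketch (not anyone's
actual detector) that counts how many byte 4-grams of a double-byte text
start on a character boundary. The function name ngram_alignment and the
sample string are hypothetical, chosen only for illustration, and the
sketch assumes a stateless encoding such as EUC-JP or Shift_JIS.

```python
def ngram_alignment(text: str, encoding: str, n: int = 4):
    """Count byte n-grams that start on a character boundary.

    Assumes a stateless encoding (EUC-JP, Shift_JIS, GB2312, ...), so that
    encoding one character at a time yields the same bytes as encoding the
    whole string. This does not hold for the ISO-2022 family.
    Returns (aligned, total).
    """
    data = text.encode(encoding)

    # Record the byte offset at which each character starts.
    boundaries = set()
    offset = 0
    for ch in text:
        boundaries.add(offset)
        offset += len(ch.encode(encoding))

    # A sliding window of n bytes has len(data) - n + 1 start positions.
    total = max(len(data) - n + 1, 0)
    aligned = sum(1 for start in range(total) if start in boundaries)
    return aligned, total


sample = "日本語の文章"  # six characters, two bytes each in EUC-JP
aligned, total = ngram_alignment(sample, "euc_jp")
print(f"{aligned} of {total} 4-grams start on a character boundary")
# prints "5 of 9 4-grams start on a character boundary"
```

For pure double-byte text every other window starts in the middle of a
character, so the misaligned share approaches one half as the text gets
longer. Those are the windows made of a trail byte, one whole character,
and a lead byte.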

What I believe is that a system which "does not need to understand the
character boundaries at all" won't give good results when distinguishing
among EUC-JP, GB2312, EUC-KR, and EUC-TW. It is, however, probably
useful for detecting all single-byte encodings, and DBCS encodings whose
lead byte and trail byte do not share the same range (for example,
Shift_JIS and Big5). If you have such a system, I am almost certain that
its precision and recall rates for EUC-JP, GB2312, EUC-KR, and EUC-TW
are lower than its rates for Shift_JIS, Big5, UTF-8, and all the
single-byte encodings.
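
The point about shared byte ranges can be checked on small samples. The
sketch below is only an illustration under stated assumptions: the
helper name bytes_in_range and the sample strings are hypothetical, and
it inspects only the characters given, not the full code tables. EUC-TW
is left out because, as far as I know, Python's standard codecs do not
include it.

```python
def bytes_in_range(text: str, encoding: str, lo: int = 0xA1, hi: int = 0xFE):
    """For two-byte characters, count lead and trail bytes inside [lo, hi].

    Returns (leads_in_range, trails_in_range, two_byte_chars).
    """
    leads = trails = count = 0
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) != 2:
            continue  # skip ASCII and characters of any other length
        count += 1
        leads += lo <= encoded[0] <= hi
        trails += lo <= encoded[1] <= hi
    return leads, trails, count


samples = [
    ("日本語", "euc_jp"),
    ("日本語", "shift_jis"),
    ("中文", "gb2312"),
    ("한국어", "euc_kr"),
]
for text, enc in samples:
    print(enc, bytes_in_range(text, enc))
```

In the EUC-JP, GB2312, and EUC-KR samples every lead byte and every
trail byte lands in 0xa1-0xfe, so a byte's value says nothing about
whether a window is aligned. In the Shift_JIS sample none of the lead
bytes fall in that range, which gives a byte-based model at least some
positional information for free.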



 > --
 > KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
 > Internationalization Consultant
 > http://www.bhlab.com/
 >

Received on Thursday, 11 November 2004 23:00:45 UTC