RE: What is a language detection algorithm?

From: Frank Yung-Fong Tang <ytang0648@aol.com>
Date: Thu, 11 Nov 2004 15:58:33 -0500
To: "Asmus Freytag" <asmusf@ix.netcom.com>
cc: aphillips@webmethods.com, "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
Message-ID: <4193D279.9020008@aol.com>



Asmus Freytag wrote on 11/9/2004, 1:55 AM:

 > Note that their approach used n-grams in byte space. A 4-gram would be
 > just a pair of DBCS characters, a 2-gram would effectively be a
 > frequency table.

Not really. In byte space, only about half of the 4-grams will be a pair
of DBCS characters. The other half will straddle character boundaries:
the trail byte of the previous character, one whole DBCS character, and
the lead byte of the next character. This won't be a big deal in the
case of Shift-JIS or Big5, since their lead bytes and trail bytes use
quite different ranges. However, for GB2312, EUC-KR, EUC-TW, and EUC-JP
it is a big problem, since the lead byte and trail byte of the most
common characters use the same range (0xa1-0xfe).
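
To make the alignment point concrete, here is a minimal sketch in Python.
The helper names (byte_ngrams, in_euc_range) and the sample string are my
own assumptions for illustration; they are not taken from the approach
being discussed. It slides a 4-byte window over GB2312 text and splits
the windows by whether they start on a character boundary.

```python
def byte_ngrams(data: bytes, n: int = 4) -> list[bytes]:
    """Return every n-byte window of data, stepping one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def in_euc_range(b: int) -> bool:
    """True if the byte lies in 0xA1-0xFE, the range shared by lead and trail bytes."""
    return 0xA1 <= b <= 0xFE


# Hypothetical sample: six common characters, two bytes each in GB2312.
text = "中文信息处理"
encoded = text.encode("gb2312")

grams = byte_ngrams(encoded, 4)

# Windows at even byte offsets cover exactly two whole characters.
aligned = [g for i, g in enumerate(grams) if i % 2 == 0]
# Windows at odd offsets are: trail byte + one whole character + lead byte.
straddling = [g for i, g in enumerate(grams) if i % 2 == 1]

# Every byte is in the same range, so a byte-range check alone cannot
# tell an aligned window from a straddling one.
same_range = all(in_euc_range(b) for b in encoded)

print(len(aligned), len(straddling), same_range)
```

With this sample the windows split roughly evenly between the two kinds,
and since every byte falls in 0xa1-0xfe, nothing in the bytes themselves
marks which windows are misaligned.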
Received on Thursday, 11 November 2004 20:59:13 GMT