Re: What is a language detection algorithm? from KUROSAKA Teruhiko on 2004-11-11 (www-international@w3.org from October to December 2004)

From: KUROSAKA Teruhiko <kuro@bhlab.com>
Date: Thu, 11 Nov 2004 14:39:59 -0800
To: Frank Yung-Fong Tang <ytang0648@aol.com>
CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
Message-ID: <4193EA3F.3010306@bhlab.com>

Frank,

> Not really. A 4-gram in byte space will be ~ half a pair of DBCS
> characters and ~ half one DBCS character with one trail byte of the
> previous character and a lead byte of the next character. This won't be
> a big deal in the case of Shift-JIS or Big5 since their lead byte and
> trail byte use quite different ranges. However for GB2312, EUC-KR,
> EUC-TW, EUC-JP, it is a big problem since the lead byte and trail byte for
> the most common characters use the same range (0xa1-0xfe).

The byte-based N-gram algorithm is based on statistical
byte patterns, and it does not need to understand character
boundaries at all.  For the purpose of N-gram, Japanese
text in EUC-JP and Japanese text in Shift_JIS are treated
as though they were different languages.  In other words,
the algorithm detects a combination of language and
character encoding.
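
To make the point concrete, here is a minimal Python sketch of the
idea, not any particular product's implementation.  It assumes one
byte N-gram profile per (language, encoding) pair and uses cosine
similarity as the scoring function, which is only one possible
choice.  The names byte_ngrams, cosine, train and detect, and the
tiny training samples, are hypothetical and chosen for illustration.
A real detector would train on far larger corpora.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every byte N-gram of length n_min..n_max.

    The window slides one byte at a time, so N-grams freely straddle
    character boundaries; no decoding is ever attempted.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two N-gram count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples: dict) -> dict:
    """Build one profile per (language, encoding) label.

    Each label is treated as an independent class, exactly as if
    EUC-JP Japanese and Shift_JIS Japanese were different languages.
    """
    return {label: byte_ngrams(data) for label, data in samples.items()}


def detect(data: bytes, profiles: dict) -> tuple:
    """Return the (language, encoding) label whose profile scores highest."""
    target = byte_ngrams(data)
    return max(profiles, key=lambda label: cosine(target, profiles[label]))


# Hypothetical, deliberately tiny training samples.
JA_SAMPLE = "これは日本語の文章です。言語と文字コードを判定します。"
EN_SAMPLE = "This is a sentence in English."

profiles = train({
    ("ja", "euc_jp"): JA_SAMPLE.encode("euc_jp"),
    ("ja", "shift_jis"): JA_SAMPLE.encode("shift_jis"),
    ("en", "ascii"): EN_SAMPLE.encode("ascii"),
})

if __name__ == "__main__":
    unknown = "日本語の文章を判定します。".encode("shift_jis")
    print(detect(unknown, profiles))
```

The same Japanese sentence yields two unrelated byte profiles, one
per encoding, so the classifier reports the language and the encoding
together without ever locating a lead byte or a trail byte.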

-- 
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/

Received on Thursday, 11 November 2004 22:40:30 UTC