Re: What is a language detection algorithm? from KUROSAKA Teruhiko on 2004-11-04 (www-international@w3.org from October to December 2004)

From: KUROSAKA Teruhiko <kuro@bhlab.com>
Date: Thu, 04 Nov 2004 00:01:41 -0700
To: www-international@w3.org
CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>
Message-ID: <4189D3D5.4040600@bhlab.com>

One way to detect or infer a language (and, as a by-product, the character
encoding) is to use N-grams.  This technique relies on statistics of
particular byte combinations that are likely to appear
in a given language (and encoding).
Basis Technology, for example, has a product:
http://www.basistech.com/language-identification/
I'm sure there are other companies and open source projects that
make use of N-gram algorithms.
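
To make the idea concrete, here is a minimal sketch in Python. It is only
an illustration of the byte N-gram approach, not how any particular product
works. The function names (byte_ngrams, cosine, detect), the choice of
trigrams, the cosine-similarity scoring, and the tiny training samples are
all my own assumptions; a usable detector would need far larger samples per
language and encoding.

```python
from collections import Counter
import math


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two N-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text (hypothetical samples, far too small for real use).
ENGLISH = "the quick brown fox jumps over the lazy dog and this is the way"
GERMAN = "der schnelle braune fuchs springt über den faulen hund und das ist"
JAPANESE = "これは日本語の文章です。言語を判定するための例です。"

# One profile per (language, encoding) pair, built from the encoded bytes,
# so the same statistics identify the encoding as well as the language.
SAMPLES = {
    ("English", "utf-8"): ENGLISH.encode("utf-8"),
    ("German", "utf-8"): GERMAN.encode("utf-8"),
    ("Japanese", "utf-8"): JAPANESE.encode("utf-8"),
    ("Japanese", "shift_jis"): JAPANESE.encode("shift_jis"),
}
PROFILES = {label: byte_ngrams(raw) for label, raw in SAMPLES.items()}


def detect(data: bytes) -> tuple:
    """Return the (language, encoding) whose profile is most similar."""
    grams = byte_ngrams(data)
    return max(PROFILES, key=lambda label: cosine(grams, PROFILES[label]))


if __name__ == "__main__":
    print(detect("this is the dog".encode("utf-8")))
    print(detect("日本語です".encode("shift_jis")))
```

Because each profile is built from encoded bytes rather than decoded
characters, the same Japanese text yields two distinct profiles, one for
UTF-8 and one for Shift_JIS, which is why the encoding falls out of the
match alongside the language.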
-- 
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/

Received on Thursday, 4 November 2004 07:01:52 UTC