Re: What is a language detection algorithm?

From: KUROSAKA Teruhiko <kuro@bhlab.com>
Date: Thu, 04 Nov 2004 00:01:41 -0700
Message-ID: <4189D3D5.4040600@bhlab.com>
To: www-international@w3.org
CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>

One way to detect or infer a language (and, as a by-product, the
character encoding) is to use N-grams.  This technique relies on
statistics of the particular byte combinations that are likely to
appear in a given language (and encoding).
Basis Technology, for example, has a product:
http://www.basistech.com/language-identification/
I'm sure there are other companies and open source projects that
make use of N-gram algorithms.
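
To make the idea concrete, here is a rough sketch in Python of byte
trigram detection.  It is only an illustration of the general
technique, not how the Basis Technology product or any other real
detector works.  The names (byte_ngrams, train, score, detect,
SAMPLES, FLOOR) are all made up for this example, the training
samples are far too small for real use, and the fixed floor for
unseen trigrams is a simplification I am assuming in place of proper
smoothing.

```python
import math
from collections import Counter

N = 3          # n-gram length in bytes (assumed; trigrams are a common choice)
FLOOR = 1e-6   # assumed stand-in probability for n-grams never seen in training


def byte_ngrams(data: bytes, n: int = N) -> Counter:
    """Count every overlapping run of n bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train(samples: dict, n: int = N) -> dict:
    """Build one n-gram count profile per (language, encoding) label."""
    return {label: byte_ngrams(raw, n) for label, raw in samples.items()}


def score(profile: Counter, data: bytes, n: int = N) -> float:
    """Sum of log relative frequencies of the n-grams found in data."""
    total = sum(profile.values()) or 1
    return sum(
        count * math.log(max(profile[gram] / total, FLOOR))
        for gram, count in byte_ngrams(data, n).items()
    )


def detect(profiles: dict, data: bytes, n: int = N):
    """Return the label whose profile gives data the highest score."""
    return max(profiles, key=lambda label: score(profiles[label], data, n))


# Toy training data: the same Japanese text in two encodings, plus
# English and French.  Real profiles need far more text than this.
JA = "これは日本語の文章です。言語を判定するための例です。"
SAMPLES = {
    ("en", "utf-8"): ("the quick brown fox jumps over the lazy dog and then "
                      "the dog runs after the fox through the field"
                      ).encode("utf-8"),
    ("fr", "iso-8859-1"): ("le renard brun saute par-dessus le chien "
                           "paresseux et le chien court après le renard "
                           "à travers le champ").encode("latin-1"),
    ("ja", "shift_jis"): JA.encode("shift_jis"),
    ("ja", "euc-jp"): JA.encode("euc_jp"),
}

profiles = train(SAMPLES)
print(detect(profiles, b"the dog and the fox"))
print(detect(profiles, "日本語の例です。".encode("shift_jis")))
```

Because the profiles are built from raw bytes rather than decoded
characters, the same Japanese text yields different trigrams in
Shift_JIS and EUC-JP, which is why the encoding falls out of the
language guess as a by-product.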
-- 
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/
Received on Thursday, 4 November 2004 07:01:52 GMT