Re: What is a language detection algorithm?

From: KUROSAKA Teruhiko <kuro@bhlab.com>
Date: Thu, 04 Nov 2004 00:01:41 -0700
Message-ID: <4189D3D5.4040600@bhlab.com>
To: www-international@w3.org
CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>

One way to detect or infer a language (and, as a by-product, the character
encoding) is to use N-grams.  This technique makes use of statistics on
particular combinations of bytes that are likely to appear
in a given language (and encoding).
Basis Technology, for example, has such a product.
I'm sure there are other companies and open source projects that
make use of N-gram algorithms.
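
To make the idea concrete, here is a minimal sketch of byte N-gram
detection.  It is only an illustration, not how any particular product
works: the tiny training samples, the choice of trigrams, the cosine
similarity score, and the names byte_ngrams, cosine, PROFILES and detect
are all assumptions of mine.  A real detector would be trained on far
more text per language and encoding.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two N-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training text; far too small for real use.
JA_TEXT = "これは日本語の文章です。言語を判別するための例です。"
SAMPLES = {
    ("en", "ascii"):
        "the quick brown fox jumps over the lazy dog "
        "and then the dog sleeps".encode("ascii"),
    ("fr", "iso-8859-1"):
        "le renard brun saute par-dessus le chien paresseux "
        "et le chien dort déjà".encode("iso-8859-1"),
    ("ja", "shift_jis"): JA_TEXT.encode("shift_jis"),
    ("ja", "euc_jp"): JA_TEXT.encode("euc_jp"),
}

# One byte-trigram profile per (language, encoding) pair.
PROFILES = {label: byte_ngrams(data) for label, data in SAMPLES.items()}


def detect(data: bytes) -> tuple:
    """Return the (language, encoding) label whose profile is most similar."""
    grams = byte_ngrams(data)
    scores = {label: cosine(grams, prof) for label, prof in PROFILES.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(detect(b"the dog jumps over the fox"))
    print(detect("日本語の例です。".encode("shift_jis")))
```

Because the profiles are built from raw bytes rather than decoded
characters, the same Japanese text yields different N-grams in Shift_JIS
and in EUC-JP, which is why the encoding falls out of the same
comparison as the language.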
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
Received on Thursday, 4 November 2004 07:01:52 UTC