- From: KUROSAKA Teruhiko <kuro@bhlab.com>
- Date: Thu, 04 Nov 2004 00:01:41 -0700
- To: www-international@w3.org
- CC: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>
One way to detect or infer a language (and the character encoding as a by-product) is to use N-grams. This technique makes use of statistics on the particular combinations of bytes that are likely to appear in a given language (and encoding). Basis Technology, for example, has a product: http://www.basistech.com/language-identification/ I'm sure there are other companies and open source projects that make use of N-gram algorithms.

--
KUROSAKA ("Kuro") Teruhiko, San Francisco, California, USA
Internationalization Consultant
http://www.bhlab.com/
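To make the N-gram idea concrete, here is a minimal sketch in Python. It is not how any particular product works; it only illustrates the general technique. It assumes byte trigrams (N = 3), cosine similarity as the comparison measure, and three toy training samples. The names `ngram_profile`, `cosine_similarity`, `detect`, `TRAINING` and `PROFILES` are made up for this illustration, and a real detector would need far larger training corpora.

```python
import math
from collections import Counter


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in the input."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two N-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(data: bytes, profiles: dict) -> tuple:
    """Return the (language, encoding) label whose profile is most similar."""
    sample = ngram_profile(data)
    return max(profiles, key=lambda label: cosine_similarity(sample, profiles[label]))


# Toy training text (an assumption for this sketch, far too small for real use).
# Each profile is built from the *encoded* bytes, so the label names both
# the language and the encoding.
TRAINING = {
    ("en", "utf-8"):
        "the quick brown fox jumps over the lazy dog and then the cat ran to the house",
    ("fr", "iso-8859-1"):
        "le renard brun saute par-dessus le chien paresseux et le chat est allé à la maison",
    ("ja", "shift_jis"):
        "これは日本語の文章です。言語の判定に使います。",
}

PROFILES = {
    label: ngram_profile(text.encode(label[1]))
    for label, text in TRAINING.items()
}
```

For example, `detect("日本語です".encode("shift_jis"), PROFILES)` picks the Japanese/Shift_JIS profile, because only that profile shares any byte trigrams with the input. Since the statistics are gathered over raw bytes, the encoding falls out of the same comparison as the language.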
Received on Thursday, 4 November 2004 07:01:52 UTC