Re: What is a language detection algorithm?

From: Douglas Bagnall <douglas@paradise.net.nz>
Date: Sat, 06 Nov 2004 11:45:54 +1300
To: www-international@w3.org
Message-id: <418C02A2.6000400@paradise.net.nz>

KUROSAKA Teruhiko wrote:
> One way to detect/infer a language (and character encoding as a by-product)
> is to use N-grams.

I've been playing around with n-grams lately, so when I read this 
yesterday I wrote a simple language detection engine:


It doesn't pay attention to the character set, so it would treat the same 
text in significantly different encodings as different languages.
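
A minimal Python sketch of a detector of this kind follows. It is not the 
engine mentioned above: the function names, the choice of trigrams and the 
cosine-similarity scoring are all assumptions made for illustration. It 
works on raw bytes, so it shares the limitation just described.

```python
from collections import Counter
import math


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count overlapping n-grams of raw bytes (no charset awareness)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(sample: bytes, references: dict) -> str:
    """Return the key of the reference profile most similar to the sample.

    `references` maps a language label to a profile built with
    ngram_profile() from known text in that language.
    """
    profile = ngram_profile(sample)
    return max(references,
               key=lambda lang: cosine_similarity(profile, references[lang]))
```

Because the profiles are built from bytes, the UTF-8 and UTF-16 encodings 
of one string share no trigrams at all, and a reference built from one 
encoding is useless for the other.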

I also discovered a language guesser by Maciej Ceglowski at


the Perl source of which is available via:


I've had a quick look, and it seems to combine a heuristic analysis of 
the characters used with a sort of truncated trigram model: the reference 
for each language is restricted to its 300 most common combinations.  The 
result is probably a lot quicker than a naive n-gram comparison, and 
perhaps less susceptible to the noise of authorial voice, register and topic.
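
The truncated-trigram idea can be sketched roughly as below. This is an 
illustration only, not a port of the Perl source: the character heuristics 
are left out, and the rank-based "out-of-place" distance is a commonly used 
scoring method that I am assuming here, not something confirmed from that 
code. All names are hypothetical; only the cutoff of 300 comes from the 
description above.

```python
from collections import Counter

TOP_N = 300  # cutoff taken from the description above


def top_trigrams(text: str, limit: int = TOP_N) -> list:
    """Return the `limit` most frequent trigrams, most common first."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(limit)]


def out_of_place(sample: list, reference: list) -> int:
    """Sum of rank differences; trigrams absent from the reference
    are charged the maximum penalty (the reference length)."""
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    max_penalty = len(reference)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else max_penalty
        for rank, gram in enumerate(sample)
    )


def guess_language_truncated(text: str, references: dict) -> str:
    """Pick the language whose truncated profile is closest to the text.

    `references` maps a language label to a list from top_trigrams().
    """
    sample = top_trigrams(text)
    return min(references,
               key=lambda lang: out_of_place(sample, references[lang]))
```

Each reference is at most 300 entries long, so a comparison touches only a 
few hundred trigrams however large the training text was, which is where 
the speed-up over a full n-gram table would come from.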


Douglas Bagnall
Received on Friday, 5 November 2004 22:46:00 UTC