Re: What is a language detection algorithm? from Douglas Bagnall on 2004-11-05 (www-international@w3.org from October to December 2004)

From: Douglas Bagnall <douglas@paradise.net.nz>
Date: Sat, 06 Nov 2004 11:45:54 +1300
To: www-international@w3.org
Message-id: <418C02A2.6000400@paradise.net.nz>

KUROSAKA Teruhiko wrote:
> 
> One way to detect/infer a language (and character encoding as a by-product)
> is the use of N-grams.  

I've been playing around with n-grams lately, so when I read this 
yesterday I wrote a simple language detection engine:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/326576

It doesn't pay attention to the character set, so it would treat the 
same text in significantly different encodings as different languages.
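
To illustrate the encoding problem, here is a minimal sketch. It is not 
the recipe's code: the function names (byte_trigrams, overlap) and the 
sample sentence are made up for this example, and it assumes the 
detector counts trigrams over raw, undecoded bytes.

```python
from collections import Counter


def byte_trigrams(data: bytes) -> Counter:
    """Count overlapping 3-byte sequences in raw, undecoded data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def overlap(a: Counter, b: Counter) -> float:
    """Fraction of trigram occurrences in `a` that are also found in `b`."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, b[gram]) for gram, count in a.items())
    return shared / total


# Hypothetical sample text; any sentence with a non-ASCII letter will do.
text = "Die Würde des Menschen ist unantastbar."

utf8 = byte_trigrams(text.encode("utf-8"))
latin1 = byte_trigrams(text.encode("latin-1"))
utf16 = byte_trigrams(text.encode("utf-16-le"))

print("utf-8 vs latin-1:", overlap(utf8, latin1))
print("utf-8 vs utf-16 :", overlap(utf8, utf16))
```

For this sentence the UTF-8 and Latin-1 profiles differ only around the 
"ü", while the UTF-8 and UTF-16 profiles share nothing at all, because 
every three-byte window of the UTF-16 data contains a zero byte.  A 
byte-level detector would therefore see the UTF-16 copy as an 
unrelated language.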

I also discovered a language guesser by Maciej Ceglowski at

http://languid.cantbedone.org/

the Perl source of which is available via:

http://www.idlewords.com/2004/11/source_code_for_language_guesser.htm

I've had a quick look, and it seems to combine a heuristic analysis of 
the characters used with a sort of truncated trigram model -- the 
reference for each language is restricted to its 300 most common 
combinations.  The result is probably a lot quicker than a naive n-gram 
comparison, and perhaps less susceptible to the noise of authorial 
voice, register and topic.
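
The truncated-trigram idea can be sketched roughly as follows.  This is 
not Ceglowski's Perl code and leaves out his character heuristics; the 
names (build_profile, score, guess_language), the toy training 
sentences and the scoring rule (the share of a sample's trigrams found 
in a profile) are all assumptions made for illustration.

```python
from collections import Counter


def trigrams(text: str):
    """Yield overlapping character trigrams of lowercased text
    with runs of whitespace collapsed to single spaces."""
    normalized = " ".join(text.lower().split())
    for i in range(len(normalized) - 2):
        yield normalized[i:i + 3]


def build_profile(training_text: str, size: int = 300) -> frozenset:
    """Keep only the `size` most frequent trigrams of the training text."""
    counts = Counter(trigrams(training_text))
    return frozenset(gram for gram, _ in counts.most_common(size))


def score(sample: str, profile: frozenset) -> float:
    """Share of the sample's trigrams that appear in the profile."""
    grams = list(trigrams(sample))
    if not grams:
        return 0.0
    return sum(gram in profile for gram in grams) / len(grams)


def guess_language(sample: str, profiles: dict) -> str:
    """Return the name of the profile that covers the sample best."""
    return max(profiles, key=lambda lang: score(sample, profiles[lang]))


# Toy training data (an assumption); real profiles need far more text.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze sitzt auf der matte",
}
PROFILES = {lang: build_profile(text) for lang, text in TRAINING.items()}

print(guess_language("the dog and the fox", PROFILES))
print(guess_language("der hund und die katze", PROFILES))
```

Because each profile is a fixed-size set, scoring costs one set lookup 
per trigram of the sample, however large the training corpus was.  
Dropping the rare trigrams is also what might make such a profile less 
sensitive to one author's vocabulary or topic.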

regards

Douglas Bagnall

Received on Friday, 5 November 2004 22:46:00 UTC