- From: Douglas Bagnall <douglas@paradise.net.nz>
- Date: Sat, 06 Nov 2004 11:45:54 +1300
- To: www-international@w3.org
KUROSAKA Teruhiko wrote:
>
> One way to detect/infer a language (and character encoding as a bi-product)
> is use of N-gram.

I've been playing around with n-grams lately, so when I read this yesterday I wrote a simple language detection engine:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/326576

It doesn't pay attention to the character set, so would see the same text in significantly different encodings as different languages.

I also discovered a language guesser by Maciej Ceglowski at

http://languid.cantbedone.org/

the perl source of which is available via:

http://www.idlewords.com/2004/11/source_code_for_language_guesser.htm

I've had a quick look, and it seems to combine a heuristic analysis of the characters used with a sort of truncated trigram -- the reference is restricted to the 300 most common combinations in each language. The result is probably a lot quicker than a naive n-gram, and perhaps less susceptible to the noise of authorial voice, register and topic.

regards

Douglas Bagnall
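
For illustration only, here is a minimal Python sketch of the truncated-trigram idea described above. It is not the code from either linked project: the function names, the tiny sample texts and the overlap-based score are all assumptions made for this example, and the character heuristics are left out. It works on already-decoded strings, so it does not address the encoding problem either.

```python
from collections import Counter


def trigrams(text):
    """Return the overlapping character trigrams of a text.

    The text is lowercased, runs of whitespace are collapsed, and one
    space is padded on each side so word boundaries show up as trigrams.
    """
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(sample, limit=300):
    """Keep only the `limit` most common trigrams of a reference sample."""
    counts = Counter(trigrams(sample))
    return {gram for gram, _ in counts.most_common(limit)}


def guess_language(text, profiles):
    """Pick the language whose truncated profile covers most of the text.

    The score is the fraction of the text's trigrams found in a profile.
    This scoring rule is an assumption chosen for brevity.
    """
    grams = trigrams(text)
    if not grams:
        return None

    def score(profile):
        return sum(1 for gram in grams if gram in profile) / len(grams)

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Hypothetical, far-too-small reference samples; real profiles would be
# built from much larger texts per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog runs after the fox through the woods",
    "french": "le renard brun saute par dessus le chien paresseux et "
              "puis le chien court après le renard dans les bois",
}
PROFILES = {lang: build_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the fox", PROFILES))
    print(guess_language("le chien et le renard", PROFILES))
```

Because each profile is capped at a fixed size, a lookup costs the same however large the reference corpus was, which is one plausible reason a truncated profile would be quicker than comparing full n-gram tables.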
Received on Friday, 5 November 2004 22:46:00 UTC