KUROSAKA Teruhiko wrote:
>
> One way to detect/infer a language (and character encoding as a bi-product)
> is use of N-gram.

I've been playing around with n-grams lately, so when I read this yesterday I wrote a simple language detection engine:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/326576

It doesn't pay attention to the character set, so it would see the same text in significantly different encodings as different languages.

I also discovered a language guesser by Maciej Ceglowski at

http://languid.cantbedone.org/

the Perl source of which is available via:

http://www.idlewords.com/2004/11/source_code_for_language_guesser.htm

I've had a quick look, and it seems to combine a heuristic analysis of the characters used with a sort of truncated trigram model: the reference profile is restricted to the 300 most common trigrams in each language. The result is probably a lot quicker than a naive n-gram comparison, and perhaps less susceptible to the noise of authorial voice, register and topic.

regards

Douglas Bagnall

Received on Friday, 5 November 2004 22:46:00 GMT
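
To make the truncated-trigram idea concrete, here is a minimal Python sketch. It is neither the recipe linked above nor Ceglowski's Perl code. The function names (trigrams, build_profile, guess_language), the tiny training samples and the scoring rule (the share of a text's trigrams found in each language's profile) are all assumptions made for illustration. Only the cutoff of 300 trigrams per language comes from the description above.

```python
from collections import Counter


def trigrams(text):
    """Return the overlapping character trigrams of lower-cased text.

    Whitespace runs are collapsed and the text is padded with one space
    on each side, so word boundaries show up in the trigrams.
    """
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_profile(sample, size=300):
    """Keep only the `size` most common trigrams of a training sample."""
    counts = Counter(trigrams(sample))
    return {gram for gram, _ in counts.most_common(size)}


def guess_language(text, profiles):
    """Pick the language whose profile covers the most trigrams of `text`.

    This overlap score is an assumed, simplistic measure, not the one
    used by either program mentioned in this message.
    """
    grams = trigrams(text)
    if not grams:
        return None

    def coverage(lang):
        return sum(gram in profiles[lang] for gram in grams) / len(grams)

    return max(profiles, key=coverage)


# Hypothetical, very small training samples; real profiles need far more text.
profiles = {
    "en": build_profile(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
    ),
    "fr": build_profile(
        "le renard brun saute par dessus le chien paresseux et le chat dort sur le tapis"
    ),
}

print(guess_language("the dog and the cat", profiles))
print(guess_language("le chat et le chien", profiles))
```

Because the profiles are built from decoded text, this sketch has the same limitation noted above: the same text in a different, undecoded encoding would produce different trigrams and so look like a different language.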