- From: Gregor Erbach <gor@dfki.de>
- Date: Fri, 6 Mar 1998 19:08:49 +0100 (MET)
- To: www-international@w3.org, janssen@parc.xerox.com
Erik van der Poel writes:

> There are organizations that have worked on systems that guess the charset
> and/or language of a piece of text. Some of those organizations have people on
> this mailing list. Maybe they will reply.

I have recently looked into this topic and found the following:

XEROX has a language guesser based on the frequencies of character trigrams:
http://www.rxrc.xerox.com/research/mltt/Tools/guesser.html

INSO provides the IntelliScope Language Recogniser as an OEM product:
http://www.inso.com/products/oem/oemboston/html/ilrds.htm

A freely available (GPL) Perl implementation of a language identifier based
on the frequencies of n-grams (0 < n < 6) can be obtained from the
U of Groningen:
http://grid.let.rug.nl/~vannoord/TextCat/

Two references that describe three algorithms:

[Grefenstette 1995] Gregory Grefenstette. Comparing Two Language
Identification Schemes. In: Proceedings of the 3rd International Conference
on Statistical Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
http://www.rxrc.xerox.com/publis/mltt/jadt/jadt.html

[Cavnar & Trenkle 1994] William B. Cavnar and John M. Trenkle. 1994.
N-Gram-Based Text Categorization. In: Symposium On Document Analysis and
Information Retrieval, pages 161-176, University of Nevada, Las Vegas.
http://www.info.unicaen.fr/~giguet/classif/cavnar_trenkle_ngram.ps

Regards,
Gregor Erbach

-----------------------------------------------------------------
Gregor Erbach                      gregor.erbach@dfki.de
DFKI GmbH                          phone: +49 681 302 5288
Stuhlsatzenhausweg 3               fax:   +49 681 302 5338
D-66123 Saarbruecken, Germany      http://www.dfki.de/~gor
Received on Friday, 6 March 1998 13:10:50 UTC