Re: charset and language of C strings?

Erik van der Poel writes:
> There are organizations that have worked on systems that guess the charset
> and/or language of a piece of text. Some of those organizations have people on
> this mailing list. Maybe they will reply.

I have recently looked into this topic, and found the following:

XEROX has a language guesser based on the frequencies of
trigrams of characters:
http://www.rxrc.xerox.com/research/mltt/Tools/guesser.html

INSO provides the IntelliScope Language Recogniser as 
an OEM product:
http://www.inso.com/products/oem/oemboston/html/ilrds.htm

A freely available (GPL-licensed, so not strictly public domain) Perl
implementation of a language identifier based on the frequencies of
character n-grams (1 <= n <= 5) can be obtained from the University
of Groningen:
http://grid.let.rug.nl/~vannoord/TextCat/
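
To make the n-gram idea concrete, here is a rough sketch of my own in
Python (not TextCat's Perl code) of the rank-order "out-of-place"
comparison described in [Cavnar & Trenkle 1994]. The function names,
the "_" word padding, the profile size of 300 and the penalty for
missing n-grams are all assumptions chosen for illustration; a usable
guesser would need far larger training samples per language.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Rank the character n-grams of `text` (1 <= n <= max_n), most frequent first.

    top_k=300 is an illustrative cut-off, not a value taken from any tool.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # assumed padding to mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles (smaller means more similar)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # assumed penalty for n-grams missing from lang_profile
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, training):
    """Return the key of `training` (label -> sample text) whose profile is closest."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Calling guess_language(text, {"en": english_sample, "de": german_sample})
returns the label whose sample profile has the smallest distance to the
profile of the text. The same approach might also help with charset
guessing if the profiles were built over bytes rather than characters,
but I have not tried that.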

Two references that describe three algorithms:

[Grefenstette 1995]
Gregory Grefenstette, Comparing Two Language Identification Schemes.
In Proceedings of the 3rd International Conference on Statistical
Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
http://www.rxrc.xerox.com/publis/mltt/jadt/jadt.html

[Cavnar & Trenkle 1994]
William B. Cavnar and John M. Trenkle, N-Gram-Based Text Categorization.
In Symposium on Document Analysis and Information Retrieval,
pages 161-176, University of Nevada, Las Vegas, 1994.
http://www.info.unicaen.fr/~giguet/classif/cavnar_trenkle_ngram.ps

regards,
    Gregor Erbach

-----------------------------------------------------------------
Gregor Erbach                               gregor.erbach@dfki.de
DFKI GmbH                                 phone: +49 681 302 5288
Stuhlsatzenhausweg 3                        fax: +49 681 302 5338
D-66123 Saarbruecken, Germany             http://www.dfki.de/~gor

Received on Friday, 6 March 1998 13:10:50 UTC