Re: charset and language of C strings? from Gregor Erbach on 1998-03-06 (www-international@w3.org from January to March 1998)

From: Gregor Erbach <gor@dfki.de>
Date: Fri, 6 Mar 1998 19:08:49 +0100 (MET)
To: www-international@w3.org, janssen@parc.xerox.com
Message-Id: <199803061808.TAA02978@leninist.dfki.uni-sb.de>

Erik van der Poel writes:
> There are organizations that have worked on systems that guess the charset
> and/or language of a piece of text. Some of those organizations have people on
> this mailing list. Maybe they will reply.

I have recently looked into this topic, and found the following:

XEROX has a language guesser based on the frequencies of
trigrams of characters:
http://www.rxrc.xerox.com/research/mltt/Tools/guesser.html

INSO provides the IntelliScope Language Recogniser as
an OEM product:
http://www.inso.com/products/oem/oemboston/html/ilrds.htm

A freely available (GPL) Perl implementation of a language identifier
based on the frequencies of n-grams (0 < n < 6) can be obtained
from the U of Groningen:
http://grid.let.rug.nl/~vannoord/TextCat/
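
To give an idea of how such an n-gram guesser can work, here is a minimal
Python sketch of the rank-order ("out-of-place") comparison of n-gram
profiles, in the spirit of the Cavnar & Trenkle paper cited below. It is
not the TextCat code. The function names, the whitespace tokenization, the
underscore padding and the cutoff of 300 n-grams are my own illustrative
assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (1 <= n <= max_n),
    most frequent first. top_k=300 is an assumed cutoff, not a fixed rule."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries (a simplification)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from lang_profile costs a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, training):
    """training maps a language name to a sample text (hypothetical data).
    Returns the language whose profile is closest to that of text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

One would call it as guess_language("some text", {"english": english_sample,
"german": german_sample}), where the samples are reasonably long texts in
each language. With very short training samples the result is unreliable.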

Two references that describe three algorithms:

[Grefenstette 1995]
Gregory Grefenstette. 1995. Comparing Two Language Identification Schemes.
In: Proceedings of the 3rd International Conference on Statistical
Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
http://www.rxrc.xerox.com/publis/mltt/jadt/jadt.html

[Cavnar & Trenkle 1994]
William B. Cavnar and John M. Trenkle. 1994. N-Gram-Based Text
Categorization. In: Symposium on Document Analysis and Information
Retrieval, pages 161-176, University of Nevada, Las Vegas.
http://www.info.unicaen.fr/~giguet/classif/cavnar_trenkle_ngram.ps

regards,
Gregor Erbach

-----------------------------------------------------------------
Gregor Erbach gregor.erbach@dfki.de
DFKI GmbH phone: +49 681 302 5288
Stuhlsatzenhausweg 3 fax: +49 681 302 5338
D-66123 Saarbruecken, Germany http://www.dfki.de/~gor

Received on Friday, 6 March 1998 13:10:50 UTC