Re: charset and language of C strings?

From: Gregor Erbach <gor@dfki.de>
Date: Fri, 6 Mar 1998 19:08:49 +0100 (MET)
Message-Id: <199803061808.TAA02978@leninist.dfki.uni-sb.de>
To: www-international@w3.org, janssen@parc.xerox.com
Erik van der Poel writes:
> There are organizations that have worked on systems that guess the charset
> and/or language of a piece of text. Some of those organizations have people on
> this mailing list. Maybe they will reply.

I have recently looked into this topic, and found the following:

XEROX has a language guesser based on the frequencies of
trigrams of characters:

INSO provides the IntelliScope Language Recogniser as 
an OEM product:

A freely available (GPL) Perl implementation of a language identifier
based on the frequencies of n-grams (0 < n < 6) can be obtained
from the U of Groningen:

Two references that describe three algorithms:

[Grefenstette 1995]
Gregory Grefenstette, Comparing Two Language Identification Schemes. 
In the proceedings of 3rd International Conference on Statistical 
Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.

[Cavnar & Trenkle 1994]
William B. Cavnar and John M. Trenkle, N-Gram-Based Text Categorization.
In: Symposium on Document Analysis and Information Retrieval,
pages 161-176, University of Nevada, Las Vegas, 1994.
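
To give an idea of how the n-gram approach works, here is a rough
Python sketch loosely modelled on the rank-order ("out-of-place")
comparison described in [Cavnar & Trenkle 1994]. It is not the code
of any of the products or packages mentioned above. The function
names, the underscore padding, the profile size of 300 and the toy
training strings are my own assumptions for illustration; a usable
guesser would need far larger training samples per language.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Uses n-grams with 1 <= n <= max_n, i.e. 0 < n < 6 by default.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (illustrative choice)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram missing from lang_profile costs a fixed maximum penalty.
    """
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


if __name__ == "__main__":
    # Toy training data (hypothetical, far too small for real use).
    samples = {
        "en": "the quick brown fox jumps over the lazy dog and the cat",
        "de": "der schnelle braune fuchs springt ueber den faulen hund",
    }
    profiles = {lang: ngram_profile(s) for lang, s in samples.items()}
    print(guess_language("the dog and the fox", profiles))
```

The smaller the out-of-place distance, the more similar the n-gram
rankings of the text and the language sample; the guesser simply
picks the language with the smallest distance.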

    Gregor Erbach

Gregor Erbach                               gregor.erbach@dfki.de
DFKI GmbH                                 phone: +49 681 302 5288
Stuhlsatzenhausweg 3                        fax: +49 681 302 5338
D-66123 Saarbruecken, Germany             http://www.dfki.de/~gor
Received on Friday, 6 March 1998 13:10:50 UTC
