RE: What is a language detection algorithm? from Addison Phillips [wM] on 2004-11-04 (www-international@w3.org from October to December 2004)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Wed, 3 Nov 2004 18:20:46 -0800
To: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, <www-international@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHMEAGIMAA.aphillips@webmethods.com>

(I'm making an assumption here about what you mean: hopefully it will answer your question)

A language detection algorithm (LDA) is a piece of software that attempts to infer the language of some textual content by examining the content alone. Generally this is necessary when one wishes to perform language-affected operations on some text, and so needs to know its language, but that information is not available from the content metadata.

Examples of content metadata would include HTTP's Content-Language header, HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and so forth. The best policy in language detection is avoidance: content should use the various metadata mechanisms available to clearly identify the language of content in order to avoid the need for language detection.
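To make the "avoidance" policy concrete, here is a minimal sketch in Python of checking for a declared language before falling back to detection. The function name declared_language, the helper class, and the order in which the sources are consulted are all assumptions made for illustration; they are not taken from any specification.

```python
from html.parser import HTMLParser


class _LangFinder(HTMLParser):
    """Hypothetical helper: records the first language declared in the markup."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang is not None:
            return
        attr = dict(attrs)
        if tag == "html":
            # lang (HTML/XHTML) or xml:lang (XHTML/XML) on the root element
            self.lang = attr.get("lang") or attr.get("xml:lang")
        elif tag == "meta" and (attr.get("http-equiv") or "").lower() == "content-language":
            self.lang = attr.get("content")


def declared_language(headers, html_text):
    """Return the declared language tag, or None if nothing is declared."""
    finder = _LangFinder()
    finder.feed(html_text)
    if finder.lang:
        return finder.lang
    # Assumption: fall back to the HTTP header only when the markup is silent.
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            return value.split(",")[0].strip()
    return None
```

Only when this returns None would a detector need to look at the text itself.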

In the absence of this information, certain kinds of processing may be difficult. For example, searching for keywords in text requires splitting the text up into words, and some languages require special handling in order to do this. Deciding which dictionary to use in spell-checking is another example.

In the pre-Unicode era (and, to the extent that legacy encodings are still used to encode content, today), it was sometimes possible to infer some information about the language or range of possible languages from the character encoding of the content. For example, the EUC-JP encoding encodes Japanese characters and is most likely to be used to encode Japanese language text (never mind that you can encode perfectly good Russian or English with it!). Other encodings give a weaker hint: ISO 8859-1 (aka Latin-1) is widely used to encode text in several dozen languages, so it cannot tell you which one you have, although it does make it unlikely that the document is written in, say, Korean. And of course Unicode encodings such as UTF-8 by themselves convey no information at all about the language of the content.
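As a rough illustration of using the encoding as a hint, the Python sketch below maps a charset label to a list of plausible languages. The table is a small hand-made sample and the function name language_hints is hypothetical; the Latin-1 entry in particular is only a partial list.

```python
import codecs

# Illustrative, incomplete table: charset label -> plausible language tags.
_RAW_HINTS = {
    "euc-jp": ["ja"],
    "shift_jis": ["ja"],
    "euc-kr": ["ko"],
    "koi8-r": ["ru"],
    "iso-8859-1": ["en", "fr", "de", "es", "is"],  # several dozen in practice
}

# Normalise the keys through Python's codec registry so that aliases
# such as "Latin-1" and "ISO-8859-1" end up on the same entry.
ENCODING_HINTS = {codecs.lookup(label).name: langs for label, langs in _RAW_HINTS.items()}


def language_hints(charset_label):
    """Return candidate languages for a charset label; [] means "no hint"."""
    try:
        name = codecs.lookup(charset_label).name
    except LookupError:
        return []  # unknown label, so nothing can be inferred
    return ENCODING_HINTS.get(name, [])
```

Here language_hints("EUC-JP") yields ["ja"], while language_hints("UTF-8") yields an empty list, since a Unicode encoding says nothing about the language. The result is a hint for narrowing the candidates, not an answer.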

Absent a hint from the encoding, most LDAs use techniques such as comparing the relative frequency of different characters in the content. It is possible to create quite good (but never perfect) language detectors, given a sufficient amount of content to scan. Given some knowledge of the text being scanned, you can improve the accuracy of your algorithm (for example, if you know that all of the documents are French, German, or Icelandic, you can ignore other possibilities or apply shortcuts such as using "stop lists" of common words or scanning for characters unique to each of these languages).
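The French/German/Icelandic shortcut could be sketched in Python roughly as follows. The stop lists and the "unique character" sets are short, hand-picked assumptions for illustration (a real detector would derive them from corpora), and the weighting of 2 per distinctive character is arbitrary.

```python
import re

# Hypothetical, hand-picked stop lists of very common words.
STOP_WORDS = {
    "French": {"le", "la", "les", "et", "est", "un", "une", "des", "que", "dans"},
    "German": {"der", "die", "das", "und", "ist", "ein", "eine", "nicht", "mit", "zu"},
    "Icelandic": {"og", "að", "er", "sem", "ekki", "það", "við", "til", "um", "hann"},
}

# Characters that, among these three languages only, point to one of them
# (a rough assumption; loanwords and names can break it).
UNIQUE_CHARS = {
    "French": set("çœêèàù"),
    "German": set("ßä"),
    "Icelandic": set("þð"),
}


def guess_language(text):
    """Return "French", "German", "Icelandic", or None if there is no evidence."""
    lowered = text.lower()
    words = re.findall(r"[^\W\d_]+", lowered)  # runs of letters, Unicode-aware
    scores = {}
    for lang in STOP_WORDS:
        word_hits = sum(1 for w in words if w in STOP_WORDS[lang])
        char_hits = sum(1 for ch in lowered if ch in UNIQUE_CHARS[lang])
        scores[lang] = word_hits + 2 * char_hits
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With very short input there may be no evidence at all, which is why the sketch returns None rather than guessing.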

Perversely, the most well-known open-source LDA is probably the one described here:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

As the URI implies, the goal of that particular algorithm is to determine the character encoding rather than the language. It does so by scanning the text for the relative frequency of characters (expressed in this case as byte sequences) and comparing that against the statistical frequency observed in documents in a particular range of languages.
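To give a feel for the idea (this is not the Mozilla implementation, only a much cruder sketch in Python), one can decode the bytes under each candidate encoding, discard the candidates that fail, and score the rest by how plausible the resulting characters are. The candidate list, the function names, and the "share of kana" score are assumptions chosen for Japanese text.

```python
# Assumed candidate encodings, tried in this order.
CANDIDATES = ["utf-8", "euc_jp", "shift_jis"]


def _is_kana(ch):
    # Hiragana and Katakana blocks, U+3040 through U+30FF.
    return "\u3040" <= ch <= "\u30ff"


def guess_encoding(data):
    """Return the candidate whose decoding looks most like Japanese, or None."""
    best, best_score = None, 0.0
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # byte sequence is not legal in this encoding
        non_ascii = [c for c in text if ord(c) > 127]
        if not non_ascii:
            continue  # pure ASCII: the bytes give no evidence either way
        score = sum(1 for c in non_ascii if _is_kana(c)) / len(non_ascii)
        if score > best_score:
            best, best_score = enc, score
    return best
```

A wrong guess typically either fails to decode or produces characters that are rare in real Japanese text, which is what the score penalizes.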

Hope this helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -----Original Message-----
> From: www-international-request@w3.org 
> [mailto:www-international-request@w3.org]On Behalf Of smj (by way 
> of Martin Duerst <duerst@w3.org>)
> Sent: November 3, 2004 1:13
> To: www-international@w3.org
> Subject: What is a language detection algorithm?
> 
> 
> 
> 
> What is a language detection algorithm? What does that mean and 
> how is it done?
> 
> Thanks.
> smj1@sndi.net
>

Received on Thursday, 4 November 2004 02:21:05 UTC