RE: What is a language detection algorithm? from Frank Yung-Fong Tang on 2004-11-05 (www-international@w3.org from October to December 2004)

From: Frank Yung-Fong Tang <ytang0648@aol.com>
Date: Fri, 5 Nov 2004 10:30:20 -0500
To: aphillips@webmethods.com
cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>, www-international@w3.org
Message-ID: <418B9C8C.1050302@aol.com>
The detector in Mozilla that Phillips mentioned is for charset 
detection, not language detection.

A good article about 'language detection' that you can read is
Linguini: Language Identification for Multilingual Documents, Prager,
John M. Journal of Management Information Systems, Winter 1999-2000.
Vol. 16, No. 3, pp. 71-101.

However, in that paper the author concludes that the same method could 
be used for Asian languages that use multibyte encodings. I disagree 
with that, because the only multibyte encodings he examines in that 
paper are

Korean EUC-KR
Japanese Shift-JIS
Chinese Big5

The encoding structures of these three are very different, so it is not 
hard to distinguish between them. However, once you also consider the 
following, I believe it will be hard to tell them apart:

Chinese GB2312
Chinese GBK
Chinese GB18030
Japanese EUC-JP
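
To illustrate the overlap, here is a minimal Python sketch that relies 
only on Python's built-in codecs. The helper name decodes_as is 
hypothetical, and this is only a structural validity check, not how any 
of the detectors discussed here actually work. It shows one byte pair 
that decodes cleanly under all four encodings above, while Shift-JIS 
text is rejected outright by a strict EUC-KR decoder:

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error under `encoding`."""
    try:
        data.decode(encoding)  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# 0xB0 0xA1 is row 16, cell 1 in both GB2312 and JIS X 0208 (EUC form),
# so the same two bytes are a legal character in all of these encodings.
ambiguous = b"\xb0\xa1"
for enc in ("gb2312", "gbk", "gb18030", "euc_jp"):
    print(enc, decodes_as(ambiguous, enc), ambiguous.decode(enc))

# Shift-JIS uses lead bytes below 0xA1, which EUC-KR never does, so a
# structural check alone is enough to rule EUC-KR out here.
sjis_text = "日本語".encode("shift_jis")
print("euc_kr", decodes_as(sjis_text, "euc_kr"))
```

When every candidate decoder accepts the bytes, structure tells you 
nothing and the decision has to come from character statistics.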

The other non-free language detection implementation you may find is 
from Alis. Netscape 6.0-6.1 (I don't remember whether we used it for 
6.2 or not) used the detector from Alis. As I understand it, the root 
of Alis's work is from the University of Montreal, and it probably also 
uses the N-gram model.
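
For readers unfamiliar with the N-gram model, the following is a rough 
Python sketch of the general idea, not the Alis or Linguini 
implementation. It builds character trigram counts for each language 
from a sample text and picks the language whose profile is closest (by 
cosine similarity) to the input. The names ngram_profile, cosine and 
guess_language, and the tiny training samples, are all assumptions made 
for illustration; a real detector would train on far more text.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Toy training samples (hypothetical; far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house with this and that",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat est dans la maison avec ceci et cela",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist in dem haus mit diesem und jenem",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text: str) -> str:
    """Return the language whose trigram profile best matches `text`."""
    profile = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(guess_language("the dog is in the house"))
print(guess_language("le chat est dans la maison"))
```

Note that this operates on decoded characters, so it assumes the charset 
is already known.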

Addison Phillips [wM] wrote on 11/3/2004, 9:20 PM:

 >
 > (I'm making an assumption here about what you mean: hopefully it will
 > answer your question)
 >
 > A language detection algorithm is a piece of software that attempts to
 > infer the language of some textual content by examining the content
 > alone. Generally this is necessary when one wishes to perform
 > language-affected operations on some text and needs to know the
 > language of the material and the information is not available from the
 > content metadata.
 >
 > Examples of content metadata would include HTTP's Content-Language
 > header, HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and
 > so forth. The best policy in language detection is avoidance: content
 > should use the various metadata mechanisms available to clearly
 > identify the language of content in order to avoid the need for
 > language detection.
 >
 > In the absence of this information, certain kinds of processing may be
 > difficult. For example, searching keywords in text requires splitting
 > the text up into words. Some languages require special handling in
 > order to do this. Or deciding what dictionary to use in spell-checking
 > would be another example.
 >
 > In the pre-Unicode era (and, to the extent that legacy encodings are
 > still used to encode content), it was sometimes possible to infer
 > some information about the language or range of possible languages
 > from the character encoding of the content. For example, the EUC-JP
 > encoding encodes Japanese characters and is most likely to be used to
 > encode Japanese language text (never mind that you can encode
 > perfectly good Russian or English with it!). Other encodings are more
 > difficult to infer from (for example, ISO 8859-1 aka Latin-1 is widely
 > used to encode text in several dozen languages, but it is unlikely,
 > for example, that a Latin-1 document is written in, say, Korean). And
 > of course Unicode encodings such as UTF-8 by themselves convey no
 > information at all about the language of the content.
 >
 > Absent a hint from the encoding, most LDAs use techniques such as the
 > relative frequency of different characters in the content. It is
 > possible to create quite good (but never perfect) language detectors,
 > given a sufficient amount of content to scan. Given some knowledge of
 > the text being scanned, you can improve the accuracy of your algorithm
 > (for example, if you know that all of the documents are French,
 > German, or Icelandic, you can ignore other possibilities or apply
 > shortcuts such as using "stop lists" of common words or scanning for
 > characters unique to each of these languages).
 >
 > Perversely, the most well-known open-source LDA is probably the one
 > described here:
 >
 > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 >
 > As the URI implies, the goal of that particular LDA is to try and
 > determine the character encoding used by scanning text for relative
 > frequency of characters (expressed in this case as byte sequences)
 > based on statistical frequency in documents in a particular range of
 > languages.

The N-gram model works for European languages, but it is not really 
practical for Asian languages. This one, architected by me and 
implemented by Shanjian Li, specifically focuses on how to address 
issues between Asian languages. However, since we targeted our 
implementation at 'client' usage, we optimized for memory usage, which 
traded off a lot of accuracy. If someone asked me to reimplement it for 
the server side, I would do a much better job :)

 >
 > Hope this helps.
 >
 > Best Regards,
 >
 > Addison
 >
 > Addison P. Phillips
 > Director, Globalization Architecture
 > webMethods | Delivering Global Business Visibility
 > http://www.webMethods.com
 > Chair, W3C Internationalization (I18N) Working Group
 > Chair, W3C-I18N-WG, Web Services Task Force
 > http://www.w3.org/International
 >
 > Internationalization is an architecture.
 > It is not a feature.
 >
 > > -----Original Message-----
 > > From: www-international-request@w3.org
 > > [mailto:www-international-request@w3.org]On Behalf Of smj (by way
 > > of Martin Duerst <duerst@w3.org>)
 > > Sent: 2004年11月3日 1:13
 > > To: www-international@w3.org
 > > Subject: What is a language detection algorithm?
 > >
 > >
 > >
 > >
 > > What is a language detection algorithm? What does that mean and
 > > how is it done?
 > >
 > > Thanks.
 > > <mailto:smj1@sndi.net>smj1@sndi.net
 > >
 >
 >
Received on Friday, 5 November 2004 15:31:06 UTC