
Re: What is a language detection algorithm?

From: Frank Yung-Fong Tang <ytang0648@aol.com>
Date: Fri, 5 Nov 2004 18:19:45 -0500
To: smj <smj1@sndi.net>, www-international@w3.org
cc: kuro@bhlab.com
Message-ID: <418C0A91.4060400@aol.com>

BTW, the Basis Rosette Language Identifier (see 
http://www.basistech.com/language-identification/ ) may fulfill what you 
want. I have not tried it myself. Maybe Kuro-san [KUROSAKA 
Teruhiko <kuro@bhlab.com>] can help you with that.


smj wrote on 11/5/2004, 4:52 PM:

 > Thank you Frank. You have been very informative.
 >
 > According to http://babelfish.altavista.com/tr
 > "hello" in English means "こんにちは" in Japanese,
 > and the same "こんにちは" in Japanese means "Today" in English.
 >
 > That is nice. But it is not what I am after.
 >
 > I simply wanted to know, if I typed in, pasted, or programmed in
 > the letter (or glyph, or whatever it is) "こ",
 > what language it is in.
 >
 > Like this: "こ" = Japanese Shift-JIS, or Japanese EUC-JP, or whatever
 > it is.
 >
 > Are there reference tables that could help with this?
 >
 > James
 > smj1@sndi.net
 >
 >
 >
 > ----- Original Message -----
 > From: "Frank Yung-Fong Tang" <ytang0648@aol.com>
 > To: <aphillips@webmethods.com>
 > Cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>;
 > <www-international@w3.org>
 > Sent: Friday, November 05, 2004 10:30 AM
 > Subject: RE: What is a language detection algorithm?
 >
 >
 > The detection code in Mozilla that Phillips mentioned is not for
 > language detection but for charset detection.
 >
 > A good article about language detection is Prager, John M.,
 > "Linguini: Language Identification for Multilingual Documents",
 > Journal of Management Information Systems, Winter 1999-2000,
 > Vol. 16, No. 3, pp. 71-101.
 >
 > However, in that paper the author concludes that the same method
 > could be used for Asian languages, which use multibyte encodings. I
 > disagree, because the only multibyte encodings he examines in the
 > paper are
 >
 > Korean EUC-KR
 > Japanese Shift-JIS
 > Chinese Big5
 >
 > The encoding structures of these three are very different, so it is
 > not hard to distinguish among them. However, once you add the
 > following to the mix, I believe it becomes hard to tell them apart
 > (see the sketch after this list):
 >
 > Chinese GB2312
 > Chinese GBK
 > Chinese GB18030
 > Japanese EUC-JP
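 >
 > To make that concrete, here is a minimal structural check in Python.
 > The lead/trail byte ranges come from the published encoding specs,
 > but this is a simplified sketch: Shift-JIS half-width katakana,
 > EUC-JP's SS2/SS3 prefixes, and the GBK/GB18030 extensions are all
 > omitted.
 >
 >     # Which encodings can this byte stream legally belong to?
 >     LEAD_TRAIL = {
 >         # encoding: (legal lead-byte ranges, legal trail-byte ranges)
 >         "Shift_JIS": ([(0x81, 0x9F), (0xE0, 0xEF)],
 >                       [(0x40, 0x7E), (0x80, 0xFC)]),
 >         "Big5":      ([(0xA1, 0xF9)], [(0x40, 0x7E), (0xA1, 0xFE)]),
 >         "EUC-KR":    ([(0xA1, 0xFE)], [(0xA1, 0xFE)]),
 >         # The problem cases: structurally near-identical to EUC-KR.
 >         "GB2312":    ([(0xA1, 0xF7)], [(0xA1, 0xFE)]),
 >         "EUC-JP":    ([(0xA1, 0xFE)], [(0xA1, 0xFE)]),
 >     }
 >
 >     def in_ranges(b, ranges):
 >         return any(lo <= b <= hi for lo, hi in ranges)
 >
 >     def structurally_possible(data):
 >         ok = set()
 >         for enc, (leads, trails) in LEAD_TRAIL.items():
 >             i, legal = 0, True
 >             while i < len(data):
 >                 if data[i] < 0x80:      # ASCII is legal everywhere
 >                     i += 1
 >                     continue
 >                 if (i + 1 >= len(data)
 >                         or not in_ranges(data[i], leads)
 >                         or not in_ranges(data[i + 1], trails)):
 >                     legal = False
 >                     break
 >                 i += 2
 >             if legal:
 >                 ok.add(enc)
 >         return ok
 >
 > Over enough text, illegal byte pairs quickly rule out the
 > structurally distinct encodings. But for EUC-KR, GB2312, and EUC-JP,
 > almost every two-byte sequence is legal in all three, so structure
 > alone cannot separate them; you need character-frequency statistics
 > on top.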
 >
 > Another, non-free, language detection implementation you may find is
 > from Alis. Netscape 6.0-6.1 used the detector from Alis (I don't
 > remember whether we used it for 6.2). As I understand it, Alis's
 > work has its roots at the Université de Montréal and probably also
 > uses the N-gram model.
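 >
 > For what the N-gram model looks like in practice (not necessarily
 > what Alis does), here is a toy character-trigram detector in the
 > style of Cavnar & Trenkle's "N-Gram-Based Text Categorization", the
 > usual reference for this family; the training samples are up to you:
 >
 >     from collections import Counter
 >
 >     def profile(text, n=3, top=300):
 >         # Rank the most frequent character n-grams in the text.
 >         text = " ".join(text.lower().split())
 >         grams = Counter(text[i:i + n]
 >                         for i in range(len(text) - n + 1))
 >         return [g for g, _ in grams.most_common(top)]
 >
 >     def out_of_place(doc, lang):
 >         # Cavnar-Trenkle "out-of-place" distance between profiles.
 >         pos = {g: i for i, g in enumerate(lang)}
 >         return sum(abs(i - pos.get(g, len(lang)))
 >                    for i, g in enumerate(doc))
 >
 >     def guess_language(text, lang_profiles):
 >         # lang_profiles: {"en": profile(en_sample), "fr": ...}
 >         doc = profile(text)
 >         return min(lang_profiles,
 >                    key=lambda l: out_of_place(doc, lang_profiles[l]))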
 >
 > Addison Phillips [wM] wrote on 11/3/2004, 9:20 PM:
 >
 > >
 > > (I'm making an assumption here about what you mean: hopefully it will
 > > answer your question)
 > >
 > > A language detection algorithm is a piece of software that attempts to
 > > infer the language of some textual content by examining the content
 > > alone. Generally this is necessary when one wishes to perform
 > > language-affected operations on some text and needs to know the
 > > language of the material, but that information is not available from
 > > the content metadata.
 > >
 > > Examples of content metadata would include HTTP's Content-Language
 > > header, HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and
 > > so forth. The best policy in language detection is avoidance: content
 > > should use the various metadata mechanisms available to clearly
 > > identify the language of content in order to avoid the need for
 > > language detection.
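 > >
 > > As a sketch of "check the metadata first" (the URL is a
 > > placeholder), in Python:
 > >
 > >     from urllib.request import urlopen
 > >
 > >     with urlopen("http://example.com/page") as resp:
 > >         lang = resp.headers.get("Content-Language")
 > >         if lang:
 > >             print("Declared language:", lang)  # e.g. "ja", "en-US"
 > >         else:
 > >             body = resp.read()
 > >             # ...only now is a language detector needed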
 > >
 > > In the absence of this information, certain kinds of processing may be
 > > difficult. For example, searching keywords in text requires splitting
 > > the text up into words. Some languages require special handling in
 > > order to do this. Deciding which dictionary to use for spell-checking
 > > is another example.
 > >
 > > In the pre-Unicode era (and, to the extent that legacy encodings are
 > > still used to encode content), it was sometimes possible to infer
 > > some information about the language or range of possible languages
 > > from the character encoding of the content. For example, the EUC-JP
 > > encoding encodes Japanese characters and is most likely to be used to
 > > encode Japanese language text (never mind that you can encode
 > > perfectly good Russian or English with it!). Other encodings are more
 > > difficult to infer from (for example, ISO 8859-1, aka Latin-1, is
 > > widely used to encode text in several dozen languages, but it is
 > > unlikely that a Latin-1 document is written in, say, Korean). And
 > > of course Unicode encodings such as UTF-8 by themselves convey no
 > > information at all about the language of the content.
 > >
 > > Absent a hint from the encoding, most LDAs use techniques such as the
 > > relative frequency of different characters in the content. It is
 > > possible to create quite good (but never perfect) language detectors,
 > > given a sufficient amount of content to scan. Given some knowledge of
 > > the text being scanned, you can improve the accuracy of your algorithm
 > > (for example, if you know that all of the documents are French,
 > > German, or Icelandic, you can ignore other possibilities or apply
 > > shortcuts such as using "stop lists" of common words or scanning for
 > > characters unique to each of these languages).
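 > >
 > > For that closed French/German/Icelandic case, such a shortcut
 > > detector can be only a few lines of Python (stop lists abbreviated
 > > here; real ones would be longer):
 > >
 > >     STOP = {
 > >         "fr": {"le", "la", "les", "de", "et", "un", "une", "est"},
 > >         "de": {"der", "die", "das", "und", "ist", "ein", "nicht"},
 > >         "is": {"og", "að", "er", "ekki", "það", "sem", "við"},
 > >     }
 > >
 > >     def guess_closed_set(text):
 > >         # A character unique to one language is near-decisive:
 > >         if "þ" in text.lower():
 > >             return "is"
 > >         words = text.lower().split()
 > >         hits = {lang: sum(w in stops for w in words)
 > >                 for lang, stops in STOP.items()}
 > >         return max(hits, key=hits.get)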
 > >
 > > Perversely, the most well-known open-source LDA is probably the one
 > > described here:
 > >
 > > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 > >
 > > As the URI implies, the goal of that particular LDA is to try to
 > > determine the character encoding by scanning the text for the
 > > relative frequency of characters (expressed in this case as byte
 > > sequences), based on their statistical frequency in documents in a
 > > particular range of languages.
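 > >
 > > The statistical core of such a detector can be sketched as scoring
 > > byte pairs against per-encoding frequency tables built from
 > > training documents (the tables here are hypothetical):
 > >
 > >     import math
 > >
 > >     def score(data, table):
 > >         # table maps (byte, byte) -> log-probability of that pair
 > >         floor = math.log(1e-9)   # penalty for unseen pairs
 > >         return sum(table.get((a, b), floor)
 > >                    for a, b in zip(data, data[1:]))
 > >
 > >     def detect(data, tables):
 > >         # tables: {"EUC-JP": table, "Shift_JIS": table, ...}
 > >         return max(tables, key=lambda enc: score(data, tables[enc]))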
 >
 > The N-gram model works for European languages, but it is not really
 > practical for Asian languages. This one, architected by me and
 > implemented by Shanjian Li, focuses specifically on addressing the
 > issues among Asian languages. However, since we targeted our
 > implementation at 'client' usage, we optimized for memory usage,
 > which traded off a lot of accuracy. If someone asked me to reimplement
 > it for the server side, I would do a much better job :)
 >
 > >
 > > Hope this helps.
 > >
 > > Best Regards,
 > >
 > > Addison
 > >
 > > Addison P. Phillips
 > > Director, Globalization Architecture
 > > webMethods | Delivering Global Business Visibility
 > > http://www.webMethods.com
 > > Chair, W3C Internationalization (I18N) Working Group
 > > Chair, W3C-I18N-WG, Web Services Task Force
 > > http://www.w3.org/International
 > >
 > > Internationalization is an architecture.
 > > It is not a feature.
 > >
 > > > -----Original Message-----
 > > > From: www-international-request@w3.org
 > > > [mailto:www-international-request@w3.org]On Behalf Of smj (by way
 > > > of Martin Duerst <duerst@w3.org>)
 > > > Sent: November 3, 2004 1:13
 > > > To: www-international@w3.org
 > > > Subject: What is a language detection algorithm?
 > > >
 > > >
 > > >
 > > >
 > > > What is a language detection algorithm? What does that mean and
 > > > how is it done?"
 > > >
 > > > Thanks.
 > > > <mailto:smj1@sndi.net>smj1@sndi.net
 > > >
 > >
 > >
 >
 >
 >
Received on Friday, 5 November 2004 23:20:29 GMT
