Re: What is a language detection algorithm?

smj wrote on 11/5/2004, 4:52 PM:

 > Thank you Frank. You have been very informative.
 >
 > According to http://babelfish.altavista.com/tr
 > "hello" in English means "こんにちは" in Japanese.
 > and the same "こんにちは" in Japanese means "Today" in English.
 >
 > That is nice. But it is not what I am after.
 >
 > I simply wanted to know if I typed in, or pasted from a copy-n-paste, or
 > programmed in, the letter (or glyph or whatever it is) "こ"
 > what language it is in.
 >
 > Like this: "こ" = Japanese Shift-JIS, or Japanese EUC-JP, or whatever
 > it is.
 >
 > Are there reference tables that could help with this?

Not really. I know what you WANT is language detection, and I understand 
you don't care much about the encoding. But since every character needs 
to be encoded in one form or another, character set detection gets mixed 
up with language detection whenever you are not sure what the character 
encoding is. If you know what the encoding is (for example, always 
UTF-8, or always accompanied by a reliable label), then the encoding 
issue can be ignored. Otherwise, most N-Gram based language detectors 
will be fooled by the EUC-based encodings (GB2312, EUC-JP, EUC-TW and 
EUC-KR).

For example: the 10 bytes used to encode こんにちは in Japanese EUC-JP 
could also be read as 10 bytes of Korean in EUC-KR, namely ㅃㆃㅛㅑㅟ (a 
nonsense combination of 5 Korean characters). Unless you already know 
whether the text is EUC-JP or EUC-KR, you have to assume it could be 
either and figure out which one is more likely. So a lot of the time, 
character set detection stands in the way of accurate language 
identification. At the same time, if the text does not come in a fixed 
encoding or with metadata that labels the encoding, there is no easy way 
to identify the charset without using some information from the language 
level. Also, markup sometimes generates a lot of noise for the language 
detection module, and you may want to filter out that markup, which is 
English (or machine) oriented.
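
To make that concrete, here is a small Python sketch (assuming Python's 
built-in euc-jp and euc-kr codecs, which follow the standard mappings); 
the bytes are just whatever the codec produces for こんにちは:

    # The 10 bytes that encode こんにちは in EUC-JP are also a perfectly
    # "valid" EUC-KR byte sequence; only the interpretation differs.
    data = "こんにちは".encode("euc-jp")

    print(len(data))                  # 10 bytes
    print(data.decode("euc-jp"))      # こんにちは (Japanese)
    print(data.decode("euc-kr"))      # ㅃㆃㅛㅑㅟ (nonsense Korean jamo)

No byte-level check can reject either reading; you have to ask which 
interpretation is more likely, and that is already a language-level 
question.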

If you can reliably know the character encoding of your data, then I 
would say an N-Gram approach is probably good enough. The other thing 
you may want to look at is the entropy rate: text in different languages 
has different entropy rates, and that may help to identify the language 
(or not). I am thinking about doing deeper research in this area.
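
In case it helps, here is a rough sketch of what I mean by an N-Gram 
approach plus an entropy estimate, assuming the text is already decoded 
(the tiny training strings are placeholders; real profiles need real 
corpora):

    import math
    from collections import Counter

    def ngrams(text, n=2):
        text = " " + text.lower() + " "
        return [text[i:i+n] for i in range(len(text) - n + 1)]

    def profile(text, n=2):
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    def score(text, prof, n=2):
        # log-likelihood under the profile, small floor for unseen n-grams
        return sum(math.log(prof.get(g, 1e-6)) for g in ngrams(text, n))

    def entropy_rate(text, n=2):
        # empirical per-n-gram entropy in bits
        return -sum(p * math.log2(p) for p in profile(text, n).values())

    training = {
        "en": "the quick brown fox jumps over the lazy dog and runs away",
        "fr": "le renard brun saute par dessus le chien paresseux et il part",
    }
    profiles = {lang: profile(t) for lang, t in training.items()}

    unknown = "the dog runs over the fox"
    best = max(profiles, key=lambda lang: score(unknown, profiles[lang]))
    print(best, entropy_rate(unknown))

Again, this only works once you can trust the decoding step.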

BTW, I just found out that someone posted a Java port of the Mozilla 
algorithm that Shanjian and I implemented, under 
http://www.i18nfaq.com/chardet.html

I am thinking about working on a third generation of that algorithm (the 
first generation is the Parallel State Machine I implemented; the 2nd 
generation is the one Shanjian implemented, based on my idea, in 2001). 
However, I probably need to first find an i18n job in the Fairfax, VA 
area. :) Let me know if you know of any company or R&D org in the DC area 
interested in sponsoring the development of an open source based (or not) 
character set and/or language identification algorithm...

 >
 > James
 > smj1@sndi.net
 >
 >
 >
 > ----- Original Message -----
 > From: "Frank Yung-Fong Tang" <ytang0648@aol.com>
 > To: <aphillips@webmethods.com>
 > Cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>;
 > <www-international@w3.org>
 > Sent: Friday, November 05, 2004 10:30 AM
 > Subject: RE: What is a language detection algorithm?
 >
 >
 > The charset detection stuff Phillips mentioned in Mozilla is not for
 > language detection but for charset detection.
 >
 > A good article you can read about 'language detection' is
 > Linguini: Language Identification for Multilingual Documents, Prager,
 > John M. Journal of Management Information Systems, Winter 1999-2000,
 > Vol. 16, No. 3, pp. 71-101.
 >
 > However, in that paper, the author concludes that the same method can be
 > used for Asian languages, which use multibyte encodings. I disagree with
 > that. The reason is that the only multibyte encodings he examined for
 > that paper are
 >
 > Korean EUC-KR
 > Japanese Shift-JIS
 > Chinese Big5
 >
 > The encoding structures of these three are very different.
 > Therefore, it is not hard to distinguish between them. However,
 > once you consider the following, I believe it will be hard to detect
 > between them:
 >
 > Chinese GB2312
 > Chinese GBK
 > Chinese GB18030
 > Japanese EUC-JP

I forgot to add:
Chinese EUC-TW
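
To see why these are hard to tell apart: setting aside GBK/GB18030's 
extended byte ranges and EUC-TW's plane 2, they all share the same basic 
two-byte structure (lead and trail bytes in 0xA1-0xFE), so a purely 
structural check passes for all of them. A minimal sketch:

    # This only checks structure, not meaning; it says "yes" to EUC-JP,
    # EUC-KR and GB2312 text alike, which is exactly the problem.
    def looks_like_two_byte_euc(data: bytes) -> bool:
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                      # plain ASCII byte
                i += 1
            elif 0xA1 <= b <= 0xFE:           # lead byte; trail must be 0xA1-0xFE too
                if i + 1 >= len(data) or not (0xA1 <= data[i + 1] <= 0xFE):
                    return False
                i += 2
            else:
                return False
        return True

    for enc, text in [("euc-jp", "こんにちは"), ("euc-kr", "안녕하세요"), ("gb2312", "你好")]:
        print(enc, looks_like_two_byte_euc(text.encode(enc)))   # True for each

That is very different from Shift-JIS, Big5 and EUC-KR taken together, 
whose structures differ enough for a structural detector to separate them.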

 >
 > The other, non-free, language detection implementation you may find is
 > from Alis. Netscape 6.0-6.1 (I don't remember whether we used it for 6.2
 > or not) used the detector from Alis. As I understand it, the root of
 > Alis's work is from the University of Montreal and it probably also
 > uses the N-gram model.
 >
 > Addison Phillips [wM] wrote on 11/3/2004, 9:20 PM:
 >
 > >
 > > (I'm making an assumption here about what you mean: hopefully it will
 > > answer your question)
 > >
 > > A language detection algorithm is a piece of software that attempts to
 > > infer the language of some textual content by examining the content
 > > alone. Generally this is necessary when one wishes to perform
 > > language-affected operations on some text and needs to know the
 > > language of the material and the information is not available from the
 > > content metadata.
 > >
 > > Examples of content metadata would include HTTP's Content-Language
 > > header, HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and
 > > so forth. The best policy in language detection is avoidance: content
 > > should use the various metadata mechanisms available to clearly
 > > identify the language of content in order to avoid the need for
 > > language detection.
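
Addison's point about avoidance is worth making concrete: check for 
declared metadata first, and only fall back to a detector when nothing is 
declared. A naive Python sketch (the header and attribute names are the 
standard ones he lists; the parsing here is deliberately simplistic):

    import re

    def declared_language(http_headers, body):
        # 1. HTTP Content-Language header
        lang = http_headers.get("Content-Language")
        if lang:
            return lang
        # 2. <meta http-equiv="Content-Language" content="..."> in HTML
        m = re.search(r'<meta[^>]+content-language[^>]+content="([^"]+)"',
                      body, re.I)
        if m:
            return m.group(1)
        # 3. lang / xml:lang attributes
        m = re.search(r'(?:xml:)?lang="([^"]+)"', body, re.I)
        if m:
            return m.group(1)
        return None   # only now do you need a detection algorithm

    print(declared_language({}, '<html xml:lang="ja">こんにちは</html>'))  # ja
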
 > >
 > > In the absence of this information, certain kinds of processing may be
 > > difficult. For example, searching keywords in text requires splitting
 > > the text up into words. Some languages require special handling in
 > > order to do this. Or deciding what dictionary to use in spell-checking
 > > would be another example.
 > >
 > > In the pre-Unicode era (and, to the extent that legacy encodings are
 > > still used to encode content), it was sometimes possible to infer
 > > some information about the language or range of possible languages
 > > from the character encoding of the content. For example, the EUC-JP
 > > encoding encodes Japanese characters and is most likely to be used to
 > > encode Japanese language text (never mind that you can encode
 > > perfectly good Russian or English with it!). Other encodings are more
 > > difficult to infer from (for example, ISO 8859-1 aka Latin-1 is widely
 > > used to encode text in several dozen languages, but it is unlikely,
 > > for example, that a Latin-1 document is written in, say, Korean). And
 > > of course Unicode encodings such as UTF-8 by themselves convey no
 > > information at all about the language of the content.
 > >
 > > Absent a hint from the encoding, most LDAs use techniques such as the
 > > relative frequency of different characters in the content. It is
 > > possible to create quite good (but never perfect) language detectors,
 > > given a sufficient amount of content to scan. Given some knowledge of
 > > the text being scanned, you can improve the accuracy of your algorithm
 > > (for example, if you know that all of the documents are French,
 > > German, or Icelandic, you can ignore other possibilities or apply
 > > shortcuts such as using "stop lists" of common words or scanning for
 > > characters unique to each of these languages).
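
A quick sketch of that kind of shortcut, assuming you already know the 
text is French, German, or Icelandic (the stop lists and marker characters 
below are illustrative, not complete):

    STOP_WORDS = {
        "fr": {"le", "la", "les", "et", "est", "dans"},
        "de": {"der", "die", "das", "und", "ist", "nicht"},
        "is": {"og", "að", "er", "ekki", "það", "sem"},
    }
    MARKER_CHARS = {
        "fr": set("œàâçèéêîïùû"),
        "de": set("äüß"),
        "is": set("ðþæ"),
    }

    def guess(text):
        words = text.lower().split()
        scores = {}
        for lang in STOP_WORDS:
            scores[lang] = sum(w in STOP_WORDS[lang] for w in words)
            scores[lang] += sum(ch in MARKER_CHARS[lang] for ch in text.lower())
        return max(scores, key=scores.get)

    print(guess("það er ekki gott"))        # is
    print(guess("die Katze ist nicht da"))  # de

Of course this falls apart as soon as a document outside the expected set 
shows up, which is why the general N-Gram statistics are still needed.
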
 > >
 > > Perversely, the most well-known open-source LDA is probably the one
 > > described here:
 > >
 > > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 > >
 > > As the URI implies, the goal of that particular LDA is to try and
 > > determine the character encoding used by scanning text for relative
 > > frequency of characters (expressed in this case as byte sequences)
 > > based on statistical frequency in documents in a particular range of
 > > languages.
 >
 > The N-Gram model works for European languages, but is not really practical
 > for Asian languages. This one, architected by me and implemented by
 > Shanjian Li, is specifically focused on how to address the issues between
 > Asian languages. However, since we targeted our implementation at 'client'
 > usage, we optimized the memory usage, which traded off a lot of accuracy.
 > If someone asked me to reimplement it for the server side, I would do a
 > much better job :)
 >
 > >
 > > Hope this helps.
 > >
 > > Best Regards,
 > >
 > > Addison
 > >
 > > Addison P. Phillips
 > > Director, Globalization Architecture
 > > webMethods | Delivering Global Business Visibility
 > > http://www.webMethods.com
 > > Chair, W3C Internationalization (I18N) Working Group
 > > Chair, W3C-I18N-WG, Web Services Task Force
 > > http://www.w3.org/International
 > >
 > > Internationalization is an architecture.
 > > It is not a feature.
 > >
 > > > -----Original Message-----
 > > > From: www-international-request@w3.org
 > > > [mailto:www-international-request@w3.org]On Behalf Of smj (by way
 > > > of Martin Duerst <duerst@w3.org>)
 > > > Sent: 2004年11月3日 1:13
 > > > To: www-international@w3.org
 > > > Subject: What is a language detection algorithm?
 > > >
 > > >
 > > >
 > > >
 > > > What is a language detection algorithm? What does that mean and
 > > > how is it done?
 > > >
 > > > Thanks.
 > > > <mailto:smj1@sndi.net>smj1@sndi.net
 > > >
 > >
 > >
 >
 >
 >

Received on Friday, 5 November 2004 23:15:38 UTC