- From: Frank Yung-Fong Tang <ytang0648@aol.com>
- Date: Fri, 5 Nov 2004 18:19:45 -0500
- To: smj <smj1@sndi.net>, www-international@w3.org
- cc: kuro@bhlab.com
BTW, the Basis Rosette Language Identifier (see
http://www.basistech.com/language-identification/ ) may fulfill what you
want. I have not tried it myself. Maybe Kuro-san [KUROSAKA Teruhiko
<kuro@bhlab.com>] can help you with that.

smj wrote on 11/5/2004, 4:52 PM:

> Thank you Frank. You have been very informative.
>
> According to http://babelfish.altavista.com/tr,
> "hello" in English means "こんにちは" in Japanese,
> and the same "こんにちは" in Japanese means "Today" in English.
>
> That is nice. But it is not what I am after.
>
> I simply wanted to know: if I typed in, pasted in, or programmed in
> the letter (or glyph, or whatever it is) "こ", what language is it in?
>
> Like this: "こ" = Japanese Shift-JIS, or Japanese EUC-JP, or whatever
> it is.
>
> Are there reference tables that could help with this?
>
> James
> smj1@sndi.net
>
> ----- Original Message -----
> From: "Frank Yung-Fong Tang" <ytang0648@aol.com>
> To: <aphillips@webmethods.com>
> Cc: "smj (by way of Martin Duerst <duerst@w3.org>)" <smj1@sndi.net>;
> <www-international@w3.org>
> Sent: Friday, November 05, 2004 10:30 AM
> Subject: RE: What is a language detection algorithm?
>
> The charset detection code in Mozilla that Phillips mentioned is for
> charset detection, not language detection.
>
> A good article about 'language detection' to read is:
> Prager, John M. "Linguini: Language Identification for Multilingual
> Documents." Journal of Management Information Systems, Winter
> 1999-2000, Vol. 16, No. 3, pp. 71-101.
>
> In that paper, however, the author concludes that the same method can
> be used for the Asian languages, which use multibyte encodings. I
> disagree, because the only multibyte encodings he examines in the
> paper are:
>
> Korean EUC-KR
> Japanese Shift-JIS
> Chinese Big5
>
> The encoding structures of these three are very different, so it is
> not hard to distinguish between them. However, once you add the
> following to the mix, I believe it becomes hard to tell them apart:
>
> Chinese GB2312
> Chinese GBK
> Chinese GB18030
> Japanese EUC-JP
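To make that concrete, here is a rough sketch of how far byte structure
alone can take you. It is illustrative only, not the Mozilla detector or
any shipping code, and the byte ranges are simplified from the published
encoding tables (it ignores, for example, EUC-JP's 0x8E/0x8F single
shifts and Shift-JIS's single-byte half-width katakana):

```python
# Illustrative sketch: which multibyte encodings remain structurally
# possible for a byte string, judged by lead/trail byte ranges alone.
def possible_encodings(data: bytes) -> set:
    candidates = {"Shift_JIS", "EUC-JP", "EUC-KR", "GB2312", "GBK", "Big5"}
    i = 0
    while i < len(data) - 1:
        lead = data[i]
        if lead < 0x80:                 # ASCII: no evidence either way
            i += 1
            continue
        trail = data[i + 1]
        if 0x81 <= lead <= 0x9F:
            # Lead bytes below 0xA1 occur in Shift_JIS (and GBK), but
            # never in EUC-KR, EUC-JP, GB2312, or standard Big5.
            candidates &= {"Shift_JIS", "GBK"}
        elif lead >= 0xA1 and 0x40 <= trail <= 0x7E:
            # EUC-family trail bytes are always >= 0xA1, so a low trail
            # byte rules out the EUC encodings -- but GBK and Big5 both
            # allow it, which is exactly the problem.
            candidates -= {"EUC-JP", "EUC-KR", "GB2312"}
        i += 2
    return candidates

# "こ" in Shift-JIS is 0x82 0xB1: the 0x82 lead rules out the whole
# EUC family and Big5 (but notice that GBK survives).
print(possible_encodings("こんにちは".encode("shift_jis")))
# In EUC-JP it is 0xA4 0xB3: every candidate survives.
print(possible_encodings("こんにちは".encode("euc_jp")))
```

With only EUC-KR, Shift-JIS, and Big5 in the candidate set, checks like
these settle most documents quickly. Once GBK (and its superset GB18030)
joins, the structural tests stop separating the Chinese and Japanese
candidates, and you are back to statistics.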
> The other, non-free language detection implementation you may find is
> from Alis. Netscape 6.0-6.1 used the detector from Alis (I don't
> remember whether we used it for 6.2 or not). As I understand it, the
> root of Alis's work comes from the University of Montreal, and it
> probably also uses the N-gram model.
>
> Addison Phillips [wM] wrote on 11/3/2004, 9:20 PM:
>
> > (I'm making an assumption here about what you mean; hopefully it
> > will answer your question.)
> >
> > A language detection algorithm is a piece of software that attempts
> > to infer the language of some textual content by examining the
> > content alone. Generally this is necessary when one wishes to
> > perform language-affected operations on some text and needs to know
> > the language of the material, but the information is not available
> > from the content metadata.
> >
> > Examples of content metadata include HTTP's Content-Language header,
> > HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and so
> > forth. The best policy in language detection is avoidance: content
> > should use the various metadata mechanisms available to clearly
> > identify the language of content, in order to avoid the need for
> > language detection.
> >
> > In the absence of this information, certain kinds of processing may
> > be difficult. For example, searching for keywords in text requires
> > splitting the text up into words, and some languages require special
> > handling to do this. Deciding which dictionary to use in
> > spell-checking would be another example.
> >
> > In the pre-Unicode era (and, to the extent that legacy encodings are
> > still used to encode content), it was sometimes possible to infer
> > some information about the language, or the range of possible
> > languages, from the character encoding of the content. For example,
> > the EUC-JP encoding encodes Japanese characters and is most likely
> > to be used for Japanese-language text (never mind that you can
> > encode perfectly good Russian or English with it!). Other encodings
> > are more difficult to infer from: ISO 8859-1, aka Latin-1, is widely
> > used to encode text in several dozen languages, although it is
> > unlikely that a Latin-1 document is written in, say, Korean. And of
> > course Unicode encodings such as UTF-8 by themselves convey no
> > information at all about the language of the content.
> >
> > Absent a hint from the encoding, most LDAs use techniques such as
> > the relative frequency of different characters in the content. It is
> > possible to create quite good (but never perfect) language
> > detectors, given a sufficient amount of content to scan. Given some
> > knowledge of the text being scanned, you can improve the accuracy of
> > your algorithm: for example, if you know that all of the documents
> > are French, German, or Icelandic, you can ignore other possibilities
> > or apply shortcuts, such as using "stop lists" of common words or
> > scanning for characters unique to each of those languages.
> >
> > Perversely, the most well-known open-source LDA is probably the one
> > described here:
> >
> > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
> >
> > As the URI implies, the goal of that particular LDA is to try to
> > determine the character encoding in use, by scanning text for the
> > relative frequency of characters (expressed in this case as byte
> > sequences), based on their statistical frequency in documents in a
> > particular range of languages.
>
> The N-gram model works for European languages, but it is not really
> practical for Asian languages. That detector, architected by me and
> implemented by Shanjian Li, focuses specifically on how to address
> the issues between Asian languages. However, since we targeted our
> implementation at 'client' usage, we optimized memory usage, which
> traded off a lot of accuracy. If someone asked me to reimplement it
> for the server side, I would do a much better job :)
>
> > Hope this helps.
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > webMethods | Delivering Global Business Visibility
> > http://www.webMethods.com
> > Chair, W3C Internationalization (I18N) Working Group
> > Chair, W3C-I18N-WG, Web Services Task Force
> > http://www.w3.org/International
> >
> > Internationalization is an architecture.
> > It is not a feature.
> >
> > > -----Original Message-----
> > > From: www-international-request@w3.org
> > > [mailto:www-international-request@w3.org] On Behalf Of smj (by way
> > > of Martin Duerst <duerst@w3.org>)
> > > Sent: November 3, 2004 1:13
> > > To: www-international@w3.org
> > > Subject: What is a language detection algorithm?
> > >
> > > What is a language detection algorithm? What does that mean, and
> > > how is it done?
> > >
> > > Thanks.
> > > smj1@sndi.net
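P.S. Since the N-gram model keeps coming up, here is a toy illustration
of the idea. This is not the Alis code or the Mozilla code; the three
"training corpora" are single made-up sentences, whereas a real detector
trains on large amounts of text per language:

```python
# Toy character-bigram language identifier, illustrating the N-gram
# idea only. The training "corpora" are single made-up sentences; a
# real detector trains on large amounts of text per language.
from collections import Counter

def ngram_profile(text: str, n: int = 2) -> Counter:
    """Count the character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog",
    "fr": "le vif renard brun saute par dessus le chien paresseux",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

def guess_language(text: str) -> str:
    """Return the language whose profile shares the most bigram mass."""
    probe = ngram_profile(text)
    def score(lang: str) -> int:
        profile = PROFILES[lang]
        return sum(min(count, profile[gram]) for gram, count in probe.items())
    return max(PROFILES, key=score)

print(guess_language("the dog jumps over"))  # -> en, on this toy data
print(guess_language("le chien saute"))      # -> fr, on this toy data
```

Run the same idea over raw bytes instead of characters and you can see
the problem for the multibyte Asian encodings: the byte bigrams then
reflect the encoding at least as much as the language.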
Received on Friday, 5 November 2004 23:20:29 UTC