RE: What is a language detection algorithm?

At 07:30 AM 11/5/2004, Frank Yung-Fong Tang wrote:

>The charset detection stuff Phillips mentioned in Mozilla is not for
>language detection but for charset detection.
>
>A good article about 'language detection' you can read is
>Linguini: Language Identification for Multilingual Documents. Prager,
>John M. Journal of Management Information Systems, Winter 1999-2000,
>Vol. 16, No. 3, pp. 71-101.

The full text of a conference submission under the same title can be found 
here:

http://csdl.computer.org/comp/proceedings/hicss/1999/0001/02/00012035.PDF


>However, in that paper, the author concludes that the same method could
>be used for Asian languages, which use multibyte encodings. I disagree
>with that. The reason is that the only multibyte encodings he examines
>in that paper are
>
>Korean EUC-KR
>Japanese Shift-JIS
>Chinese Big5
>
>The encoding structures of these three are very different. Therefore,
>it is not hard to distinguish between them. However, once you consider
>the following, I believe it will be hard to distinguish between them:
>
>Chinese GB2312
>Chinese GBK
>Chinese GB18030
>Japanese EUC-JP

For 8-bit character sets, it has been my contention (unproven) that one
can use a language analyzer to determine the pair (language, encoding) in
one go. The big exception is English, as it is represented by a common
subset (ASCII) in all of these encodings. Therefore, English text with
only small admixtures of non-ASCII material would likely be misrecognized.
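
To make the contention concrete, here is a rough sketch (illustrative
Python; the class labels and profiles are hypothetical and would have to
be trained on real sample text) of treating each (language, encoding)
pair as one class and scoring byte pairs against it:

    from collections import Counter
    from math import log

    def bigram_profile(sample: bytes):
        """Relative byte-pair frequencies of a training sample."""
        pairs = Counter(zip(sample, sample[1:]))
        total = sum(pairs.values())
        return {p: n / total for p, n in pairs.items()}

    def score(text: bytes, profile, floor=1e-6):
        """Sum of log probabilities; unseen pairs get a small floor value."""
        return sum(log(profile.get(p, floor)) for p in zip(text, text[1:]))

    def detect(text: bytes, profiles):
        """profiles maps hypothetical classes such as ('fr', 'iso-8859-1')
        or ('ru', 'koi8-r') to a bigram_profile(); highest score wins."""
        return max(profiles, key=lambda cls: score(text, profiles[cls]))

Mostly-English text would score almost identically under every class that
shares the ASCII subset, which is exactly the failure mode above.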

I experimented with some n-gram systems myself a few years ago, but I
never could get the scoring to work well enough for automatic analysis of
documents of varying length. It was fine for manual evaluation, though.

If you print a 2-gram matrix using some way to color the cells based on
frequency, it becomes very apparent that the distribution of the colors is
very idiosyncratic. Enough so that you could add different permutations
into the mix, one for each encoding. The top/left part of the matrix,
corresponding to ASCII character pairs, would be the same (*independent of
the encoding used, but still significant for each Latin-based language*).
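
A toy version of that picture (illustrative Python only): count byte
pairs into a 256x256 matrix and print it as a coarse density map, where
rows and columns 0-127 form the ASCIIxASCII quadrant:

    def bigram_matrix(data: bytes):
        """Cell [i][j] counts how often byte value i is followed by byte j."""
        m = [[0] * 256 for _ in range(256)]
        for a, b in zip(data, data[1:]):
            m[a][b] += 1
        return m

    def print_density(m, step=8):
        """Collapse 8x8 blocks into one character so the matrix fits on a
        terminal; 'darker' characters mean more frequent pairs."""
        shades = " .:*#"
        peak = max(max(row) for row in m) or 1
        for i in range(0, 256, step):
            line = ""
            for j in range(0, 256, step):
                s = max(m[x][y] for x in range(i, i + step)
                                for y in range(j, j + step))
                line += shades[min(len(shades) - 1,
                                   int(len(shades) * s / peak))]
            print(line)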

The other 3/4 of the matrix represents pairs with non-ASCII bytes in them,
and their distribution depends on the encoding *and* the language. There
is usually enough information in the ASCIIxASCII part of the matrix to
allow one to guess the language with fair accuracy. That would serve as a
double check on any information gained from the other 3/4 of the matrix to
determine encoding and language.
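
The double check could be as simple as comparing the document's
ASCIIxASCII quadrant against per-language reference profiles (again only
a sketch; the reference profiles are hypothetical and would come from
training text):

    from collections import Counter

    def ascii_quadrant_profile(data: bytes):
        """Relative frequencies of byte pairs where both bytes are ASCII."""
        pairs = Counter((a, b) for a, b in zip(data, data[1:])
                        if a < 128 and b < 128)
        total = sum(pairs.values()) or 1
        return {p: n / total for p, n in pairs.items()}

    def closest_language(data: bytes, references):
        """references: hypothetical {language: profile} built the same way
        from training text; the smallest summed difference wins."""
        doc = ascii_quadrant_profile(data)
        def distance(lang):
            ref = references[lang]
            return sum(abs(doc.get(k, 0.0) - ref.get(k, 0.0))
                       for k in set(doc) | set(ref))
        return min(references, key=distance)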

In fact, I found that I could determine Western European languages fairly
reliably from merely recognizing A-Z (one class for each case pair),
delimiters (lumping space and punctuation into one class) and 'accented
characters' (lumping all non-ASCII bytes into a single class). For all
8-bit character sets based on ASCII, such classes are independent of the
encoding. Determining the encoding, once the language is known, merely
required a simple frequency analysis (not even pairs) of the non-ASCII
bytes.
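
In code, that class folding is only a handful of lines (a rough sketch of
the idea in Python, purely for illustration):

    from collections import Counter

    def byte_class(b: int) -> str:
        """Fold each byte into a coarse, encoding-independent class:
        one class per letter case pair, one for ASCII delimiters/digits,
        one catch-all 'accented' class for every non-ASCII byte."""
        if 65 <= b <= 90 or 97 <= b <= 122:   # A-Z, a-z
            return chr(b).lower()
        if b < 128:                           # space, punctuation, digits
            return "_"
        return "@"                            # any non-ASCII byte

    def class_bigrams(data: bytes):
        """Language signal: pair statistics over the folded classes."""
        classes = [byte_class(b) for b in data]
        return Counter(zip(classes, classes[1:]))

    def non_ascii_frequencies(data: bytes):
        """Encoding signal, once the language is known: a plain frequency
        table (no pairs needed) of the bytes above 0x7F."""
        return Counter(b for b in data if b >= 128)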

If you go to n-grams of n > 2, the peaks and valleys are going to be more
pronounced, or rather the pattern of colors if you color the cells of your
(hyper-)cube. In fact, your hypercube will look like a galaxy: lots of
empty space (impossible or rare combinations), a few pinpricks of bright
light (the common stuff: "ion", "ing", etc.), and perhaps some wisps of
random coincidences. I'm not surprised to see that they got good results
with DBCS using n-grams - however, as you imply below, you had better have
a way to efficiently handle the empty space.
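
One simple way to handle the empty space (a sketch, using nothing fancier
than a hash table) is to key the table by the n-grams actually observed
instead of allocating the dense cube:

    from collections import Counter

    def sparse_ngrams(data: bytes, n: int = 3):
        """Store only the n-grams that occur: a hash table keyed by the
        observed byte sequences instead of a dense 256**n hypercube."""
        return Counter(data[i:i + n] for i in range(len(data) - n + 1))

    def occupancy(counts, n: int) -> float:
        """Fraction of the full byte-space hypercube actually populated;
        for real text this is tiny, which is the 'galaxy' effect."""
        return len(counts) / 256 ** n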

Note that their approach used n-grams in byte space: a 4-gram would be
just a pair of DBCS characters, and a 2-gram would effectively be a
frequency table.
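
A tiny illustration (assuming a character-aligned EUC-JP sample; the text
itself is arbitrary) of how a byte 4-gram lines up with a pair of DBCS
characters:

    # Most kana/kanji take two bytes in EUC-JP, so a character-aligned
    # byte 4-gram decodes to exactly two characters.
    text = "日本語のテキスト"              # arbitrary sample text
    data = text.encode("euc-jp")

    byte_4grams = [data[i:i + 4] for i in range(0, len(data) - 3, 2)]
    char_2grams = [text[i:i + 2] for i in range(len(text) - 1)]

    for b4, c2 in zip(byte_4grams, char_2grams):
        assert b4.decode("euc-jp") == c2  # each 4-gram is a character pair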

Encoding schemes where there is a large overlap, so that few characters,
if any, are differentiated by the encoding, are the most difficult to
discriminate - but that's true whether you use a language recognizer or
some other method.

(more..)

>The other non-free language detection implementation you may find is
>from Alis. Netscape 6.0-6.1 (I don't remember whether we used it for 6.2
>or not) used the detector from Alis. As I understand it, the root of
>Alis's work is from the University of Montreal, and it probably also
>uses the N-gram model.
>
>Addison Phillips [wM] wrote on 11/3/2004, 9:20 PM:
>
>  >
>  > (I'm making an assumption here about what you mean: hopefully it will
>  > answer your question)
>  >
>  > A language detection algorithm is a piece of software that attempts to
>  > infer the language of some textual content by examining the content
>  > alone. Generally this is necessary when one wishes to perform
>  > language-affected operations on some text and needs to know the
>  > language of the material and the information is not available from the
>  > content metadata.
>  >
>  > Examples of content metadata would include HTTP's Content-Language
>  > header, HTML's <meta> tag, XHTML's lang attribute, XML's xml:lang, and
>  > so forth. The best policy in language detection is avoidance: content
>  > should use the various metadata mechanisms available to clearly
>  > identify the language of content in order to avoid the need for
>  > language detection.
>  >
>  > In the absence of this information, certain kinds of processing may be
>  > difficult. For example, searching keywords in text requires splitting
>  > the text up into words. Some languages require special handling in
>  > order to do this. Or deciding what dictionary to use in spell-checking
>  > would be another example.
>  >
>  > In the pre-Unicode era (and, to the extent that legacy encodings are
>  > still used to encode content), it was sometimes possible to infer
>  > some information about the language or range of possible languages
>  > from the character encoding of the content. For example, the EUC-JP
>  > encoding encodes Japanese characters and is most likely to be used to
>  > encode Japanese language text (never mind that you can encode
>  > perfectly good Russian or English with it!). Other encodings are more
>  > difficult to infer from (for example, ISO 8859-1 aka Latin-1 is widely
>  > used to encode text in several dozen languages, but it is unlikely,
>  > for example, that a Latin-1 document is written in, say, Korean). And
>  > of course Unicode encodings such as UTF-8 by themselves convey no
>  > information at all about the language of the content.
>  >
>  > Absent a hint from the encoding, most LDAs use techniques such as the
>  > relative frequency of different characters in the content. It is
>  > possible to create quite good (but never perfect) language detectors,
>  > given a sufficient amount of content to scan. Given some knowledge of
>  > the text being scanned, you can improve the accuracy of your algorithm
>  > (for example, if you know that all of the documents are French,
>  > German, or Icelandic, you can ignore other possibilities or apply
>  > shortcuts such as using "stop lists" of common words or scanning for
>  > characters unique to each of these languages).
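
(As an aside from me: with a candidate set that small, the stop-list
shortcut can be almost trivial. A token sketch, with deliberately tiny
word lists standing in for real stop lists:)

    # Token sketch of the stop-list shortcut for a small candidate set.
    STOP_WORDS = {
        "fr": {"le", "la", "les", "et", "de", "un", "une"},
        "de": {"der", "die", "das", "und", "ein", "eine"},
        "is": {"og", "að", "er", "það", "ekki", "við"},
    }

    def guess_by_stop_words(text: str) -> str:
        """Return the language whose stop words occur most often."""
        words = text.lower().split()
        hits = {lang: sum(w in sw for w in words)
                for lang, sw in STOP_WORDS.items()}
        return max(hits, key=hits.get)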
>  >
>  > Perversely, the most well-known open-source LDA is probably the one
>  > described here:
>  >
>  > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>  >
>  > As the URI implies, the goal of that particular LDA is to try and
>  > determine the character encoding used by scanning text for relative
>  > frequency of characters (expressed in this case as byte sequences)
>  > based on statistical frequency in documents in a particular range of
>  > languages.
>
>The N-gram model works for European languages, but is not really
>practical for Asian languages. This one, architected by me and
>implemented by Shanjian Li, specifically focuses on how to address issues
>between Asian languages. However, since we targeted our implementation at
>'client' usage, we optimized the memory usage, which traded off a lot of
>accuracy. If someone asked me to reimplement it for the server side, I
>would do a much better job :)

Can you elaborate?

While the number of possible n-grams is much larger if you consider
n-grams of *characters*, the number of meaningful n-grams is related to
the number of words (for larger n) in the target language. Now the number
of words (including compound words) is not that different between
languages. For recognition you ignore the infrequent ones anyway, so that
gives you several thousand, and I would imagine that to be fairly constant
across the languages of interest.
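
That truncation is really all there is to it; a sketch (with the cut-off
of a few thousand chosen arbitrarily):

    from collections import Counter

    def char_ngram_profile(text: str, n: int = 3, keep: int = 3000):
        """Character n-gram counts truncated to the most frequent few
        thousand; the infrequent tail is ignored, as discussed above."""
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return dict(grams.most_common(keep))

Whatever the language, the surviving few thousand entries end up roughly
comparable in number.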

Now languages with large scripts tend to use fewer characters per word, so
one would use shorter n-grams. What these guys seem to have done was to
use the same number of bytes, which effectively kept the information
content per n-gram roughly steady across languages (that is, if you use
8-bit and DBCS encodings, but not if you use Unicode).

But you have actually implemented this, while I'm just speculating, so
you'll set me straight, no doubt.

A./

Received on Tuesday, 9 November 2004 06:56:07 UTC