RE: auto-detecting the character encoding of an uploaded file

Thanks, Bob, for the reference to your team's upcoming paper (and hi, by the
way).

Most of this discussion has focused on subtle differences between
encodings: which code points are legal in each, and which byte patterns
are legal.  There is another approach, though, which is much more
effective and gives you valuable additional information to boot.  It is
a system based on relative byte and n-gram frequencies, which show
characteristic patterns for a given pair of language and encoding.  In
English, for example, "e", "th", "tion", and " the " are quite common,
whereas in Spanish "os ", "ción", and " un " are quite common.  In
different encodings, say Windows-1252, MacRoman, UTF-8, UTF-16BE, and
UTF-16LE, these patterns translate into different relative byte and
n-gram frequencies.
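
To make that concrete, here is a tiny Python sketch (mine, purely
illustrative, not taken from any of the systems mentioned below) that
prints the byte bigrams the same short Spanish fragment produces under
each of those encodings:

    # Illustrative only: the same text produces different byte sequences,
    # and therefore different byte bigram frequencies, in each encoding.
    text = "ción y un poco más"   # made-up Spanish fragment

    for enc in ("windows-1252", "mac-roman", "utf-8", "utf-16-be", "utf-16-le"):
        data = text.encode(enc)
        bigrams = [data[i:i + 2].hex() for i in range(len(data) - 1)]
        print(f"{enc:12}  first bytes: {data[:8].hex(' ')}  first bigrams: {bigrams[:3]}")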

The way the system works is that a development version is first exposed
to a reasonable corpus of a particular language in a particular
encoding.  For some reason, 100K words seems to stand out in my mind.
The system then empirically calculates the n-gram frequencies for this
corpus and stores the salient ones in a table.  The amazing thing is
that this profiling step is fast and accurate, and it requires zero
intervention by encoding specialists or linguists.  That table is
packaged up with the tables for the other languages and encodings
previously profiled, and the whole set is shipped with the
auto-detection software, which is normally bundled with a browser,
search engine, word processor, etc.  In actual use, a passage of user
text has *its* n-gram frequencies calculated on the fly, these are
compared to the stored profiles, and the profile that most closely
matches the sample wins.  In practice, the technique is surprisingly
accurate and works on surprisingly short samples.  You can usually
determine the language and encoding with near certainty from a line or
two of text.
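
In rough Python, the whole pipeline looks something like this.  This is
my own sketch of the general technique, not the actual LinguistX, Word,
or Mozilla code; the trigram size, the 300-entry profile, the toy
training strings, and the rank-based distance measure are all
illustrative choices.

    from collections import Counter

    N, TOP = 3, 300   # byte trigrams; keep the 300 most frequent per profile

    def profile(data: bytes) -> dict:
        """Map each of the most frequent byte n-grams to its frequency rank."""
        counts = Counter(data[i:i + N] for i in range(len(data) - N + 1))
        ranked = counts.most_common(TOP)
        return {gram: rank for rank, (gram, _) in enumerate(ranked)}

    def distance(sample: dict, stored: dict) -> int:
        """Sum of rank differences; grams missing from a profile get the worst rank."""
        return sum(abs(rank - stored.get(gram, TOP)) for gram, rank in sample.items())

    # Training step, done once offline.  A real system would use a large
    # corpus -- on the order of 100K words -- per language/encoding pair.
    training = {
        ("en", "windows-1252"): "the theory is that the patterns will show up",
        ("es", "utf-8"): "la canción que el niño escuchó un día en la ciudad",
    }
    profiles = {pair: profile(text.encode(pair[1])) for pair, text in training.items()}

    def detect(sample: bytes):
        """Return the (language, encoding) pair whose stored profile is closest."""
        sp = profile(sample)
        return min(profiles, key=lambda pair: distance(sp, profiles[pair]))

    print(detect("la canción de un niño".encode("utf-8")))   # expected: ('es', 'utf-8')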

To the best of my knowledge, this approach was first proposed and developed
by Ken Beesley in the late 80s while he was at ALP Systems.  Here is a
reference to that early work:

http://www.xrce.xerox.com/people/beesley/langid.html

Ken subsequently joined Xerox PARC and then XRCE in Grenoble, where the work
was picked up.  It was later commercialized by Xerox's spin-off InXight as
part of their LinguistX Platform product:

http://www.inxight.com/products_sp/linguistx/index.html

I know that Microsoft has developed a similar technology, which is shown off
quite well in their multilingual spelling checking in Word.  However, I
don't think it is available to developers outside of Microsoft.  Inso also
had a competitive technology, which they sold to Lernout & Hauspie.  It is
now called the IntelliScope Language Recognizer, and it is part of
their IntelliScope Retrieval Toolkit, described here:

http://www.lhsl.com/tech/icm/retrieval/toolkit/lr.asp

I can't tell from your brief description, Bob, if n-gram frequencies (under
a different name) are part of your Mozilla work or not.  If they're not,
they should be.  :-)

The bottom line, folks, is that there are much better technologies
available for automatically detecting encodings, and they come with the
tremendous additional benefit of identifying the language as well.  We
can all imagine lots of ways we could use that information.  Maybe some
of you will start sniffing down a different trail for a solution
here....

Merle

> -----Original Message-----
> From: bobj@netscape.com [mailto:bobj@netscape.com]
> Sent: Thursday, September 06, 2001 8:11 PM
> To: Martin Duerst
> Cc: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail); Shanjian Li;
> momoi@netscape.com
> Subject: Re: auto-detecting the character encoding of an uploaded file
> 
> 
> FYI, there will be a paper presented at the Nineteenth International
> Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in
> San Jose, California:
> 
>     A Composite Approach to Language/Encoding Detection
>        by Shanjian Li & Katsuhiko Momoi - Netscape Communications
> 
>     Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm
> 
>     Abstract: http://www.unicode.org/iuc/iuc19/a322.html
> 
> And since this is part of Mozilla, it is all open source!
>     
> http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
> 
> -Bob
> 
> Martin Duerst wrote:
> 
> > At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
> >
> >> Lenny ,
> >>
> >> Just some thoughts.
> >>
> >> Since you have mentioned Shift-JIS,
> >
> >
> > As a charset, spelled shift_jis (case doesn't matter, but the 
> > underscore does).
> >
> >
> >> there is no guarantee that every other
> >> byte in UTF-16 is zero especially for non-us systems like
> >> Japanese/European.
> >
> >
> > No. But if you see even a single zero byte, then the chance that the
> > document is in UTF-16 is very high.
> >
> >
> >> Also there is no significance for BOM for UTF-8, which means not
> >> all applications will add a BOM for the UTF-8 text.
> >
> >
> > Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> > discouraged. Detecting UTF-8 is easy enough without a BOM.
> >
> >
> >> Finally, I don't think we
> >> can come up with an auto-detect algorithm for detecting
> >> Latin-1/UTF-*/Shift-JIS.
> >
> >
> > For all these, it's not too difficult. Shift-JIS uses bytes in
> > the 0x80-0x9F range, and has specific patterns. If there are
> > only very few characters outside us-ascii, it may not work,
> > but with more non-us-ascii characters, the probability
> > of success is going up very quickly.
> >
> >
> > Regards,   Martin.
> >
> 
> 

Received on Thursday, 6 September 2001 17:22:14 UTC