From: Merle Tenney <Merle.Tenney@corp.palm.com>
Date: Thu, 6 Sep 2001 14:20:51 -0700
To: "'Bob Jung'" <bobj@netscape.com>, Martin Duerst <duerst@w3.org>
Cc: vinod@filemaker.com, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>, Shanjian Li <shanjian@netscape.com>, momoi@netscape.com
Thanks, Bob, for the reference to your team's upcoming paper (and hi, by the way).

Most of this discussion has focused on subtle differences in legal codepoints in various encodings and legal patterns of bytes in encodings. There is another approach, though, which is much more effective and gives you valuable additional information to boot. That is a system based on relative byte and n-gram frequencies, which have characteristic patterns for a given pair of language and encoding. So in English, for example, "e", "th", "tion", and " the " are quite common, whereas in Spanish "os ", "ción", and " un " are quite common. In different encodings, say Windows 1252, MacRoman, UTF-8, UTF-16BE, and UTF-16LE, this will translate into different relative n-gram frequencies.

The way the system works is that first a development version is exposed to a reasonable corpus of a particular language in a particular encoding. For some reason, 100K words seems to stand out in my mind. The system then empirically calculates the frequencies for this corpus and stores the salient frequencies in a table. And the amazing thing is that it is fast and accurate, and it requires zero intervention by encoding specialists or linguists. This table is packaged up with the tables for the other languages and encodings previously profiled, and these tables are included with the auto-detection software, which is normally bundled with a browser, search engine, word processor, etc.

In actual use, a passage of user text has *its* n-gram frequencies calculated on the fly, these are compared to the stored profiles, and the profile that most closely matches the sample is selected. In practice, it is surprisingly good and works on a surprisingly short text sample. You can usually determine the language and encoding of the sample with near certainty in a line or two of text.

To the best of my knowledge, this approach was first proposed and developed by Ken Beesley in the late 80s while he was at ALP Systems. Here is a reference to that early work:

http://www.xrce.xerox.com/people/beesley/langid.html

Ken subsequently joined Xerox PARC and then XRCE in Grenoble, where the work was picked up. It was later commercialized by Xerox's spin-off InXight as part of their LinguistX Platform product:

http://www.inxight.com/products_sp/linguistx/index.html

I know that Microsoft has developed a similar technology, which is shown off quite well in their multilingual spelling checking in Word. However, I don't think it is available to developers outside of Microsoft. Inso also had a competitive technology, which they sold to Lernout & Hauspie. It is now called the IntelliScope Language Recognizer, and it is part of their IntelliScope Retrieval Toolkit, described here:

http://www.lhsl.com/tech/icm/retrieval/toolkit/lr.asp

I can't tell from your brief description, Bob, if n-gram frequencies (under a different name) are part of your Mozilla work or not. If they're not, they should be. :-)

The bottom line, folks, is that there are much better technologies available that allow you to automatically detect encodings, and they come with the tremendous additional benefit of being able to identify the language as well. We can all imagine lots of ways we could use that information. Maybe some of you will start sniffing down a different trail for a solution here....
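To make the training/matching loop concrete, here is a minimal sketch in Python of the kind of profile-based detector I am describing. The function names and the cosine-similarity comparison are illustrative choices of mine, not anything taken from LinguistX, the Microsoft or Inso/L&H products, or the Mozilla detector:

from collections import Counter
from math import sqrt

def ngram_profile(data, n=3):
    """Relative frequencies of overlapping byte n-grams in a text sample."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(f * q[g] for g, f in p.items() if g in q)
    norm = sqrt(sum(f * f for f in p.values())) * sqrt(sum(f * f for f in q.values()))
    return dot / norm if norm else 0.0

def train_profiles(corpora):
    """corpora maps (language, encoding) pairs to raw training bytes."""
    return {pair: ngram_profile(text) for pair, text in corpora.items()}

def detect(sample, profiles):
    """Return the (language, encoding) pair whose profile best matches the sample."""
    sample_profile = ngram_profile(sample)
    return max(profiles, key=lambda pair: similarity(sample_profile, profiles[pair]))

The commercial systems obviously use more careful statistics and prune the stored tables, but even something this crude will usually tell, say, Spanish in Latin-1 apart from Spanish in UTF-8 on a line or two of text, because the accented characters turn into very different byte n-grams.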
Merle

> -----Original Message-----
> From: bobj@netscape.com [mailto:bobj@netscape.com]
> Sent: Thursday, September 06, 2001 8:11 PM
> To: Martin Duerst
> Cc: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail); Shanjian Li; momoi@netscape.com
> Subject: Re: auto-detecting the character encoding of an uploaded file
>
> FYI, there will be a paper presented at the Nineteenth International
> Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in
> San Jose, California:
>
> A Composite Approach to Language/Encoding Detection
> by Shanjian Li & Katsuhiko Momoi - Netscape Communications
>
> Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm
>
> Abstract: http://www.unicode.org/iuc/iuc19/a322.html
>
> And since this is part of Mozilla, it is all open source!
>
> http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
>
> -Bob
>
> Martin Duerst wrote:
>
> > At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
> >
> >> Lenny,
> >>
> >> Just some thoughts.
> >>
> >> Since you have mentioned Shift-JIS,
> >
> > As a charset, spelled shift_jis (case doesn't matter, but the
> > underscore does).
> >
> >> there is no guarantee that every other byte in UTF-16 is zero
> >> especially for non-us systems like Japanese/European.
> >
> > No. But if you see even a single zero byte, then the chance that the
> > document is in UTF-16 is very high.
> >
> >> Also there is no significance for BOM for UTF-8, which means not all
> >> applications will add a BOM for the UTF-8 text.
> >
> > Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> > discouraged. Detecting UTF-8 is easy enough without a BOM.
> >
> >> Finally, I don't think we can come up with an auto-detect algorithm
> >> for detecting Latin-1/UTF-*/Shift-JIS.
> >
> > For all these, it's not too difficult. Shift-JIS uses bytes in
> > the 0x80-0x9F range, and has specific patterns. If there are
> > only very few characters outside us-ascii, it may not work,
> > but with more non-us-ascii characters, the probability
> > of success is going up very quickly.
> >
> > Regards, Martin.
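And just so the simpler end of the spectrum is represented too: the pattern checks Martin describes in the quoted thread above are easy to sketch as well. The following is only an illustration of those points (a zero byte hinting at UTF-16, strict decoding confirming UTF-8, Shift-JIS lead bytes in the 0x80-0x9F range), written by me in Python using Python's codec names; it is not the Mozilla detector or any of the products mentioned earlier:

def sniff_encoding(data):
    """Rough byte-pattern checks along the lines of the quoted discussion."""
    # A byte-order mark settles it immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # Even a single zero byte in ordinary text is a strong hint of UTF-16;
    # whether it sits at an even or odd offset hints at the byte order.
    if b"\x00" in data:
        return "utf-16-be" if data.index(b"\x00") % 2 == 0 else "utf-16-le"
    # UTF-8 is easy to confirm without a BOM: a strict decode either works or it doesn't.
    try:
        data.decode("utf-8")
        return "utf-8" if any(b > 0x7f for b in data) else "us-ascii"
    except UnicodeDecodeError:
        pass
    # Shift-JIS lead bytes fall in 0x81-0x9F (and 0xE0 and up); ordinary
    # Latin-1 text essentially never uses 0x80-0x9F, so that range is the telltale.
    if any(0x81 <= b <= 0x9f for b in data):
        try:
            data.decode("shift_jis")
            return "shift_jis"
        except UnicodeDecodeError:
            pass
    return "iso-8859-1"  # or windows-1252; only statistics can do better here

In practice you would combine checks like these with the frequency profiles: the byte patterns narrow the candidates cheaply, and the n-gram statistics settle questions like Latin-1 versus Windows 1252 versus MacRoman that raw byte ranges cannot.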
Received on Thursday, 6 September 2001 17:22:14 UTC