- From: Bob Jung <bobj@netscape.com>
- Date: Thu, 06 Sep 2001 20:11:20 -0700
- To: Martin Duerst <duerst@w3.org>
- CC: vinod@filemaker.com, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>, Shanjian Li <shanjian@netscape.com>, momoi@netscape.com
FYI, there will be a paper presented at the Nineteenth International
Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in
San Jose, California:

    A Composite Approach to Language/Encoding Detection
    by Shanjian Li & Katsuhiko Momoi - Netscape Communications
    Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm
    Abstract: http://www.unicode.org/iuc/iuc19/a322.html

And since this is part of Mozilla, it is all open source!
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

-Bob

Martin Duerst wrote:

> At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>
>> Lenny,
>>
>> Just some thoughts.
>>
>> Since you have mentioned Shift-JIS,
>
> As a charset, spelled shift_jis (case doesn't matter, but the
> underscore does).
>
>> there is no guarantee that every other
>> byte in UTF-16 is zero especially for non-us systems like
>> Japanese/European.
>
> No. But if you see even a single zero byte, then the chance that the
> document is in UTF-16 is very high.
>
>> Also there is no significance for BOM for UTF-8, which means not all
>> applications will add a BOM for the UTF-8 text.
>
> Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> discouraged. Detecting UTF-8 is easy enough without a BOM.
>
>> Finally, I don't think we
>> can come up with an auto-detect algorithm for detecting
>> Latin-1/UTF-*/Shift-JIS.
>
> For all these, it's not too difficult. Shift-JIS uses bytes in
> the 0x80-0x9F range, and has specific patterns. If there are
> only very few characters outside us-ascii, it may not work,
> but with more non-us-ascii characters, the probability
> of success is going up very quickly.
>
> Regards, Martin.
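
For illustration only, the heuristics in the quoted exchange (a zero byte
suggests UTF-16, UTF-8 can be recognised without a BOM, Shift-JIS has
characteristic byte patterns) might be sketched roughly as below. This is
not the composite approach of the paper or the Mozilla universalchardet
code. The function name guess_charset is hypothetical, and using Python's
strict codecs as a validity test is an assumption that stands in for real
pattern or frequency analysis. As noted in the thread, short inputs with
few non-us-ascii bytes can easily be misclassified.

```python
def guess_charset(data: bytes) -> str:
    """Hypothetical helper: return a best-guess charset label for data."""
    # A single zero byte makes UTF-16 very likely; the other
    # candidates here do not normally contain one in text.
    if b"\x00" in data:
        return "utf-16"

    # Pure 7-bit input carries no further evidence either way.
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass

    # UTF-8 has strict lead/continuation byte rules, so a clean decode
    # is strong evidence even without a BOM. Shift-JIS is tried next;
    # a strict decode is only a crude stand-in for pattern analysis.
    for label in ("utf-8", "shift_jis"):
        try:
            data.decode(label)
            return label
        except UnicodeDecodeError:
            continue

    # Every byte sequence is valid Latin-1, so it is only a fallback.
    return "iso-8859-1"
```

For example, guess_charset("日本語".encode("shift_jis")) goes through the
Shift-JIS branch, because those bytes are not well-formed UTF-8. Trying
UTF-8 before Shift-JIS is a deliberate choice in this sketch: well-formed
UTF-8 is much less likely to occur by accident than a byte sequence that
merely happens to decode as Shift-JIS.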
Received on Thursday, 6 September 2001 13:30:47 UTC