Re: auto-detecting the character encoding of an uploaded file from Bob Jung on 2001-09-07 (www-international@w3.org from July to September 2001)

From: Bob Jung <bobj@netscape.com>
Date: Thu, 06 Sep 2001 20:11:20 -0700
To: Martin Duerst <duerst@w3.org>
CC: vinod@filemaker.com, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>, Shanjian Li <shanjian@netscape.com>, momoi@netscape.com
Message-ID: <3B983AD8.5040205@netscape.com>

FYI, there will be a paper presented at the Nineteenth International
Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in San
Jose, California:

    A Composite Approach to Language/Encoding Detection
       by Shanjian Li & Katsuhiko Momoi - Netscape Communications

    Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm

    Abstract: http://www.unicode.org/iuc/iuc19/a322.html

And since this is part of Mozilla, it is all open source!
    http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

-Bob

Martin Duerst wrote:

> At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>
>> Lenny,
>>
>> Just some thoughts.
>>
>> Since you have mentioned Shift-JIS,
>
>
> As a charset, spelled shift_jis (case doesn't matter, but the 
> underscore does).
>
>
>> there is no guarantee that every other
>> byte in UTF-16 is zero, especially for non-US systems like
>> Japanese/European.
>
>
> No. But if you see even a single zero byte, then the chance that the
> document is in UTF-16 is very high.
>
>
>> Also, the BOM has no significance in UTF-8, which means not all
>> applications will add a BOM to UTF-8 text.
>
>
> Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> discouraged. Detecting UTF-8 is easy enough without a BOM.
>
>
>> Finally, I don't think we
>> can come up with an auto-detect algorithm for detecting
>> Latin-1/UTF-*/Shift-JIS.
>
>
> For all of these, it's not too difficult. Shift-JIS uses bytes in
> the 0x80-0x9F range, and has specific patterns. If there are
> only a few characters outside us-ascii, detection may not work,
> but with more non-us-ascii characters, the probability
> of success goes up very quickly.
>
>
> Regards,   Martin.
>
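
The heuristics discussed in the quoted thread (a zero byte suggests
UTF-16, UTF-8 is recognizable without a BOM, Shift_JIS uses bytes in the
0x80-0x9F range) can be sketched roughly as follows. This is only an
illustration in Python, not the Mozilla universalchardet code: the
function names guess_encoding and looks_like_utf8, the order of the
checks, and the fallback to iso-8859-1 are all assumptions made for the
example. UTF-32 is not handled, and real detectors also weigh character
frequencies.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """True if data is strictly valid UTF-8 and has at least one non-ASCII byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess a charset label from raw bytes."""
    # 1. An explicit BOM, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. Even a single zero byte makes UTF-16 very likely.
    #    Zeros at even offsets suggest big-endian, at odd offsets little-endian.
    if b"\x00" in data:
        even_zeros = data[0::2].count(0)
        odd_zeros = data[1::2].count(0)
        return "utf-16-be" if even_zeros > odd_zeros else "utf-16-le"

    # 3. Nothing outside us-ascii: no further evidence to use.
    if all(b < 0x80 for b in data):
        return "us-ascii"

    # 4. UTF-8 has strict lead/continuation byte patterns, so text in other
    #    encodings with non-ASCII bytes rarely validates by accident.
    if looks_like_utf8(data):
        return "utf-8"

    # 5. Shift_JIS lead bytes fall in 0x80-0x9F, which are control codes in
    #    Latin-1 and rarely appear in Latin-1 text. Assumption: require both
    #    such a byte and a clean decode.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            data.decode("shift_jis")
            return "shift_jis"
        except UnicodeDecodeError:
            pass

    # 6. Assumed fallback: every byte sequence is valid iso-8859-1.
    return "iso-8859-1"
```

With only one or two non-ASCII bytes the Shift_JIS and Latin-1 guesses
are unreliable, as noted above; more non-ASCII input gives the pattern
checks more to work with.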

Received on Thursday, 6 September 2001 13:30:47 UTC