
RE: auto-detecting the character encoding of an uploaded file

From: Martin Duerst <duerst@w3.org>
Date: Thu, 06 Sep 2001 10:36:15 +0900
To: <vinod@filemaker.com>, "Lenny Turetsky" <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>
At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>Lenny ,
>Just some thoughts.
>Since you have mentioned Shift-JIS,

As a charset name, this is spelled shift_jis (case doesn't matter, but the underscore does).

>there is no guarantee that every other
>byte in UTF-16 is zero especially for non-us systems like Japanese/European

No. But if you see even a single zero byte, then the chance that the
document is in UTF-16 is very high.
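
A minimal sketch of this zero-byte heuristic, in Python. The function name and the even/odd counting used to guess the byte order are illustrative assumptions, not part of any standard. As noted in the quoted text, UTF-16 data made up mostly of Japanese characters may contain no zero bytes at all, in which case the sketch simply reports nothing.

```python
from typing import Optional


def guess_utf16_from_zero_bytes(data: bytes) -> Optional[str]:
    """Guess a UTF-16 byte order from the position of zero bytes.

    Returns "utf-16-be", "utf-16-le", or None when the data gives no hint.
    This is a heuristic sketch: legacy text encodings such as us-ascii,
    shift_jis or iso-8859-1 do not normally contain zero bytes, so seeing
    one suggests UTF-16.
    """
    # Count zero bytes at even and odd offsets separately.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)

    if even_zeros == odd_zeros:
        # Covers the "no zero bytes at all" case and ambiguous ties.
        return None

    # For characters below U+0100 the high byte is zero. Big-endian puts
    # the high byte first (even offsets), little-endian puts it second.
    return "utf-16-be" if even_zeros > odd_zeros else "utf-16-le"
```

For example, "Hello" encoded as UTF-16 (either byte order) is detected, while the same text in shift_jis, or "日本語" in UTF-16, yields None.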

>Also there is no significance for BOM for  UTF-8, which means not all
>applications will add a BOM for the UTF-8 text.

Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
discouraged. Detecting UTF-8 is easy enough without a BOM.
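
One way to do this without a BOM, sketched in Python: try a strict UTF-8 decode and see whether it succeeds. The function name and the three return values are assumptions made for illustration. Pure us-ascii data is also valid UTF-8, so the sketch reports that case separately, since it says nothing about which encoding was intended.

```python
from typing import Optional


def classify_utf8(data: bytes) -> Optional[str]:
    """Classify data as "ascii", "utf-8", or None (not valid UTF-8).

    Relies on Python's strict UTF-8 decoder, which rejects stray
    continuation bytes, truncated sequences and overlong forms.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

    # Every byte below 0x80: valid UTF-8, but equally valid in most
    # other ASCII-compatible encodings, so it is not real evidence.
    if all(b < 0x80 for b in data):
        return "ascii"
    return "utf-8"
```

Text in a legacy encoding such as shift_jis very rarely happens to form valid UTF-8 multi-byte sequences, which is why this check works well once a few non-us-ascii characters are present.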

>Finally, I don't think we
>can come up with an auto-detect algorithm for detecting

For all these, it's not too difficult. Shift-JIS uses bytes in
the 0x80-0x9F range, and has specific patterns. If there are
only very few characters outside us-ascii, detection may not work,
but with more non-us-ascii characters, the probability
of success goes up very quickly.
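
A rough sketch of such a pattern check, in Python. The byte ranges used here (lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC, single-byte half-width katakana 0xA1-0xDF) are the commonly documented Shift_JIS layout including vendor extensions; the function name and the returned counts are assumptions for illustration. How many matches to require before trusting the result is left to the caller.

```python
from typing import Optional, Tuple


def shift_jis_evidence(data: bytes) -> Optional[Tuple[int, int]]:
    """Scan data for the Shift_JIS byte structure.

    Returns (double_byte_chars, lead_bytes_in_0x81_0x9F), or None if
    some byte sequence cannot be Shift_JIS. More double-byte characters
    means stronger evidence; (0, 0) means plain us-ascii, no evidence.
    """
    pairs = 0
    low_leads = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            # us-ascii range or single-byte half-width katakana.
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            # Lead byte: a valid trail byte must follow.
            if i + 1 >= n:
                return None
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return None
            pairs += 1
            if b <= 0x9F:
                # Lead bytes in 0x81-0x9F are typical of Shift_JIS.
                low_leads += 1
            i += 2
        else:
            # 0x80, 0xA0 and 0xFD-0xFF are not used.
            return None
    return pairs, low_leads
```

For "日本語" encoded as shift_jis this finds three double-byte characters, all with lead bytes in the 0x81-0x9F range. With only one or two such characters, other encodings can match by chance, which is the low-confidence case described above.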

Regards,   Martin.
Received on Wednesday, 5 September 2001 22:06:45 UTC
