RE: auto-detecting the character encoding of an uploaded file

>For all these, it's not too difficult. Shift-JIS uses bytes in
>the 0x80-0x9F range, and has specific patterns. If there are
>only very few characters outside us-ascii, it may not work,
>but with more non-us-ascii characters, the probability
>of success is going up very quickly.

Shift-JIS represents the trailing byte of a double-byte character in the
0x40-0x7E and 0x80-0xFC ranges, so it can fall in the ASCII range (only the
leading byte is always in the high ASCII range). Also, Hankaku (half-width)
Kana is represented as a single byte in the high ASCII range. Japanese text
in Shift-JIS that contains single-byte kana and Kanji characters can
therefore be misinterpreted as Latin-1.
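
To make the byte layout concrete, here is a minimal Python sketch of a
structural check. The function names are hypothetical, and the ranges assume
plain Shift-JIS over JIS X 0201 and JIS X 0208 (vendor extensions with lead
bytes up to 0xFC are not covered). It only checks byte patterns, not whether
each pair maps to an assigned character.

```python
def is_sjis_lead(b: int) -> bool:
    # Lead byte of a double-byte character (assumed JIS X 0208 range).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def is_sjis_trail(b: int) -> bool:
    # Trail byte: may fall in the ASCII range (0x40-0x7E) or above it.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def is_halfwidth_kana(b: int) -> bool:
    # Hankaku (half-width) katakana: a single byte in the high range.
    return 0xA1 <= b <= 0xDF


def looks_like_shift_jis(data: bytes) -> bool:
    """Return True if every byte fits the Shift-JIS byte structure."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80 or is_halfwidth_kana(b):
            i += 1  # single-byte character
        elif is_sjis_lead(b) and i + 1 < n and is_sjis_trail(data[i + 1]):
            i += 2  # well-formed double-byte character
        else:
            return False
    return True
```

With this check, the half-width kana bytes b"\xb6\xc5" pass, yet the same
bytes are also printable Latin-1 characters. Likewise the Latin-1 bytes for
"étape" (E9 74 61 70 65) pass, because E9 74 happens to look like a lead and
trail pair. A structural check alone cannot separate the two encodings.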

-Vinod
-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Wednesday, September 05, 2001 6:36 PM
To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
Subject: RE: auto-detecting the character encoding of an uploaded file


At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>Lenny,
>
>Just some thoughts.
>
>Since you have mentioned Shift-JIS,

As a charset, spelled shift_jis (case doesn't matter, but the underscore
does).


>there is no guarantee that every other
>byte in UTF-16 is zero, especially for non-US systems like Japanese/European.

No. But if you see even a single zero byte, then the chance that the
document is in UTF-16 is very high.
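
A rough Python sketch of that zero-byte heuristic follows. The function name
is hypothetical, and two things are my own assumptions rather than part of
the point above: a BOM is checked first, and the byte order is guessed from
whether zeros cluster at even or odd offsets, which only works when the text
contains a fair number of characters below U+0100.

```python
from typing import Optional


def guess_utf16(data: bytes) -> Optional[str]:
    """Return 'utf-16-be', 'utf-16-le', or None if UTF-16 seems unlikely."""
    # Assumption: trust a byte order mark when one is present.
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    # No zero byte at all: no evidence for UTF-16 from this heuristic.
    if b"\x00" not in data:
        return None
    # For characters below U+0100 the high byte is zero. In big-endian
    # order it comes first (even offsets), in little-endian second (odd).
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    return "utf-16-be" if even_zeros >= odd_zeros else "utf-16-le"
```

For example, guess_utf16("hello".encode("utf-16-le")) returns "utf-16-le",
while plain ASCII input returns None. UTF-32 data also contains zero bytes
and would be misreported here; a real detector would need to rule it out
first.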


>Also, the BOM has no significance for UTF-8, which means not all
>applications will add a BOM to UTF-8 text.

Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
discouraged. Detecting UTF-8 is easy enough without a BOM.
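
One simple way to do this, sketched in Python: attempt a strict UTF-8 decode
and see whether it succeeds. The function name is hypothetical. The
requirement that at least one non-ASCII byte be present is my own addition,
so that pure us-ascii input is not reported as evidence for UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes and is well-formed UTF-8."""
    # Pure ASCII is valid UTF-8 but tells us nothing, so require at
    # least one byte with the high bit set.
    if all(b < 0x80 for b in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, and overlong forms.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

This works because multi-byte UTF-8 sequences follow a rigid lead and
continuation pattern that text in Latin-1 or Shift-JIS rarely produces by
accident.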


>Finally, I don't think we
>can come up with an auto-detect algorithm for detecting
>Latin-1/UTF-*/Shift-JIS.

For all these, it's not too difficult. Shift-JIS uses bytes in
the 0x80-0x9F range, and has specific patterns. If there are
only very few characters outside us-ascii, it may not work,
but with more non-us-ascii characters, the probability
of success is going up very quickly.
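
As an illustration only, the checks could be ordered like this in Python.
The function name, the order of the tests, and the fallback to iso-8859-1
are assumptions of this sketch, not an established algorithm. It relies on
Python's built-in "utf-8" and "shift_jis" codecs, and on the fact that
0x80-0x9F holds only control codes in Latin-1, so real Latin-1 text almost
never uses those bytes.

```python
def guess_charset(data: bytes) -> str:
    """Guess among us-ascii, utf-8, shift_jis and iso-8859-1."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # UTF-8 has the strictest structure, so test it first.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes in 0x80-0x9F are Shift-JIS lead bytes but only control
    # codes in Latin-1, so their presence points towards Shift-JIS.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            data.decode("shift_jis")
            return "shift_jis"
        except UnicodeDecodeError:
            pass
    # Assumed fallback: every byte sequence is valid Latin-1.
    return "iso-8859-1"
```

Input with no bytes in 0x80-0x9F, such as text whose only non-ASCII bytes
are half-width kana, falls through to iso-8859-1, which is the case where
this kind of guess may not work.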


Regards,   Martin.

Received on Thursday, 6 September 2001 13:37:43 UTC