RE: auto-detecting the character encoding of an uploaded file from Martin Duerst on 2001-09-06 (www-international@w3.org from July to September 2001)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 07 Sep 2001 08:14:00 +0900
To: <vinod@filemaker.com>, "Lenny Turetsky" <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>
Message-Id: <4.2.0.58.J.20010907081205.03e20d20@localhost>

At 10:40 01/09/06 -0700, Vinod Balakrishnan wrote:
> >For all these, it's not too difficult. Shift-JIS uses bytes in
> >the 0x80-0x9F range, and has specific patterns. If there are
> >only very few characters outside us-ascii, it may not work,
> >but with more non-us-ascii characters, the probability
> >of success is going up very quickly.
>
>Shift-JIS represents the trailing bytes of double-byte characters in the
>0x40-0xF0 range (only the leading byte is in the high ASCII range). Also, the
>Hankaku (single-byte) Kana is represented as a single byte in the high ASCII
>range. Japanese text in Shift-JIS that contains single-byte kana and Kanji
>characters can be misinterpreted as Latin-1.

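Yes. Shift_JIS can have bytes in the 0x80-0x9F range, but Latin-1 has no
printable characters there (only C1 control codes, which don't show up in
ordinary text). If there is such a byte, Shift_JIS cannot be misinterpreted
as Latin-1. It may be misinterpreted as windows-1252, which does assign
printable characters to most of that range, but that's a different story.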
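
A rough sketch of the check described above, in Python. The function names
and the returned labels are made up for illustration, and the logic is only
a heuristic, not a complete detector: it looks for bytes in 0x80-0x9F to rule
out Latin-1, then uses Python's built-in "shift_jis" codec to see whether the
bytes also form valid Shift_JIS sequences.

```python
def has_c1_byte(data: bytes) -> bool:
    """Return True if any byte lies in 0x80-0x9F (no printable Latin-1 there)."""
    return any(0x80 <= b <= 0x9F for b in data)


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def guess(data: bytes) -> str:
    """Hypothetical helper: a coarse guess, not a full charset detector."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    if has_c1_byte(data):
        # A byte in 0x80-0x9F rules out Latin-1 as plain text.
        if decodes_as(data, "shift_jis"):
            return "shift_jis"
        return "not latin-1 (perhaps windows-1252)"
    # No byte in 0x80-0x9F: this test alone cannot decide.
    return "ambiguous"


print(guess("日本語".encode("shift_jis")))
print(guess("café".encode("latin-1")))
```

Text that consists only of half-width kana (single bytes in 0xA1-0xDF) has
no byte in 0x80-0x9F, so this check stays inconclusive for it. And a
windows-1252 text can happen to decode as Shift_JIS, so "shift_jis" here
means "plausible", not "certain".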

Regards,  Martin.

Received on Thursday, 6 September 2001 21:50:36 UTC