RE: auto-detecting the character encoding of an uploaded file

At 10:40 01/09/06 -0700, Vinod Balakrishnan wrote:
> >For all these, it's not too difficult. Shift-JIS uses bytes in
> >the 0x80-0x9F range, and has specific patterns. If there are
> >only very few characters outside us-ascii, it may not work,
> >but with more non-us-ascii characters, the probability
> >of success is going up very quickly.
>
>Shift-JIS represents the trailing bytes of double-byte characters in the
>0x40-0xFC range (only the leading byte is guaranteed to be in the high
>ASCII range). Also, Hankaku (single-byte) Kana is represented as a single
>byte in the high ASCII range. Japanese text in Shift-JIS that contains
>single-byte kana and Kanji characters can be misinterpreted as Latin-1.

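Yes. But Shift_JIS text can contain bytes in the 0x80-0x9F range (for
example as lead bytes of double-byte characters), whereas Latin-1 has
only C1 control codes there, no printable characters. If the data
contains such a byte, Shift_JIS cannot be misinterpreted as Latin-1.
It may still be misinterpreted as windows-1252, which does assign
printable characters in that range, but that's a different story.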
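
A rough sketch of that check in Python, for illustration only. The
function names and the "ambiguous"/"unknown" labels are made up here and
do not come from any particular detector. It assumes that Python's
built-in "shift_jis" and "cp1252" codecs are an acceptable stand-in for
"decodes cleanly as Shift_JIS / windows-1252".

```python
def has_c1_bytes(data: bytes) -> bool:
    """True if any byte is in 0x80-0x9F, where Latin-1 has only C1 controls."""
    return any(0x80 <= b <= 0x9F for b in data)


def decodes_as(data: bytes, encoding: str) -> bool:
    """True if the whole byte string decodes strictly in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: a very coarse guess, not a real detector."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    if has_c1_bytes(data):
        # A byte in 0x80-0x9F rules out Latin-1 for ordinary text.
        if decodes_as(data, "shift_jis"):
            return "shift_jis"
        if decodes_as(data, "cp1252"):
            return "windows-1252"
        return "unknown"
    # No byte in 0x80-0x9F: this test alone cannot separate Latin-1
    # from Shift_JIS text that uses only other high bytes.
    return "ambiguous"
```

With only a few non-ASCII bytes the Shift_JIS/windows-1252 decision can
still go wrong, since short windows-1252 strings may happen to form
valid Shift_JIS sequences. A real detector would also weigh how
plausible the decoded characters are.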

Regards,  Martin.

Received on Thursday, 6 September 2001 21:50:36 UTC