Re: auto-detecting the character encoding of an uploaded file

From: A. Vine <avine@eng.sun.com>
Date: Thu, 06 Sep 2001 14:42:04 -0700
To: vinod@filemaker.com
Cc: Martin Duerst <duerst@w3.org>, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>
Message-id: <3B97EDAC.66FE789D@eng.sun.com>
Small detail - there is no such thing as high ASCII.  ASCII is 7-bit (MIME name
is US-ASCII).  The 8-bit range could be considered the upper ISO-8859 range, or
given a number of other names.
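
To make the distinction concrete, here is a minimal Python sketch, assuming
the data is available as a bytes object. The helper name is_us_ascii is
hypothetical and only for illustration.

```python
def is_us_ascii(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


def count_8bit_bytes(data: bytes) -> int:
    """Count bytes in 0x80-0xFF, which are outside US-ASCII by definition."""
    return sum(1 for b in data if b >= 0x80)
```

Any byte counted by count_8bit_bytes belongs to some other charset
(ISO-8859-x, shift_jis, UTF-8, and so on), not to ASCII.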

Andrea
iPlanet i18n architect and charset geek

Vinod Balakrishnan wrote:
> 
> >For all these, it's not too difficult. Shift-JIS uses bytes in
> >the 0x80-0x9F range, and has specific patterns. If there are
> >only very few characters outside us-ascii, it may not work,
> >but with more non-us-ascii characters, the probability
> >of success is going up very quickly.
> 
> Shift-JIS represents the trailing bytes of double-byte characters in the
> 0x40-0xF0 range (only the leading byte is in the high ASCII range). Also,
> the Hankaku (single-byte) Kana is represented as a single byte in the high
> ASCII range. Japanese text in Shift-JIS that contains single-byte kana and
> Kanji characters can be misinterpreted as Latin-1.
> 
> -Vinod
> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
> Sent: Wednesday, September 05, 2001 6:36 PM
> To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
> Subject: RE: auto-detecting the character encoding of an uploaded file
> 
> At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
> >Lenny ,
> >
> >Just some thoughts.
> >
> >Since you have mentioned Shift-JIS,
> 
> As a charset, spelled shift_jis (case doesn't matter, but the underscore
> does).
> 
> >there is no guarantee that every other
> >byte in UTF-16 is zero, especially for non-US systems such as
> >Japanese/European.
> 
> No. But if you see even a single zero byte, then the chance that the
> document is in UTF-16 is very high.
> 
> >Also, a BOM has no significance for UTF-8, which means not all
> >applications will add a BOM to UTF-8 text.
> 
> Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> discouraged. Detecting UTF-8 is easy enough without a BOM.
> 
> >Finally, I don't think we
> >can come up with an auto-detect algorithm for detecting
> >Latin-1/UTF-*/Shift-JIS.
> 
> For all these, it's not too difficult. Shift-JIS uses bytes in
> the 0x80-0x9F range, and has specific patterns. If there are
> only very few characters outside us-ascii, it may not work,
> but with more non-us-ascii characters, the probability
> of success is going up very quickly.
> 
> Regards,   Martin.
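
The heuristics discussed in this thread can be sketched roughly as follows.
This is only an illustration under stated assumptions, not a tested
detector: the function names and the order of the checks are hypothetical,
and it relies on the strict "utf-8" and "shift_jis" codecs in Python's
standard library to reject byte patterns that do not fit.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if data decodes under charset without errors."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data: bytes) -> str:
    """Guess a charset label for data using the thread's heuristics."""
    # A zero byte is very unlikely in 8-bit text, so it hints at UTF-16.
    if b"\x00" in data:
        return "utf-16"
    # Only 7-bit bytes: nothing to distinguish, so report US-ASCII.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # UTF-8 has strict lead/continuation patterns, so no BOM is needed.
    if decodes_as(data, "utf-8"):
        return "utf-8"
    # shift_jis lead bytes must be followed by a valid trail byte.
    if decodes_as(data, "shift_jis"):
        return "shift_jis"
    # Every byte sequence is valid Latin-1, so it is only a fallback.
    return "iso-8859-1"
```

As noted earlier in the thread, the guess is weak when there are few
non-us-ascii bytes. Latin-1 text whose only 8-bit bytes fall in 0xA1-0xDF
is likely to be reported as shift_jis, because those bytes are single-byte
kana there, and UTF-16 text without any zero byte is not recognized by the
first check.
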
Received on Thursday, 6 September 2001 17:42:49 GMT
