- From: A. Vine <avine@eng.sun.com>
- Date: Thu, 06 Sep 2001 14:42:04 -0700
- To: vinod@filemaker.com
- Cc: Martin Duerst <duerst@w3.org>, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>
Small detail - there is no such thing as high ASCII. ASCII is 7-bit (MIME name is US-ASCII). The 8-bit range could be considered the upper ISO-8859 range, or given a number of other names.

Andrea
iPlanet i18n architect and charset geek

Vinod Balakrishnan wrote:
>
> >For all these, it's not too difficult. Shift-JIS uses bytes in
> >the 0x80-0x9F range, and has specific patterns. If there are
> >only very few characters outside us-ascii, it may not work,
> >but with more non-us-ascii characters, the probability
> >of success is going up very quickly.
>
> Shift-JIS represent the trailing bytes of double byte in 0x40-0xF0 range (
> only the leading byte is in high ASCII range ) . Also the Hankaku (single
> byte) Kana is represented as single byte in the high ASCII range. A Japanese
> text in Shift-JIS contains single byte kana and Kanji characters can be
> misinterpreted as Latin-1
>
> -Vinod
> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
> Sent: Wednesday, September 05, 2001 6:36 PM
> To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
> Subject: RE: auto-detecting the character encoding of an uploaded file
>
> At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
> >Lenny ,
> >
> >Just some thoughts.
> >
> >Since you have mentioned Shift-JIS,
>
> As a charset, spelled shift_jis (case doesn't matter, but the underscore
> does).
>
> >there is no guarantee that every other
> >byte in UTF-16 is zero especially for non-us systems like Japanese/European
> >.
>
> No. But if you see even a single zero byte, then the chance that the
> document is in UTF-16 is very high.
>
> >Also there is no significance for BOM for UTF-8, which means not all
> >applications will add a BOM for the UTF-8 text.
>
> Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> discouraged. Detecting UTF-8 is easy enough without a BOM.
>
> >Finally, I don't think we
> >can come up with an auto-detect algorithm for detecting
> >Latin-1/UTF-*/Shift-JIS.
>
> For all these, it's not too difficult. Shift-JIS uses bytes in
> the 0x80-0x9F range, and has specific patterns. If there are
> only very few characters outside us-ascii, it may not work,
> but with more non-us-ascii characters, the probability
> of success is going up very quickly.
>
> Regards, Martin.
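
The heuristics discussed in this thread (a zero byte suggests UTF-16, UTF-8 is recognizable without a BOM, Shift-JIS uses bytes in 0x80-0x9F with specific patterns) can be sketched roughly as below. This is only an illustration, not anyone's actual implementation: the function names guess_encoding and is_valid_shift_jis are made up for this example, the order of the checks is an assumption, and the lead/trail byte ranges are the usual Shift-JIS layout rather than figures taken from the messages above.

```python
def is_valid_shift_jis(data: bytes) -> bool:
    """Return True if data follows the Shift-JIS byte structure.

    Assumed layout (not taken from the thread):
      single byte: 0x00-0x7F, and 0xA1-0xDF (half-width kana)
      lead byte:   0x81-0x9F or 0xE0-0xFC
      trail byte:  0x40-0x7E or 0x80-0xFC
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            i += 1  # single-byte character
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            if i + 1 >= n:
                return False  # lead byte with no trail byte
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return False  # illegal trail byte
            i += 2
        else:
            return False  # 0x80, 0xA0, 0xFD-0xFF are not used
    return True


def guess_encoding(data: bytes) -> str:
    """Guess a charset label; a heuristic, so it can be wrong."""
    # Even a single zero byte makes UTF-16 very likely.
    if b"\x00" in data:
        return "utf-16"
    # Nothing outside the 7-bit range: plain US-ASCII.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # UTF-8 has a strict structure, so a clean decode is a strong signal.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes in 0x80-0x9F are C1 controls in ISO-8859-1 and rare in real
    # Latin-1 text, so together with a valid structure they point to Shift-JIS.
    if is_valid_shift_jis(data) and any(0x80 <= b <= 0x9F for b in data):
        return "shift_jis"
    # Fallback: every byte sequence is valid ISO-8859-1.
    return "iso-8859-1"
```

Note that text consisting only of half-width kana (bytes 0xA1-0xDF) falls through to iso-8859-1 in this sketch, which is the ambiguity Vinod points out; with only a few non-ASCII bytes any such guess stays unreliable.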
Received on Thursday, 6 September 2001 17:42:49 UTC