- From: Vinod Balakrishnan <vinod_balakrishnan@filemaker.com>
- Date: Thu, 6 Sep 2001 10:40:14 -0700
- To: "Martin Duerst" <duerst@w3.org>, "Lenny Turetsky" <LTuretsky@salesforce.com>, "W3intl \(E-mail\)" <www-international@w3.org>
>For all these, it's not too difficult. Shift-JIS uses bytes in
>the 0x80-0x9F range, and has specific patterns. If there are
>only very few characters outside us-ascii, it may not work,
>but with more non-us-ascii characters, the probability
>of success is going up very quickly.

In Shift-JIS, the trailing byte of a double-byte character falls in the
0x40-0xFC range, so it can look like plain ASCII; only the leading byte is
guaranteed to be in the high (non-ASCII) range. Also, Hankaku (half-width)
kana are represented as single bytes in the high range. Japanese text in
Shift-JIS that contains single-byte kana and Kanji characters can therefore
be misinterpreted as Latin-1.

-Vinod

-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Wednesday, September 05, 2001 6:36 PM
To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
Subject: RE: auto-detecting the character encoding of an uploaded file

At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>Lenny,
>
>Just some thoughts.
>
>Since you have mentioned Shift-JIS,

As a charset, spelled shift_jis (case doesn't matter, but the
underscore does).

>there is no guarantee that every other
>byte in UTF-16 is zero especially for non-us systems like Japanese/European.

No. But if you see even a single zero byte, then the chance
that the document is in UTF-16 is very high.

>Also there is no significance for BOM for UTF-8, which means not all
>applications will add a BOM for the UTF-8 text.

Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
discouraged. Detecting UTF-8 is easy enough without a BOM.

>Finally, I don't think we
>can come up with an auto-detect algorithm for detecting
>Latin-1/UTF-*/Shift-JIS.

For all these, it's not too difficult. Shift-JIS uses bytes in
the 0x80-0x9F range, and has specific patterns. If there are
only very few characters outside us-ascii, it may not work,
but with more non-us-ascii characters, the probability
of success is going up very quickly.

Regards, Martin.
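
The heuristics discussed in this thread (a zero byte suggests UTF-16, UTF-8 can be recognized by its byte patterns alone, and bytes in 0x80-0x9F point to Shift-JIS rather than Latin-1) can be put together roughly as follows. This is only a minimal sketch, not an algorithm given by either poster: the function name `guess_encoding`, the order of the checks, and the choice of Python's built-in codecs to validate the byte patterns are all assumptions made for illustration.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess one of utf-16, us-ascii, utf-8,
    shift_jis or iso-8859-1 from raw bytes, using simple heuristics."""
    # A UTF-16 byte order mark, or any zero byte at all, makes UTF-16
    # very likely: zero bytes practically never occur in the others.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")) or b"\x00" in data:
        return "utf-16"

    # Nothing outside us-ascii: there is no evidence to decide on.
    if all(b < 0x80 for b in data):
        return "us-ascii"

    # UTF-8 has strict lead/continuation byte patterns, so text in
    # another encoding rarely decodes cleanly. No BOM is needed.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Bytes in 0x80-0x9F are rarely-used control codes in Latin-1 but
    # common lead bytes in Shift-JIS. Require at least one of them and
    # a clean decode (the "specific patterns") before guessing Shift-JIS.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            data.decode("shift_jis")
            return "shift_jis"
        except UnicodeDecodeError:
            pass

    # Every byte sequence is valid Latin-1, so it is the fallback.
    return "iso-8859-1"
```

Under these assumptions, Shift-JIS text made up only of half-width kana (single bytes in 0xA1-0xDF) has no byte in 0x80-0x9F and falls through to the Latin-1 guess, which is the misinterpretation described in the reply above. A more careful detector would need statistical or language-based evidence for that case.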
Received on Thursday, 6 September 2001 13:37:43 UTC