- From: Lenny Turetsky <LTuretsky@salesforce.com>
- Date: Tue, 4 Sep 2001 17:14:38 -0700
- To: "W3intl (E-mail)" <www-international@w3.org>
We have a web application where a user uploads a file that could be in one of several different encodings (ISO-8859-1, SHIFT-JIS, UTF-8, UTF-16). We ask the user what the encoding is, so we know how to decode the file.

We would like to do some error checking, though, to help prevent users from choosing the wrong encoding. UTF-8 and UTF-16 do place some restrictions on which byte sequences are valid. That is good for us: if the Java InputStreamReader class we are using throws an exception, we know the file was in the wrong encoding.

Can we detect that a file is really in some other format if the user specifies ISO-8859-1? That is the default, since that is what is generated by a majority of the applications that our users use, and many users are too dumb to know what the correct one is. Can we detect some common cases where it's in a different format?

- UTF-8 and UTF-16 files sometimes start with a Byte Order Mark (BOM). This is especially true of files generated by Microsoft applications (Excel, Notepad), which many of our users use. If the first few bytes are FF FE, FE FF, or EF BB BF, we know to reject the file. Is this safe?

- UTF-16 will commonly have every other byte as zero. ISO-8859-1 text shouldn't contain zero bytes, as far as I know. Is it safe to reject any file containing a zero byte if the user told us it is ISO-8859-1?

- Besides looking for an optional BOM, I can't think of anything about a UTF-8 file that would make it invalid or unusual as ISO-8859-1. Any ideas?

As always, your help is greatly appreciated.

Lenny Turetsky
Senior Member, Technical Staff
i18n Man of Mystery
salesforce.com
The Landmark @ One Market
Suite 300
San Francisco, CA 94105
USA
+1.415.901.5078
lturetsky@salesforce.com
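
For illustration, the checks described in the message could be sketched in Java roughly as follows. This is a minimal sketch, not the application's actual code: the class name EncodingChecks and its method names are hypothetical, and it assumes the whole upload is available as a byte array. The strict UTF-8 check uses java.nio.charset.CharsetDecoder with CodingErrorAction.REPORT, which is an assumption about how the decoding would be done; the message itself only mentions InputStreamReader. All three checks are heuristics, and whether they are safe is exactly the open question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative and not from the original message.
final class EncodingChecks {

    // True if the data starts with a UTF-16 BOM (FF FE or FE FF)
    // or the UTF-8 encoded BOM (EF BB BF).
    static boolean hasUnicodeBom(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        if ((b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)) {
            return true;
        }
        return data.length >= 3
                && b0 == 0xEF && b1 == 0xBB && (data[2] & 0xFF) == 0xBF;
    }

    // True if any byte is zero, which is common in UTF-16 text
    // and unexpected in ISO-8859-1 text.
    static boolean containsZeroByte(byte[] data) {
        for (byte b : data) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    // The decoder is configured to report errors instead of replacing them.
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A file declared as ISO-8859-1 might then be flagged for a second look when hasUnicodeBom or containsZeroByte returns true. Note that FF FE and FE FF are also legal (if unlikely) ISO-8859-1 text, so a match is a strong hint rather than proof.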
Received on Tuesday, 4 September 2001 20:15:11 UTC