auto-detecting the character encoding of an uploaded file

We have a web application where a user uploads a file that could be in one
of several different encodings (ISO-8859-1, SHIFT-JIS, UTF-8, UTF-16).  We
ask the user what the encoding is, so we know how to decode the file.  We
would like to do some error checking, though, to help prevent users from
choosing the wrong encoding.

UTF-8 and UTF-16 do place some restrictions on which byte sequences are
valid.  That is good for us.  If the Java InputStreamReader class we are
using throws an exception, we know the file was in the wrong encoding.
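
A minimal sketch of that kind of strict check, assuming the java.nio.charset
API is available.  The class and method names (StrictDecodeCheck,
decodesCleanly) are hypothetical, made up for illustration.  The decoder is
asked explicitly to REPORT bad input, because an InputStreamReader built from
a charset name in current JDKs replaces malformed bytes with U+FFFD rather
than throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of any library.
final class StrictDecodeCheck {

    // Returns true if the whole byte array is a legal byte sequence
    // in the given charset, false if the decoder reports an error.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // The result is discarded; we only care whether decoding fails.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that this only helps for encodings that have illegal sequences.  Every
byte sequence decodes cleanly as ISO-8859-1, so the check can never reject a
file for that choice.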

Can we detect that a file is really in some other encoding when the user
specifies ISO-8859-1?  That is the default, since it is what the majority
of the applications our users use generate, and many users do not know
which encoding is the correct one.  Can we detect some common cases where
the file is in a different encoding?
- UTF-8 and UTF-16 files sometimes start with a Byte Order Mark (BOM).  This
is especially true of files generated by Microsoft applications (Excel,
Notepad), which many of our users use.  If the first few bytes are FF FE,
FE FF, or EF BB BF, we know to reject the file.  Is this safe?
- UTF-16 will commonly have every other byte as zero (for text that is mostly
ASCII or Latin-1).  ISO-8859-1 text shouldn't contain zero bytes, as far as
I know.  Is it safe to reject any file that contains a zero byte if the user
told us it is ISO-8859-1?
- Besides looking for an optional BOM, I can't think of anything about a
UTF-8 file that would make it invalid or unusual as ISO-8859-1.  Any ideas?
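
The three checks in the list above could be sketched roughly as follows.  This
is only an illustration: the class and method names (Latin1Sniffer,
hasUnicodeBom, containsNul, looksLikeUtf8, suspiciousForLatin1) are
hypothetical, and the whole file is assumed to fit in a byte array.  All
three checks are heuristics, since the BOM bytes and the zero byte are
technically legal in ISO-8859-1, just very unlikely in a real text file.
The third check relies on one possible answer to the last question: text
that contains bytes above 7F and still decodes as strictly valid UTF-8 is
probably UTF-8, because such byte patterns are rare in genuine ISO-8859-1
text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Latin1Sniffer {

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // FF FE / FE FF are the UTF-16 BOMs, EF BB BF is the UTF-8 BOM.
    static boolean hasUnicodeBom(byte[] data) {
        return startsWith(data, 0xFF, 0xFE)
            || startsWith(data, 0xFE, 0xFF)
            || startsWith(data, 0xEF, 0xBB, 0xBF);
    }

    // A zero byte anywhere hints at UTF-16 (or a binary file).
    static boolean containsNul(byte[] data) {
        for (byte b : data) {
            if (b == 0) return true;
        }
        return false;
    }

    // True if data has at least one byte above 0x7F and is valid UTF-8.
    static boolean looksLikeUtf8(byte[] data) {
        boolean sawHighByte = false;
        for (byte b : data) {
            if (b < 0) { sawHighByte = true; break; }
        }
        // Pure ASCII reads the same in both encodings: nothing to flag.
        if (!sawHighByte) return false;
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Combined heuristic for a file the user declared as ISO-8859-1.
    static boolean suspiciousForLatin1(byte[] data) {
        return hasUnicodeBom(data) || containsNul(data) || looksLikeUtf8(data);
    }
}
```

Since none of these is proof, it may be safer to warn the user and ask for
confirmation when suspiciousForLatin1 returns true than to reject the upload
outright.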

As always, your help is greatly appreciated.

Lenny Turetsky
Senior Member, Technical Staff
i18n Man of Mystery
The Landmark @ One Market
Suite 300
San Francisco, CA 94105 USA

Received on Tuesday, 4 September 2001 20:15:11 UTC