Re: auto-detecting the character encoding of an uploaded file

At 17:14 01/09/04 -0700, Lenny Turetsky wrote:
>We have a web application where a user uploads a file that could be in one
>of several different encodings (ISO-8859-1, SHIFT-JIS, UTF-8, UTF-16).  We
>ask the user what the encoding is, so we know how to decode the file.  We
>would like to do some error checking, though, to help prevent users from
>choosing the wrong encoding.
>
>UTF-8 and UTF-16 do have some restrictions on the encoding.  That is good
>for us.  If the Java InputStreamReader class we are using tosses an
>exception, we know it was in the wrong encoding.
>
>Can we detect if a file is really in some other format, if the user
>specifies ISO-8859-1?  That is the default, since that is what is generated
>by a majority of the applications that our users use, and many users are too
>dumb to know what the correct one is.  Can we detect some common cases where
>it's in a different format?

The most common would be windows-1252, an extension of iso-8859-1.
This is easily detected when you find bytes in the range 0x80-0x9F.
Even in the US, people often use smart quotes, ellipses, and so on,
which are not available in iso-8859-1.
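
Since you are in Java anyway, here is a minimal sketch of such a
check (the method name is mine, nothing standard):

    // Bytes 0x80-0x9F are C1 control codes in iso-8859-1, which
    // essentially never occur in text, but are printable characters
    // (smart quotes, ellipsis, em dash,...) in windows-1252.
    static boolean looksLikeWindows1252(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;            // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;             // almost certainly windows-1252
            }
        }
        return false;
    }

If this returns true for a file declared as iso-8859-1, decoding it
as windows-1252 instead is usually the right guess.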


>- UTF-8 and UTF-16 sometimes have a Byte Order Marker (BOM).  This is
>especially true as generated by Microsoft applications (Excel, Notepad),
>which many of our users use.  If we see the first few bytes as FF FE, FE FF,
>or EF BB BF, we know to reject it.  Is this safe?

Not 100%, because all three sequences are also perfectly legal
iso-8859-1 text ('ÿþ', 'þÿ', 'ï»¿'), but a real file is extremely
unlikely to start with them, so in practice the check is safe.
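
A sketch of the BOM check, in case it is useful (again, the method
name is just illustrative):

    // Returns the encoding signalled by a leading BOM, or null if none.
    static String detectBom(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null;    // no BOM; proceed with the other checks
    }

As far as I know, Notepad's 'Unicode' save option writes exactly the
FF FE signature, so this case does come up in practice.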


>- UTF-16 will commonly have every other byte as zero.  ISO-8859-1 shouldn't
>be using zero byte code, as far as I know.  Is it safe to reject any file
>with a zero byte code if the user told us it is ISO-8859-1?

Yes. A NUL byte essentially never appears in legitimate text, so the
same check works for most other encodings as well.
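
The position of the NUL bytes even tells you the likely byte order,
at least for mostly-Latin text (a sketch, same caveats as above):

    // For Latin-ish text, UTF-16BE puts the zero (high) byte at even
    // offsets, UTF-16LE at odd offsets. No NULs at all is consistent
    // with iso-8859-1.
    static String guessFromNulBytes(byte[] d) {
        int even = 0, odd = 0;
        for (int i = 0; i < d.length; i++) {
            if (d[i] == 0) {
                if (i % 2 == 0) even++; else odd++;
            }
        }
        if (even + odd == 0) return null;   // no NUL bytes found
        return even > odd ? "UTF-16BE" : "UTF-16LE";
    }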


>- Besides looking for an optional BOM, I can't think of anything about a
>UTF-8 file that would make it invalid or unusual ISO-8859-1.  Any ideas?

Everything that is not just us-ascii and that passes as utf-8
is extremely likely to actually be utf-8. For some additional
information, please see my paper at
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
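
One caveat on the InputStreamReader approach: if I remember
correctly, its default behavior is to substitute U+FFFD for
malformed input rather than to throw, so to use decoding failure as
a signal you want an explicitly strict decoder. A sketch with
java.nio.charset (requires a reasonably recent Java; the method name
is mine):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Strict UTF-8 validation: any malformed sequence (including
    // overlong encodings) makes decode() throw.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

Combine this with a check that at least one byte is >= 0x80: a pure
us-ascii file is valid utf-8 and valid iso-8859-1 alike, so passing
the decoder tells you nothing in that case.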

If you know that the file is e.g. Japanese, or have some other
additional information, detection is usually rather easy and
succeeds at a good rate, based on simple bit-pattern analysis.
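
For illustration, a crude Shift_JIS consistency check along those
lines; the byte ranges below are the standard JIS X 0208 layout, and
I am ignoring the vendor extensions with lead bytes above 0xEF:

    // Walks the bytes, accepting ASCII, half-width katakana
    // (0xA1-0xDF), and well-formed double-byte pairs; anything else
    // rules Shift_JIS out.
    static boolean consistentWithShiftJis(byte[] d) {
        int i = 0;
        while (i < d.length) {
            int b = d[i] & 0xFF;
            if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF)) {
                i++;                          // single-byte character
            } else if ((b >= 0x81 && b <= 0x9F)
                    || (b >= 0xE0 && b <= 0xEF)) {
                if (i + 1 >= d.length) return false;
                int t = d[i + 1] & 0xFF;      // trail byte of a pair
                if (t < 0x40 || t > 0xFC || t == 0x7F) return false;
                i += 2;
            } else {
                return false;                 // invalid in Shift_JIS
            }
        }
        return true;
    }

Run the analogous check for each candidate encoding and keep
whichever ones survive; for Japanese text you are almost always left
with a single survivor.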

For other cases, you don't get much unless you use statistical
models based on letter and letter combination frequencies in
particular languages, and the byte patterns these produce in
particular encodings.

At the tough end, it is actually impossible to distinguish between
iso-8859-1 and iso-8859-2 for German texts, because the bytes for
the characters German uses (ä, ö, ü, ß,...) are exactly the same in
both. But maybe in this case, it doesn't matter too much.

Regards,   Martin.

Received on Wednesday, 5 September 2001 03:41:04 UTC