Re: auto-detecting the character encoding of an uploaded file

From: Thierry Sourbier <webmaster@i18ngurus.com>
Date: Fri, 7 Sep 2001 06:17:30 +0200
Message-ID: <011501c13754$031d0aa0$9c48fea9@dell400>
To: "W3intl (E-mail)" <www-international@w3.org>
> Yes. Shift_JIS can have bytes in the 0x80-0x9F range, but Latin-1 doesn't.
> If there is such a byte, Shift_JIS cannot be misinterpreted as Latin-1.
> It may be misinterpreted as windows-1252, but that's a different story.

The value of a single byte may indeed not be enough to differentiate between
various encodings, but in most European languages it is fairly rare to have
consecutive extended characters (by extended I mean with a code value above
127). A Shift_JIS-encoded Japanese text and a European windows-1252 text are
therefore fairly easy to tell apart when you look at the entire stream.
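
As a rough illustration of that observation, here is a minimal Python sketch
that measures how often bytes above 127 sit next to another such byte. The
function name and the sample sentences are made up for this example, and the
ratio is only a hint, not a detector: Shift_JIS trail bytes can also fall in
the ASCII range (katakana often does), so a real tool should look at more
than this one number.

```python
def high_byte_pair_ratio(data: bytes) -> float:
    """Fraction of bytes above 127 that touch another byte above 127."""
    high = [b > 127 for b in data]
    total = sum(high)
    if total == 0:
        return 0.0  # pure ASCII: nothing to measure
    paired = 0
    for i, is_high in enumerate(high):
        if not is_high:
            continue
        prev_high = i > 0 and high[i - 1]
        next_high = i + 1 < len(high) and high[i + 1]
        if prev_high or next_high:
            paired += 1
    return paired / total


# Hypothetical sample texts, chosen only for illustration.
japanese = "これはにほんごのぶんしょうです。".encode("shift_jis")
french = "Le café est près de la gare, à côté de l'hôtel.".encode("cp1252")

print(high_byte_pair_ratio(japanese))  # high: extended bytes come in runs
print(high_byte_pair_ratio(french))    # low: accented letters are isolated
```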

Lenny, you might want to have a look at TextCat. This little tool (whose
source code is available) recognizes 69 language and encoding
combinations, and it can probably be extended to more without much effort.
Byte pattern analysis is probably the most generic way to go; it gives good
results even on fairly small amounts of text.
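
To give an idea of what byte pattern analysis can look like, here is a small
Python sketch that ranks the most frequent byte n-grams of a text and compares
that ranking against reference profiles. It is not TextCat's code; the function
names, the profile size of 300, the labels, and the tiny training sentences are
all assumptions made for this example. Real profiles would be built from much
larger samples.

```python
from collections import Counter
from typing import Dict, List

PROFILE_SIZE = 300  # assumed cut-off; also used as the penalty for a missing n-gram


def ngram_profile(data: bytes, n_max: int = 3) -> List[bytes]:
    """Return the most frequent byte n-grams (n = 1..n_max), most frequent first."""
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count, then by byte value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:PROFILE_SIZE]]


def out_of_place(doc: List[bytes], ref: List[bytes]) -> int:
    """Sum of rank differences; n-grams absent from ref cost PROFILE_SIZE each."""
    ref_rank = {gram: rank for rank, gram in enumerate(ref)}
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else PROFILE_SIZE
        for rank, gram in enumerate(doc)
    )


def classify(data: bytes, profiles: Dict[str, List[bytes]]) -> str:
    """Pick the label whose reference profile is closest to the data."""
    doc = ngram_profile(data)
    return min(profiles, key=lambda label: out_of_place(doc, profiles[label]))


# Toy reference profiles built from hypothetical one-line samples.
profiles = {
    "French / windows-1252": ngram_profile(
        "Le chat est sur la table et le chien est dans le jardin "
        "près de la maison.".encode("cp1252")
    ),
    "Japanese / Shift_JIS": ngram_profile(
        "これはにほんごのぶんしょうです。わたしはがくせいです。".encode("shift_jis")
    ),
}

upload = "わたしはにほんごをべんきょうしています。".encode("shift_jis")
print(classify(upload, profiles))
```

Because the comparison works on raw bytes, the same text in two different
encodings yields two different profiles, which is what lets one pass identify
a language and encoding combination at once.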

TextCat can be found at:
http://odur.let.rug.nl/~vannoord/TextCat/

More links on language identification tools & techniques can be found at:
http://www.i18ngurus.com/docs/998504805.html

Cheers,
Thierry

<><><><><><><><><><><><><><><><><><><><><><>
www.i18ngurus.com - Open Internationalization Resources Directory
Received on Friday, 7 September 2001 00:12:20 GMT