- From: Thierry Sourbier <webmaster@i18ngurus.com>
- Date: Fri, 7 Sep 2001 06:17:30 +0200
- To: "W3intl (E-mail)" <www-international@w3.org>
> Yes. Shift_JIS can have bytes in the 0x80-0x9F range, but Latin-1 doesn't.
> If there is such a byte, Shift_JIS cannot be misinterpreted as Latin-1.
> It may be misinterpreted as windows-1252, but that's a different story.

The value of one byte may indeed not be enough to differentiate between various encodings, but in most European languages it is fairly rare to have consecutive extended characters (by extended I mean with a code value above 127). Therefore a Shift_JIS-encoded Japanese text and a European windows-1252 one are fairly easy to tell apart when you look at the entire stream.

Lenny, you might want to have a look at TextCat. This little tool (whose source code is available) helps you recognize 69 language and encoding combinations. It can probably be extended to more without much effort. Byte pattern analysis is probably the most generic way to go; it gives great results even on fairly small texts.

TextCat can be found at:
http://odur.let.rug.nl/~vannoord/TextCat/

More links on language identification tools & techniques can be found at:
http://www.i18ngurus.com/docs/998504805.html

Cheers,
Thierry

<><><><><><><><><><><><><><><><><><><><><><>
www.i18ngurus.com - Open Internationalization Resources Directory
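P.S. The "consecutive extended characters" observation can be sketched in a few lines of Python. This is only a rough illustration of that one heuristic, not TextCat's method; the function names and the 0.5 threshold are my own assumptions and would need tuning on real data.

```python
def high_byte_pair_ratio(data: bytes) -> float:
    """Fraction of bytes >= 0x80 that are immediately followed by another such byte."""
    high = [b >= 0x80 for b in data]
    total = sum(high)
    if total == 0:
        return 0.0  # pure ASCII: no extended bytes at all
    paired = sum(1 for a, b in zip(high, high[1:]) if a and b)
    return paired / total


def guess_family(data: bytes, threshold: float = 0.5) -> str:
    """Crude guess: many adjacent extended bytes suggest a multibyte encoding.

    The threshold is an arbitrary assumption, not a measured value.
    """
    if high_byte_pair_ratio(data) > threshold:
        return "multibyte-like"    # e.g. Shift_JIS Japanese text
    return "single-byte-like"      # e.g. windows-1252 European text


japanese = "こんにちは".encode("shift_jis")
french = "Le café est fermé à Noël.".encode("cp1252")
print(guess_family(japanese), guess_family(french))
```

On very short or mostly ASCII input the ratio says little, so a real detector would combine it with other evidence, such as the byte pattern statistics mentioned above.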
Received on Friday, 7 September 2001 00:12:20 UTC