- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Sun, 02 Dec 2007 15:15:42 +0900
- To: "Tex Texin" <tex@yahoo-inc.com>, "Marcos Caceres" <marcosscaceres@gmail.com>, "Richard Ishida" <ishida@w3.org>,<www-international@w3.org>
- Cc: "Arthur Barstow" <art.barstow@nokia.com>,<public-i18n-core@w3.org>, <public-appformats@w3.org>,"Thomas Roessler" <tlr@w3.org>
At 20:28 07/11/30, Tex Texin wrote:

>One improvement you can make is that if you have non-ASCII characters, you
>can assume UTF-8, but check that it is valid UTF-8.
>Most text in CP437 won't satisfy UTF-8 encoding rules.
>If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding
>principles, then you can assume it is CP437.
>
>Martin Duerst published a nice Perl expression for checking UTF-8
>
>http://www.w3.org/International/questions/qa-forms-utf-8.en.php

That regular expression was motivated by some earlier research described in
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
I didn't analyze CP437, but given that the combination of a box character
followed by an accented Latin character is quite rare, my conclusion would be
that CP437 is as easy to distinguish from UTF-8 in practice as most other
encodings.

Otherwise, I agree with Bjoern's conclusions except his very last one,
"authors are best off if they avoid non-ASCII names". In this day and age,
authors increasingly assume that file names in various languages just work.
The zip spec does a good job making this possible using UTF-8. It's a pity
that some implementations are not up to the job.

Regards,    Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
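The heuristic Tex describes above can be sketched in a few lines of Perl (Perl
only because the page cited above gives a Perl expression): accept a name as
UTF-8 only if its bytes are well-formed UTF-8, otherwise fall back to CP437.
This is a minimal illustration, not the expression from the W3C page; it uses
Encode's strict UTF-8 decoder instead, and the subroutine name and return
labels are made up for the example.

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    # Guess the encoding of a zip entry name given as a raw byte string.
    # Illustrative only: the name and the return labels are not from any spec.
    sub guess_zip_name_encoding {
        my ($bytes) = @_;

        # Pure ASCII looks the same in UTF-8 and CP437, so there is nothing
        # to decide.
        return 'ASCII' if $bytes !~ /[\x80-\xFF]/;

        # decode() with FB_CROAK dies on malformed input, so a successful
        # eval means the bytes are well-formed (strict) UTF-8.
        my $ok = eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
        return $ok ? 'UTF-8' : 'CP437';
    }

For what it's worth, this also makes Martin's point concrete: roughly the only
way a CP437 name slips past such a check is a byte in 0xC2-0xDF (box-drawing
and block characters in CP437) immediately followed by a byte in 0x80-0xBF
(mostly accented letters), e.g. 0xC9 0x82, which is the rare "box character
followed by an accented Latin character" pairing mentioned above (the three-
and four-byte UTF-8 patterns are even less likely to occur by accident).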
Received on Sunday, 2 December 2007 06:18:25 UTC