RE: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

At 20:28 07/11/30, Tex Texin wrote:

>One improvement you can make is that if you have non-ASCII characters, you 
>can assume UTF-8, but check that it is valid UTF-8.
>Most text in CP437 won't satisfy UTF-8 encoding rules.
>If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding 
>principles, then you can assume it is CP437.
>
>Martin Duerst published a nice Perl expression for checking UTF-8
>
>http://www.w3.org/International/questions/qa-forms-utf-8.en.php

That regular expression was motivated by some earlier research described in
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.

I didn't analyze CP437 in detail, but given that the combination of
a box-drawing character followed by an accented Latin character is
quite rare in real text, my conclusion would be that CP437 is as easy
to distinguish from UTF-8 in practice as most other encodings.
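
To make the heuristic Tex describes concrete, here is a minimal sketch
in Python. It is only an illustration, not something taken from the zip
spec or the Widgets draft: the function name decode_zip_name is made up
for this example, and it relies on Python's built-in strict "utf-8"
codec for the validity check rather than on the regular expression
from the page above.

```python
from typing import Tuple


def decode_zip_name(raw: bytes) -> Tuple[str, str]:
    """Guess the encoding of a raw file name (hypothetical helper).

    Returns the decoded name together with the encoding that was assumed.
    """
    try:
        # Pure ASCII reads the same in UTF-8 and CP437, so no guess is needed.
        return raw.decode("ascii"), "ascii"
    except UnicodeDecodeError:
        pass

    try:
        # Non-ASCII bytes present: accept the name as UTF-8 only if it is
        # well-formed. Strict decoding raises on malformed sequences.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to assuming CP437.
        return raw.decode("cp437"), "cp437"
```

For example, the bytes b"r\x82sum\x82.txt" are not valid UTF-8 (0x82
cannot start a sequence), so the sketch falls back to CP437 and yields
"résumé.txt". The same name encoded as UTF-8 passes the strict check
and is reported as "utf-8".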

Otherwise, I agree with Bjoern's conclusions except his very last
one, "authors are best off if they avoid non-ASCII names". In this
day and age, authors more and more assume that file names in various
languages just work. The zip spec does a good job making this possible
using UTF-8. It's a pity that some implementations are not up to the job.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Sunday, 2 December 2007 06:26:24 UTC