- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Sun, 02 Dec 2007 15:15:42 +0900
- To: "Tex Texin" <tex@yahoo-inc.com>, "Marcos Caceres" <marcosscaceres@gmail.com>, "Richard Ishida" <ishida@w3.org>,<www-international@w3.org>
- Cc: "Arthur Barstow" <art.barstow@nokia.com>,<public-i18n-core@w3.org>, <public-appformats@w3.org>,"Thomas Roessler" <tlr@w3.org>
At 20:28 07/11/30, Tex Texin wrote:

>One improvement you can make is that if you have non-ASCII characters, you
>can assume UTF-8, but check that it is valid UTF-8.
>Most text in CP437 won't satisfy UTF-8 encoding rules.
>If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding
>principles, then you can assume it is CP437.
>
>Martin Duerst published a nice Perl expression for checking UTF-8
>
>http://www.w3.org/International/questions/qa-forms-utf-8.en.php

That regular expression was motivated by some earlier research described in
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf. I didn't
analyze CP437, but given that the combination of a box character followed by
an accented Latin character is quite rare, my conclusion would be that CP437
is as easy to distinguish from UTF-8 in practice as most other encodings.

Otherwise, I agree with Bjoern's conclusions except his very last one,
"authors are best off if they avoid non-ASCII names". In this day and age,
authors more and more assume that file names in various languages just work.
The zip spec does a good job of making this possible using UTF-8. It's a pity
that some implementations are not up to the job.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
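A minimal sketch of the heuristic Tex describes above, written here in Python rather than as the Perl expression from the W3C page: attempt a strict UTF-8 decode of the file-name bytes, and fall back to CP437 only when that fails. The function name and structure are illustrative assumptions, not code from this thread or from the zip specification.

    # Sketch of the detection heuristic discussed in the thread: a name that
    # decodes as strict UTF-8 is treated as UTF-8; anything else falls back
    # to CP437. Names and structure here are illustrative only.

    def decode_zip_name(raw: bytes) -> str:
        """Decode a raw zip file-name field using a UTF-8-first heuristic."""
        if all(b < 0x80 for b in raw):
            # Pure ASCII is identical in UTF-8 and CP437.
            return raw.decode("ascii")
        try:
            # Strict decoding rejects stray continuation bytes, overlong
            # forms, and surrogates, so random CP437 text rarely passes.
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume the legacy CP437 code page.
            return raw.decode("cp437")

    if __name__ == "__main__":
        print(decode_zip_name("résumé.txt".encode("utf-8")))  # decoded as UTF-8
        print(decode_zip_name(bytes([0xC9, 0xCD, 0xBB])))     # box-drawing run, CP437

A false positive would require CP437 high bytes that happen to line up into valid UTF-8 sequences, which corresponds to the point above about a box character followed by an accented Latin character being a rare combination in practice.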
Received on Sunday, 2 December 2007 06:18:24 UTC