
Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Mon, 3 Dec 2007 11:36:09 +1000
Message-ID: <b21a10670712021736y7fdeb795v2154e98526f19819@mail.gmail.com>
To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
Cc: "Tex Texin" <tex@yahoo-inc.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, public-appformats@w3.org, "Thomas Roessler" <tlr@w3.org>

Hi Martin,

On Dec 2, 2007 4:15 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 20:28 07/11/30, Tex Texin wrote:
>
> >One improvement you can make is that if you have non-ASCII characters, you
> >can assume UTF-8, but check that it is valid UTF-8.
> >Most text in CP437 won't satisfy UTF-8 encoding rules.
> >If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding
> >principles, then you can assume it is CP437.
> >
> >Martin Duerst published a nice Perl expression for checking UTF-8
> >
> >http://www.w3.org/International/questions/qa-forms-utf-8.en.php
>
> That regular expression was motivated by some earlier research described in
> http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
>
> I didn't analyze CP437, but given that the combination of a box
> character followed by an accented Latin character is quite rare,
> my conclusion would be that CP437 is as easy to distinguish from
> UTF-8 in practice as most other encodings.
>
> Otherwise, I agree with Bjoern's conclusions except his very last
> one, "authors are best off if they avoid non-ASCII names". In this
> day and age, authors more and more assume that file names in various
> languages just work. The zip spec does a good job making this possible
> using UTF-8. It's a pity that some implementations are not up to the job.

I also agree with Bjoern about deferring the problem to Zip
implementations by mandating that implementers follow the Zip Spec
(v6.3+) with regard to UTF-8. An implementer of the Widget spec who is
aware of this internationalization problem could then write a custom
file name processor to handle the CP437 versus UTF-8 ambiguity. The
sense I get from this thread is that leaving implementers to deal with
Zip file name problems on their own would be OK, and that the Widget
spec should simply assume it lives in a world where Zip files conform
to version 6.3+ of the Zip spec. This still bothers me a little,
because we know there is a problem, and it is probably an easy fix
that would achieve much greater interoperability.
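
To make the idea concrete, here is a minimal sketch of the fallback Tex
describes above: try strict UTF-8 first, and treat the name as CP437
only if that fails. The function name decode_zip_filename is
hypothetical, and the sketch assumes the raw file name bytes have
already been read from the Zip entry. It relies on Python's built-in
"utf-8" and "cp437" codecs rather than Martin's regular expression.

```python
def decode_zip_filename(raw: bytes) -> str:
    """Decode the raw file name bytes of a Zip entry (hypothetical helper).

    Assumption: names are either UTF-8 or CP437, as discussed in this thread.
    Pure ASCII names decode identically under both encodings.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the legacy Zip encoding.
        return raw.decode("cp437")


if __name__ == "__main__":
    for name in (b"index.html", "caf\u00e9.txt".encode("utf-8"), b"caf\x82.txt"):
        print(decode_zip_filename(name))
```

As Martin notes, a CP437 name can in principle also be a valid UTF-8
byte sequence, so this heuristic can misfire; it only reflects the
observation that such names are rare in practice.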

Kind regards,
-- 
Marcos Caceres
http://datadriven.com.au
Received on Monday, 3 December 2007 01:36:21 GMT
