Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Fri, 30 Nov 2007 22:29:16 +1000
Message-ID: <b21a10670711300429o15e53377tf979fe73fe06391e@mail.gmail.com>
To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
Cc: www-international@w3.org, public-appformats@w3.org

Hi Bjoern,

On Nov 30, 2007 10:14 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> * Marcos Caceres wrote:
> >The WAF Working group is seeking assistance with an i18n problem we
> >are having with our Widgets 1.0 specification [1]. The issue we are
> >having is to do with determining the encoding of file names within Zip
> >archives. Here is an overview of the problem:
>
> This is really an issue with the "ZIP" specification and deployed
> software, not with the "Widgets" specification. It does not seem useful to
> say anything about this in the Widgets specification beyond saying the
> archive should be created in accordance with the ZIP specification and
> that there may be interoperability issues with using non-ASCII names,
> so those should be avoided, which should be quite normal for authors.

I'm totally OK with doing that, as long as it won't raise any issues
later because we didn't really provide a solution to the problem.
Would this be OK with the i18n community (i.e., making it the
Zip implementers' problem)?

> >The main problem is that there is no way, AFAIK, to determine the
> >encoding of file names inside a Zip archive when you hit any bytes
> >that are beyond the ASCII range (could be either cp437 or UTF-8?).
>
> I would not be surprised if there are actually more options than this,
> but it's fairly easy to distinguish these two encodings for file names
> since it is rather difficult to create a sequence of octets that is
> valid UTF-8 and represents a reasonable file name in both UTF-8 and CP437.
> A heuristic could simply go like this:
>
>   if decode_cp437(input) is a reasonable file name or
>      input is not valid utf-8, then use cp437;
>   else use utf8;
>
> Reasonable file names do not include box drawings, unassigned code
> points, or mathematical symbols outside the ASCII range. But as above,
> it's not really an issue for the Widgets specification, and authors
> are best off if they avoid non-ASCII names.

Ok, thanks for the tip.
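
For illustration, the heuristic quoted above might be sketched in Python
roughly as follows. This is only a sketch under stated assumptions, not
anything taken from the Zip or Widgets specifications: the function names
(looks_reasonable, is_valid_utf8, guess_zip_name_encoding) are hypothetical,
and "reasonable" is approximated here as "contains no non-ASCII box-drawing
or block characters, mathematical symbols, or unassigned code points".

```python
import unicodedata


def looks_reasonable(name: str) -> bool:
    """Approximate the 'reasonable file name' test (an assumption of this sketch).

    Rejects names containing non-ASCII box-drawing/block characters,
    mathematical symbols, or unassigned code points.
    """
    for ch in name:
        code = ord(ch)
        if code < 0x80:
            continue  # ASCII is always accepted by this check
        if 0x2500 <= code <= 0x259F:
            return False  # Box Drawing and Block Elements
        if unicodedata.category(ch) in ("Sm", "Cn"):
            return False  # math symbol or unassigned code point
    return True


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_zip_name_encoding(raw: bytes) -> str:
    """Apply the quoted heuristic to the raw bytes of a file name."""
    # Python's cp437 codec maps every byte value, so this decode cannot fail.
    if looks_reasonable(raw.decode("cp437")) or not is_valid_utf8(raw):
        return "cp437"
    return "utf-8"


if __name__ == "__main__":
    for sample in ("résumé.txt".encode("utf-8"), "résumé.txt".encode("cp437")):
        print(sample, guess_zip_name_encoding(sample))
```

Pure-ASCII names come out as "cp437" here, which is harmless since both
encodings agree on that range. The notion of "reasonable" in this sketch is
deliberately narrow, so a real implementation would likely need a stricter
check.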

Kind regards,
Marcos
-- 
Marcos Caceres
http://datadriven.com.au
Received on Friday, 30 November 2007 12:29:48 GMT