Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

Hi All,
OK, I've changed the Widgets spec [1] to match the Zip specification
(as Bjoern suggested), including mandating the use of CP437, and
added a warning about the interoperability issues of going outside the
US-ASCII range.
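
For anyone who wants to check an existing package against that warning, here is a minimal sketch in Python, not part of the spec. The function name non_ascii_entry_names is hypothetical, and the sketch relies on Python's standard zipfile module to decode entry names (it uses UTF-8 when the archive sets the UTF-8 flag and CP437 otherwise).

```python
import io
import zipfile


def non_ascii_entry_names(archive_bytes: bytes) -> list[str]:
    """Return the entry names in a Zip archive that fall outside US-ASCII.

    Hypothetical helper for illustration only: it reports names an author
    may want to rename to avoid interoperability problems.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # str.isascii() is True only when every character is below U+0080.
        return [name for name in zf.namelist() if not name.isascii()]
```

An empty list means every file name in the archive stays within the US-ASCII range.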

The issue of file name encodings will remain open for now; any
additional comments and suggestions are welcome.

Kind regards,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/#zip-archive0


On Nov 30, 2007 10:29 PM, Marcos Caceres <marcosscaceres@gmail.com> wrote:
> Hi Bjoern,
>
> On Nov 30, 2007 10:14 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> > * Marcos Caceres wrote:
> > >The WAF Working group is seeking assistance with an i18n problem we
> > >are having with our Widgets 1.0 specification [1]. The issue we are
> > >having is to do with determining the encoding of file names within Zip
> > >archives. Here is an overview of the problem:
> >
> > This is really an issue with the "ZIP" specification and deployed
> > software, not with the "Widgets" specification. It does not seem useful to
> > say anything about this in the Widgets specification beyond saying the
> > archive should be created in accordance with the ZIP specification and
> > that there may be interoperability issues with using non-ASCII names,
> > so those should be avoided, which should be quite normal for authors.
>
> I'm totally OK with doing that... I guess as long as it won't raise
> any issues later because we didn't really provide a solution to the
> problem. Would this be OK with the i18n community? (i.e., make it
> a Zip/implementer's problem).
>
> > >The main problem is that there is no way, AFAIK, to determine the
> > >encoding of file names inside a Zip archive when you hit any bytes
> > >that are beyond the ASCII range (could be either cp437 or UTF-8?).
> >
> > I would not be surprised if there are actually more options than this,
> > but it's fairly easy to distinguish these two encodings for file names
> > since it is rather difficult to create a sequence of octets that is
> > valid UTF-8 and represents a reasonable file name in UTF-8 and CP437.
> > A heuristic could simply go like this:
> >
> >   if decode_cp437(input) is a reasonable file name or
> >      input is not valid utf-8, then use cp437;
> >   else use utf8;
> >
> > Reasonable file names do not include box drawings, unassigned code
> > points, or mathematical symbols outside the ASCII range. But as above,
> > it's not really an issue for the Widgets specification, and authors
> > are best off if they avoid non-ASCII names.
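
The heuristic quoted above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names (looks_reasonable, is_valid_utf8, guess_name_encoding) are hypothetical, "box drawings" is taken to mean the Unicode Box Drawing block (U+2500 to U+257F), and "mathematical symbols" is taken to mean Unicode general category Sm. Being a heuristic, it can still misclassify some names.

```python
import unicodedata


def looks_reasonable(name: str) -> bool:
    """Assumed reading of 'reasonable file name' from the thread: no box
    drawings, no unassigned code points, no non-ASCII math symbols."""
    for ch in name:
        if ord(ch) < 0x80:
            continue  # ASCII characters are always accepted here
        category = unicodedata.category(ch)
        if category == "Cn":  # unassigned code point
            return False
        if category == "Sm":  # mathematical symbol outside ASCII
            return False
        if 0x2500 <= ord(ch) <= 0x257F:  # Box Drawing block (assumption)
            return False
    return True


def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_name_encoding(raw: bytes) -> str:
    """Return 'cp437' or 'utf-8' for the raw bytes of a Zip file name."""
    # Python's cp437 codec maps every byte value, so this decode cannot fail.
    if looks_reasonable(raw.decode("cp437")) or not is_valid_utf8(raw):
        return "cp437"
    return "utf-8"
```

For example, the UTF-8 bytes of "café.txt" decode under CP437 to a string containing a box-drawing character, so the sketch picks "utf-8", while the single CP437 byte 0x82 for "é" is not valid UTF-8 and yields "cp437".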
>
> Ok, thanks for the tip.
>
> Kind regards,
> Marcos
>
> --
> Marcos Caceres
> http://datadriven.com.au
>



-- 
Marcos Caceres
http://datadriven.com.au

Received on Friday, 30 November 2007 12:52:57 UTC