Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

* Marcos Caceres wrote:
>The WAF Working group is seeking assistance with an i18n problem we
>are having with our Widgets 1.0 specification [1]. The issue we are
>having is to do with determining the encoding of file names within Zip
>archives. Here is an overview of the problem:

This is really an issue with the "ZIP" specification and deployed
software, not with the "Widgets" specification. It does not seem useful
to say anything about this in the Widgets specification beyond two
points: the archive should be created in accordance with the ZIP
specification, and non-ASCII names may cause interoperability problems
and should therefore be avoided. Avoiding them should be quite normal
for authors.

>The main problem is that there is no way, AFAIK, to determine the
>encoding of file names inside a Zip archive when you hit any bytes
>that are beyond the ASCII range (could be either cp437 or UTF-8?).

I would not be surprised if there are actually more options than this,
but it's fairly easy to distinguish these two encodings for file names,
since it is rather difficult to create a sequence of octets that is
valid UTF-8 and represents a reasonable file name in both UTF-8 and
CP437. A heuristic could simply go like this:

  if decode_cp437(input) is a reasonable file name or
     input is not valid utf-8, then use cp437;
  else use utf-8;

Reasonable file names do not include box drawings, unassigned code
points, or mathematical symbols outside the ASCII range. But as above,
it's not really an issue for the Widgets specification, and authors
are best off if they avoid non-ASCII names.
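
The heuristic above could be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the function
names are made up for this example, "box drawings" is taken to mean the
Unicode Box Drawing block (U+2500 to U+257F), and "mathematical symbols"
and "unassigned code points" are taken to mean the Unicode general
categories Sm and Cn.

```python
import unicodedata


def is_reasonable_name(name: str) -> bool:
    """Return False if the name contains a character ruled out above.

    Assumption: "box drawings" means the Unicode Box Drawing block,
    and "mathematical symbols" / "unassigned" mean categories Sm / Cn.
    """
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII characters are never rejected by this check.
            continue
        if 0x2500 <= cp <= 0x257F:
            return False
        if unicodedata.category(ch) in ("Sm", "Cn"):
            return False
    return True


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_zip_name_encoding(raw: bytes) -> str:
    """Apply the heuristic: prefer cp437 unless only UTF-8 makes sense."""
    # Python's cp437 codec maps every byte value, so this cannot fail.
    as_cp437 = raw.decode("cp437")
    if is_reasonable_name(as_cp437) or not is_valid_utf8(raw):
        return "cp437"
    return "utf-8"


if __name__ == "__main__":
    for raw in (b"readme.txt", b"caf\x82.txt", b"caf\xc3\xa9.txt"):
        enc = guess_zip_name_encoding(raw)
        print(raw, enc, raw.decode(enc))
```

Note that this sketch only applies the three exclusions named above, so
a name whose CP437 reading happens to consist entirely of letters is
classified as CP437 even if it is also valid UTF-8.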
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Friday, 30 November 2007 12:15:08 UTC