- From: Marcos Caceres <marcosscaceres@gmail.com>
- Date: Fri, 30 Nov 2007 22:52:49 +1000
- To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
- Cc: www-international@w3.org, public-appformats@w3.org
Hi All,

Ok, I've changed the Widgets spec [1] to match the Zip specification (as
Bjoern suggested), including mandating the use of CP437, and added a
warning about the interoperability issues of going outside the US-ASCII
range.

This issue about file name encodings will remain open for now; any
additional comments and suggestions are welcome.

Kind regards,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/#zip-archive0

On Nov 30, 2007 10:29 PM, Marcos Caceres <marcosscaceres@gmail.com> wrote:
> Hi Bjoern,
>
> On Nov 30, 2007 10:14 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> > * Marcos Caceres wrote:
> > >The WAF Working group is seeking assistance with an i18n problem we
> > >are having with our Widgets 1.0 specification [1]. The issue we are
> > >having is to do with determining the encoding of file names within Zip
> > >archives. Here is an overview of the problem:
> >
> > This is really an issue with the "ZIP" specification and deployed
> > software, not with the "Widgets" specification. It does not seem useful
> > to say anything about this in the Widgets specification beyond saying
> > the archive should be created in accordance with the ZIP specification
> > and that there may be interoperability issues with using non-ASCII
> > names, so those should be avoided, which should be quite normal for
> > authors.
>
> I'm totally ok with doing that... I guess as long as it won't raise
> any issues later because we didn't really provide a solution to the
> problem. Would this be ok with the i18n community? (i.e., make it the
> Zip/implementer's problem).
>
> > >The main problem is that there is no way, AFAIK, to determine the
> > >encoding of file names inside a Zip archive when you hit any bytes
> > >that are beyond the ASCII range (could be either cp437 or UTF-8?).
> >
> > I would not be surprised if there are actually more options than this,
> > but it's fairly easy to distinguish these two encodings for file names
> > since it is rather difficult to create a sequence of octets that is
> > valid UTF-8 and represents a reasonable file name in UTF-8 and CP437.
> > A heuristic could simply go like this:
> >
> >   if decode_cp437(input) is a reasonable file name or
> >      input is not valid utf-8, then use cp437;
> >   else use utf8;
> >
> > Reasonable file names do not include box drawings, unassigned code
> > points, or mathematical symbols outside the ASCII range. But as above,
> > it's not really an issue for the Widgets specification, and authors
> > are best off if they avoid non-ASCII names.
>
> Ok, thanks for the tip.
>
> Kind regards,
> Marcos
>
> --
> Marcos Caceres
> http://datadriven.com.au
>

--
Marcos Caceres
http://datadriven.com.au
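
For reference, the heuristic Bjoern describes above could be sketched in Python roughly as follows. This is only an illustration, not part of the Widgets or Zip specifications: the function names (`is_reasonable_name`, `is_valid_utf8`, `guess_zip_name_encoding`) are made up for this sketch, and the "reasonable file name" test is one possible reading of his description (no box drawings, no unassigned code points, no mathematical symbols outside the ASCII range).

```python
import unicodedata


def is_reasonable_name(name: str) -> bool:
    """Assumed reading of 'reasonable': reject box drawings, unassigned
    code points, and non-ASCII mathematical symbols."""
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            continue  # anything in the ASCII range is accepted here
        if 0x2500 <= cp <= 0x257F:
            return False  # Unicode Box Drawing block
        if unicodedata.category(ch) in ("Sm", "Cn"):
            return False  # math symbol or unassigned code point
    return True


def is_valid_utf8(raw: bytes) -> bool:
    """True if the octets decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_zip_name_encoding(raw: bytes) -> str:
    """Return 'cp437' or 'utf-8' for a raw Zip file name."""
    # Python's cp437 codec maps every byte value, so this cannot fail.
    as_cp437 = raw.decode("cp437")
    if is_reasonable_name(as_cp437) or not is_valid_utf8(raw):
        return "cp437"
    return "utf-8"


if __name__ == "__main__":
    for text, enc in (("résumé.txt", "cp437"), ("résumé.txt", "utf-8")):
        raw = text.encode(enc)
        print(enc, "->", guess_zip_name_encoding(raw))
```

As with any heuristic of this kind, some UTF-8 byte sequences may still decode to CP437 text that passes the "reasonable" check, so the guess can be wrong; authors remain best off avoiding non-ASCII names.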
Received on Friday, 30 November 2007 12:52:57 UTC