RE: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

Bjoern,

If we are discussing what is reasonable: reasonable software has not used CP437 as an encoding for at least the last 10 years. ;-)

FWIW, a reasonable software system wouldn't allow drawing characters in a name, but DOS did, and so many users did use smiley faces, card suits, and the like in their file names. Generated filenames often used them.

If we were talking about ancient zip files, then we might have to tolerate such names. But as we are discussing new specs, instead of supporting all possible zip files, why not specify the use of zip files with the constraint that filenames be stored as UTF-8, and make non-UTF-8 names unacceptable? Perhaps with modern tools, this isn't a problem.
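
As a rough sketch of what such a constraint could look like to a consumer (in Python, purely for illustration): the ZIP format defines general purpose bit 11 to mark an entry name as UTF-8, and Python's zipfile module exposes it as ZipInfo.flag_bits. The function name non_utf8_names is hypothetical, and treating "non-ASCII name without the flag" as unacceptable is an assumption about how the rule might be worded, not something the Widgets spec says.

```python
import io
import zipfile

# General purpose bit 11 of a ZIP entry: name and comment are UTF-8.
UTF8_FLAG = 0x800


def non_utf8_names(archive):
    """Return entry names that are non-ASCII but not flagged as UTF-8.

    `archive` is a path or a seekable binary file object.
    Names are returned as zipfile decoded them (CP437 when unflagged).
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if not info.filename.isascii() and not info.flag_bits & UTF8_FLAG
        ]


if __name__ == "__main__":
    # Python's own writer sets the UTF-8 flag for non-ASCII names,
    # so an archive it produces should pass the check.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("grüße.txt", "hello")
    print(non_utf8_names(buf))
```

Pure ASCII names pass either way, since ASCII bytes mean the same thing in both encodings.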

I would like to preserve the ability of authors to use non-ASCII file and pathnames.
We can still suggest that non-ASCII characters may cause problems.

Separately, but related, a problem with filenames is case-sensitivity. It is not uncommon to have problems installing a zip file, authored on a case-sensitive system, onto a case-insensitive file system, because names that differ only in case collide and one file can overwrite another.
It might be good to warn authors to avoid this problem as well.
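
A packaging tool could flag this before the archive ships. The following is a minimal sketch in Python; the function name case_collisions is hypothetical, and str.casefold() is only an approximation, since each case-insensitive file system applies its own folding rules.

```python
from collections import defaultdict


def case_collisions(names):
    """Group entry names that become identical under Unicode case folding."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one member are potential collisions.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    entries = ["README.txt", "readme.TXT", "img/a.png"]
    print(case_collisions(entries))  # [['README.txt', 'readme.TXT']]
```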

tex

-----Original Message-----
From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Bjoern Hoehrmann
Sent: Friday, November 30, 2007 4:15 AM
To: Marcos Caceres
Cc: www-international@w3.org; public-appformats@w3.org
Subject: Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]


* Marcos Caceres wrote:
>The WAF Working group is seeking assistance with an i18n problem we are 
>having with our Widgets 1.0 specification [1]. The issue we are having 
>is to do with determining the encoding of file names within Zip 
>archives. Here is an overview of the problem:

This is really an issue with the "ZIP" specification and deployed software, not with the "Widgets" specification. It does not seem useful to say anything about this in the Widgets specification beyond saying the archive should be created in accordance with the ZIP specification and that there may be interoperability issues with using non-ASCII names, so those should be avoided, which should be quite normal for authors.

>The main problem is that there is no way, AFAIK, to determine the 
>encoding of file names inside a Zip archive when you hit any bytes 
>that are beyond the ASCII range (could be either cp437 or UTF-8?).

I would not be surprised if there are actually more options than this, but it's fairly easy to distinguish these two encodings for file names, since it is rather difficult to create a sequence of octets that is valid UTF-8 and represents a reasonable file name in both UTF-8 and CP437.
A heuristic could simply go like this:

  if decode_cp437(input) is a reasonable file name or
     input is not valid utf-8, then use cp437;
  else use utf8;

Reasonable file names do not include box drawings, unassigned code points, or mathematical symbols outside the ASCII range. But as above, it's not really an issue for the Widgets specification, and authors are best off if they avoid non-ASCII names.
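
The heuristic above could be sketched in Python roughly as follows. The helper names (is_reasonable_name, is_valid_utf8, guess_encoding) are hypothetical, and the test for "reasonable" is one possible reading: it rejects the Unicode Box Drawing and Block Elements ranges (the latter is an added assumption), unassigned code points, and non-ASCII math symbols. It remains a heuristic and can still guess wrong for some inputs.

```python
import unicodedata


def is_reasonable_name(name):
    """Reject names with box drawings, unassigned code points,
    or math symbols outside the ASCII range."""
    for ch in name:
        if ch.isascii():
            continue
        # U+2500..U+259F: Box Drawing plus Block Elements (assumption).
        if 0x2500 <= ord(ch) <= 0x259F:
            return False
        # Cn = unassigned, Sm = math symbol.
        if unicodedata.category(ch) in ("Cn", "Sm"):
            return False
    return True


def is_valid_utf8(raw):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(raw):
    """Pick 'cp437' or 'utf-8' for the raw bytes of a file name."""
    # Python's cp437 codec maps every byte value, so this decode cannot fail.
    if is_reasonable_name(raw.decode("cp437")) or not is_valid_utf8(raw):
        return "cp437"
    return "utf-8"


if __name__ == "__main__":
    print(guess_encoding("grüße.txt".encode("utf-8")))  # utf-8
    print(guess_encoding("grüße.txt".encode("cp437")))  # cp437
```

The UTF-8 case works because the lead bytes of two-byte UTF-8 sequences fall in the part of CP437 that holds box-drawing characters, so the CP437 reading of such a name looks unreasonable.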
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Friday, 30 November 2007 13:01:18 UTC