Re: [Widgets] ASCII File names - request for comments from Thomas Roessler on 2007-11-24 (public-appformats@w3.org from November 2007)

From: Thomas Roessler <tlr@w3.org>
Date: Sat, 24 Nov 2007 10:53:13 +0100
To: Marcos Caceres <marcosscaceres@gmail.com>
Cc: "public-appformats@w3.org" <public-appformats@w3.org>, Arve Bersvendsen <arveb@opera.com>, ishida@w3.org
Message-ID: <20071124095313.GD20944@raktajino.does-not-exist.org>

On 2007-11-23 11:32:43 +1000, Marcos Caceres wrote:

> > On 2007-11-22 17:02:44 +1000, Marcos Caceres wrote:
> >
> > > The zip relative path will represent one of:
> > >
> > > * the name of a file (eg. index.html),
> > > * the name of a folder (eg. logs/),
> > > * the name of a folder within a hierarchy of folders (eg.
> > > styles/sounds/),
> > > * or the name of a file within a hierarchy of folders (eg.
> > > styles/images/background.png).
> >
> > Is there a BNF grammar for the zip relative path?
> 
> 
> I've added the following note:
> 
> "Note: The ABNF for the validation of US-ASCII-based zip relative paths is
> defined in the section titled Rules for validating US-ASCII paths, and the
> ABNF for the validation of UTF-8-based zip relative paths is defined in the
> section titled Rules for validating UTF-8 paths."

That doesn't help too much.  I, for one, tend to find BNF easier to
read than text like the one above...  So having the ABNF nearby when
the grammar is defined in prose is useful.
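As an illustration only (this is not the ABNF from the draft, which is not reproduced in this message), the four forms of zip relative path listed in the quoted text can be told apart with a few string checks. The function name and the rejection of empty paths, leading slashes and doubled slashes are assumptions made for this sketch.

```python
def classify_zip_relative_path(path: str) -> str:
    """Classify a path as one of the four forms listed in the quoted text.

    Hypothetical helper: the rejection rules below are assumptions,
    not requirements taken from the draft.
    """
    if not path or path.startswith("/") or "//" in path:
        raise ValueError(f"not a zip relative path: {path!r}")

    # A trailing slash marks a folder; anything else is a file.
    is_folder = path.endswith("/")

    # Drop the trailing slash (if any) before counting path segments.
    segments = path.rstrip("/").split("/")
    nested = len(segments) > 1

    if is_folder:
        return "folder in hierarchy" if nested else "folder"
    return "file in hierarchy" if nested else "file"
```

For example, `classify_zip_relative_path("logs/")` yields `"folder"`, while `"styles/images/background.png"` is classified as a file within a hierarchy of folders.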

> > > For each file name field in a Zip archive, the zip relative path
> > > must be encoded as either US-ASCII or UTF-8. Other encodings must
> > > not be used and if encountered a widget user agent must treat the
> > > zip archive as an invalid Zip archive.

> > > For interoperability, and where possible, encoding in US-ASCII is
> > > preferred.

> > Don't say things like that.  Either say that user agents MUST
> > support both kinds of relative paths, or pick one.

> The statement was supposed to be an authoring requirement. UTF-8
> support is not widely implemented in Zip programs so authors
> should be made aware of potential problems (particularly in
> Windows, which makes a mess of Unicode file and folder names when
> they are zipped). Anyway, I've rewritten the first paragraph to
> make the whole thing more clear and precise (hopefully:)):

It's clearer, indeed.

> "For the purpose of this specification, a zip relative path is the
> variable-length string value of the File Name Field of a Local File Header
> of a zip archive (see [Zip] for definitions and details of the file name
> field and local file header). For each file name field in a zip archive, the
> zip relative path must be encoded as either US-ASCII or UTF-8; other
> encodings must not be used in a zip archive. A widget user agent must
> support US-ASCII and UTF-8 for processing the File Name Fields in a Zip
> archive. When UTF-8 encoding has been used, general purpose bit 11 of the
> local file header must be set to 1, otherwise it must be set to 0 indicating
> US-ASCII has been used. When the value of general purpose bit 11 of the
> local file header is 1, widget user agents must check that the 0x0008
> Extended Language Encoding Extra Field [ZIP] has a value of null (which
> indicates that no other encoding apart from UTF-8 was used in the File Name
> Field). If the value of 0x0008 Extended Language Encoding Extra Field is not
> null, then the widget user agent must treat the Zip archive as being an invalid
> Zip archive."
> 
> 
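A minimal sketch of the bit 11 check described in the quoted paragraph, using Python's standard zipfile module. Two assumptions to flag: zipfile exposes the flags recorded in the central directory rather than in the local file header that the draft refers to, and the function name is made up for this example. The 0x0008 extra field check is left out.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11


def check_name_encodings(archive_bytes: bytes) -> dict:
    """Map each entry name to 'UTF-8' or 'US-ASCII'; raise on anything else.

    Assumption: flag_bits comes from the central directory entry,
    which normally mirrors the local file header.
    """
    result = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                # Bit 11 set: zipfile has already decoded the name as UTF-8.
                result[info.filename] = "UTF-8"
            elif info.filename.isascii():
                # Bit 11 clear and every character is below 0x80.
                result[info.filename] = "US-ASCII"
            else:
                # Bit 11 clear but the name holds bytes outside US-ASCII.
                raise ValueError(f"invalid Zip archive: {info.filename!r}")
    return result
```

When bit 11 is clear, zipfile decodes names as CP437, which agrees with US-ASCII for bytes below 0x80, so the `isascii()` test on the decoded name is equivalent to testing the raw bytes.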
> > > Irrespective of encoding, a zip relative path must be treated as
> > > case insensitive. As such, if a Zip archive contains two or more
> > > file names in the same folder that map to the same string
> > > following normalization on caseless matching as described in
> > > [Unicode Case Mapping], then the widget user agent must treat the
> > > zip archive as being an invalid Zip archive.

> > I seem to recall that case-insensitive comparisons outside the
> > US-ASCII range are a can of worms; at least the IDN community punted
> > on the issue.  Please consult with the i18n activity before
> > mandating UTF-8 case insensitivity.  You might be better off not
> > going down that route.

> Yes, you are right... we discussed this previously regarding
> comparing strings for versioning. I guess two options are (a) if
> the names of file x and file y are the same, then override file x
> with file y without warning... or (b) say nothing and leave it up
> to implementers. Any other options? FWIW, on Windows and Mac OS
> you cannot create a Zip archive with two files with the same name
> (the system prompts you if you want to replace the file).

Well, one other option would be to say "ok, UTF-8 is not universally
deployed here, these file names are seen by developers, not by
ordinary users, so let's say it's case-insensitive US-ASCII and
dodge the issue".

Not sure if that's acceptable, though.
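To make the "case-insensitive US-ASCII" option concrete, here is a rough sketch of a collision check that folds only the letters A-Z and leaves every other character alone. The function names are hypothetical, and the sketch compares whole paths, so folder names are folded as well; that is an assumption of this example rather than something the draft states.

```python
def ascii_lower(name: str) -> str:
    """Lower-case only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def find_case_collisions(names):
    """Return (first_seen, duplicate) pairs that match ignoring ASCII case."""
    seen = {}
    collisions = []
    for name in names:
        key = ascii_lower(name)
        if key in seen:
            # Same path apart from ASCII letter case: report both spellings.
            collisions.append((seen[key], name))
        else:
            seen[key] = name
    return collisions
```

A user agent following this approach would treat the archive as invalid whenever the returned list is non-empty, without needing any Unicode case-mapping tables.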

> 
> > > ==Rules for validating US-ASCII paths==
> >
> > Why only for US-ASCII paths, and not also for non-ASCII paths?
> 
> 
> UTF-8 paths will be defined elsewhere...
> 
> 
> > > Unless otherwise stated, any violation of the following conformance
> > > statements means that the Zip archive is non-conforming and a widget user
> > > agent must treat it as an invalid Zip archive.
> > >
> > > A US-ASCII relative path is the string derived from the zip relative path
> >
> > *snip*
> >
> > So how do I derive that string?  By removing all delimiters and
> > space characters, and then replacing everything else by the
> > character "z"? The result would conform to the grammar.
> 

> Sorry, I'm not sure I understand... You derive the string from reading the
> bytes held by the file name field of a local file header. The string must
> not be modified in any way, it can only be verified; I'll make sure that's
> more clear. However, do you think there should be error recovery on invalid
> folder and file names? The way I've tried to word it is that if an invalid
> path is encountered, then halt and inform the user that the zip archive is
> invalid.

I don't think there should be mandatory error recovery here, as that
only means more complexity. My point is that "string a is derived
from string b" seems to imply that there is some kind of
transformation going on (even if it is only \ to /); if that's the case, you
need to specify the transformation.  If it's really bit-wise
identity, then I wouldn't say that the string is derived, but just
say that it is the same string.

-- 
Thomas Roessler, W3C  <tlr@w3.org>
Received on Saturday, 24 November 2007 10:48:52 UTC