Re: [Widgets] ASCII File names - request for comments

On Nov 22, 2007 7:53 PM, Thomas Roessler <tlr@w3.org> wrote:

> (+cc Richard Ishida; Richard, despite the subject, there's an i18n
> angle to this.)
>
> On 2007-11-22 17:02:44 +1000, Marcos Caceres wrote:
>
> > The zip relative path will represent one of:
> >
> > * the name of a file (eg. index.html),
> > * the name of a folder (eg. logs/),
> > * the name of a folder within a hierarchy of folders (eg.
> > styles/sounds/),
> > * or the name of a file within a hierarchy of folders (eg.
> > styles/images/background.png).
>
> Is there a BNF grammar for the zip relative path?


I've added the following note:

"Note: The ABNF for the validation of US-ASCII-based zip relative paths is
defined in the section titled Rules for validating US-ASCII paths, and the
ABNF for the validation of UTF-8-based zip relative paths is defined in the
section titled Rules for validating UTF-8 paths."



>
> > For each file name field in a Zip archive, the zip relative path
> > must be encoded as either US-ASCII or UTF-8. Other encodings must
> > not be used and if encountered a widget user agent must treat the
> > zip archive as an invalid Zip archive.
>
> > For interoperability, and where possible, encoding in US-ASCII is
> > preferred.
>
> Don't say things like that.  Either say that user agents MUST
> support both kinds of relative paths, or pick one.
>

The statement was supposed to be an authoring requirement. UTF-8 support is
not widely implemented in Zip programs, so authors should be made aware of
potential problems (particularly on Windows, which makes a mess of Unicode
file and folder names when they are zipped). Anyway, I've rewritten the
first paragraph to make the whole thing clearer and more precise
(hopefully :)):

"For the purpose of this specification, a zip relative path is the
variable-length string value of the File Name Field of a Local File Header
of a zip archive (see [Zip] for definitions and details of the file name
field and local file header). For each file name field in a zip archive, the
zip relative path must be encoded as either US-ASCII or UTF-8; other
encodings must not be used in a zip archive. A widget user agent must
support US-ASCII and UTF-8 for processing the File Name Fields in a Zip
archive. When UTF-8 encoding has been used, general purpose bit 11 of the
local file header must be set to 1, otherwise it must be set to 0 indicating
US-ASCII has been used. When the value of general purpose bit 11 of the
local file header is 1, widget user agents must check that the 0x0008
Extended Language Encoding Extra Field [Zip] has a value of null (which
indicates that no other encoding apart from UTF-8 was used in the File Name
Field). If the value of 0x0008 Extended Language Encoding Extra Field is not
null, then the widget user agent must treat the Zip archive as being an invalid
Zip archive."
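
To make the bit 11 rule concrete, here is a minimal sketch in Python, using
the standard zipfile module. The names UTF8_FLAG and name_encodings are made
up for illustration. Two assumptions to be aware of: zipfile exposes the
flags recorded in the central directory rather than the local file header,
and the sketch follows the draft wording in treating "bit 11 clear" as
US-ASCII, although many Zip tools use a legacy code page in that case. It
does not look at the 0x0008 extra field.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11 (hypothetical constant name)


def name_encodings(archive):
    """Map each entry name to the encoding signalled by general purpose bit 11."""
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: "UTF-8" if info.flag_bits & UTF8_FLAG else "US-ASCII"
            for info in zf.infolist()
        }


# Build a small in-memory archive to exercise the check.
# zipfile sets bit 11 by itself when a name cannot be encoded as ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.html", "<html></html>")
    zf.writestr("styles/caf\u00e9.css", "body {}")
buf.seek(0)

encodings = name_encodings(buf)
for name, encoding in encodings.items():
    print(name, encoding)
```

A user agent following the draft would additionally have to read the local
file header itself and inspect the extra field, which this sketch leaves out.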


>
> > Irrespective of encoding, a zip relative path must be treated as
> > case insensitive. As such, if a Zip archive contains two or more
> > file names in the same folder that map to the same string
> > following normalization on caseless matching as described in
> > [Unicode Case Mapping], then the widget user agent must treat the
> > zip archive as being an invalid Zip archive.
>
> I seem to recall that case-insensitive comparisons outside the
> US-ASCII range are a can of worms; at least the IDN community punted
> on the issue.  Please consult with the i18n activity before
> mandating UTF-8 case insensitivity.  You might be better off not
> going down that route.


Yes, you are right... we discussed this previously regarding comparing
strings for versioning. I guess two options are (a) if the names of file x
and file y are the same, then override file x with file y without warning...
or (b) say nothing and leave it up to implementers. Any other options? FWIW,
on Windows and Mac OS you cannot create a Zip archive with two files with the
same name (the system asks whether you want to replace the file).
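
For reference, this is roughly what detecting such clashes could look like.
It is only a sketch: caseless_collisions is a hypothetical helper, and it
relies on Python's str.casefold(), which applies Unicode case folding without
any normalization step, so it is an approximation of caseless matching and
not a complete answer to the i18n concern. It also folds the whole path, so
folder names are compared caselessly too.

```python
from collections import defaultdict


def caseless_collisions(names):
    """Group zip relative paths that are identical after Unicode case folding."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one member are clashes.
    return [group for group in groups.values() if len(group) > 1]


collisions = caseless_collisions(
    ["index.html", "INDEX.html", "logs/", "styles/a.css"]
)
print(collisions)
```

Outside US-ASCII the folding can change the length of a name (for example,
"ß" folds to "ss"), which is part of why this area is tricky.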


> > ==Rules for validating US-ASCII paths==
>
> Why only for US-ASCII paths, and not also for non-ASCII paths?


UTF-8 paths will be defined elsewhere...


> > Unless otherwise stated, any violation of the following conformance
> > statements means that the Zip archive is non-conforming and a widget
> > user agent must treat it as an invalid Zip archive.
> >
> > A US-ASCII relative path is the string derived from the zip relative
> > path
>
> *snip*
>
> So how do I derive that string?  By removing all delimiters and
> space characters, and then replacing everything else by the
> character "z"? The result would conform to the grammar.


Sorry, I'm not sure I understand... You derive the string from reading the
bytes held by the file name field of a local file header. The string must
not be modified in any way; it can only be verified. I'll make sure that's
clearer. However, do you think there should be error recovery on invalid
folder and file names? The way I've tried to word it is that if an invalid
path is encountered, then halt and inform the user that the zip archive is
invalid.
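
To illustrate the "read the bytes, verify, never modify" idea, here is a
rough Python sketch. The function names raw_local_names and verify_names are
invented for this example. It assumes the fixed 30-byte local file header
layout from the Zip format description and uses zipfile only to locate each
header. It checks that the bytes decode strictly in the encoding signalled
by bit 11; it does not implement the ABNF rules, which are not reproduced in
this message.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def raw_local_names(data):
    """Yield (flags, raw_name_bytes) read from each entry's local file header."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        offsets = [info.header_offset for info in zf.infolist()]
    for offset in offsets:
        fields = LOCAL_HEADER.unpack_from(data, offset)
        signature, flags, name_len = fields[0], fields[2], fields[9]
        if signature != b"PK\x03\x04":
            raise ValueError("invalid Zip archive: bad local file header")
        start = offset + LOCAL_HEADER.size
        yield flags, data[start:start + name_len]


def verify_names(data):
    """Decode every file name strictly; halt on the first invalid one."""
    names = []
    for flags, raw in raw_local_names(data):
        encoding = "utf-8" if flags & 0x0800 else "ascii"
        try:
            names.append(raw.decode(encoding))  # strict: no replacement, no repair
        except UnicodeDecodeError:
            raise ValueError("invalid Zip archive: file name is not valid " + encoding)
    return names


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.html", "<html></html>")
    zf.writestr("styles/images/background.png", b"")
data = buf.getvalue()
names = verify_names(data)
print(names)
```

Raising an error here corresponds to the "halt and inform the user" behaviour
described above; an error-recovery variant would need to decide what to do
with the offending entry instead.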


-- 
Marcos Caceres
http://datadriven.com.au

Received on Friday, 23 November 2007 01:33:02 UTC