[widgets] Unicode Zip Paths (fully decomposed canonical form?)

Hi, I'm trying to put the final touches on the zip section of the
widget packaging spec [1] before we go to LC by the 10th and I've run
into an i18n problem related to character encodings. I'm wondering if
anyone would be kind enough to give me some guidance as to what is
going on, encoding-wise, in MacOS with regard to the encoding of
file names in zip files.

When I create a zip file with one file entry called "ñ", inside the
zip file, the file name gets decomposed to the following (hex) byte
sequence:

ñ -> 0x6E 0xCC

0x6E is the letter "n" in Unicode, so there is obviously some
decomposition going on there. But U+00CC in Unicode is Ì (LATIN
CAPITAL LETTER I WITH GRAVE), so I'm not sure what encoding the zip
file is using.
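
For comparison, here is a small Python sketch of what "ñ" would look
like if it were canonically decomposed (NFD) and then encoded as UTF-8.
That the zip tool does exactly this is only an assumption on my part;
the sketch just shows that the 0xCC I'm seeing could be the lead byte
of a multi-byte UTF-8 sequence rather than a character of its own.

```python
import unicodedata

# "ñ" as a single precomposed code point (U+00F1).
name = "\u00f1"

# Canonical decomposition (NFD) splits it into a base letter plus a
# combining mark.
decomposed = unicodedata.normalize("NFD", name)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

# UTF-8 bytes of the decomposed form, as upper-case hex.
utf8_hex = " ".join(f"{b:02X}" for b in decomposed.encode("utf-8"))

print(code_points)  # ['U+006E', 'U+0303']
print(utf8_hex)     # 6E CC 83
```

If that is what is happening, the combining tilde (U+0303) accounts
for the 0xCC, and a third byte, 0x83, should follow it in the file.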

The reason I ask is that I'm not sure what to put into the widget
spec with regard to recommending the use of canonical decomposition for
Unicode file names. Or even whether that is a good idea!?

Should I put the following into the spec?:
"It is recommended that the file name field be encoded using [UTF-8]
in fully decomposed canonical form."

OR just:
"It is recommended that the file name field be encoded using [UTF-8]."

This seems important when I go from MacOS to any other platform, as
file names get all mangled when the files are extracted there. We
obviously don't want that to happen, so widget engines need to be
prepared to deal with these encoding issues.

I looked at the Zip spec [2], but I don't see any real guidance with
regards to this. However, for those who know more about encoding, it
would be helpful if you could also take a look at the Zip spec.

Any help would be greatly appreciated,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/#zip-relative
[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
-- 
Marcos Caceres
http://datadriven.com.au

Received on Saturday, 6 December 2008 00:32:02 UTC