Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)

At 09:31 08/12/06, Marcos Caceres wrote:
>Hi, I'm trying to put the final touches on the zip section of the widget 
>packaging spec [1] before we go to LC by the 10th and I've run into an i18n 
>problem related to character encodings. I'm wondering if anyone would be
>kind enough to give me some guidance as to what is going on, encoding-wise,
>within MacOS with regard to the encoding of file names in Zip
>files?
>
>When I create a zip file with one file entry called "ñ", inside the
>zip file, the file name gets decomposed to the following (hex) byte
>sequence:
>
>ñ -> 0x6E 0xCC

My mailer has problems with UTF-8, but my guess is that you are
using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
so one explanation is that some data was dropped (and one way to
explain that would be that the implementation was confused about
characters vs. bytes).
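
To make the byte sequence concrete, here is a minimal Python sketch. It assumes the character really is n-with-tilde (U+00F1) and uses the standard unicodedata module. The truncation step at the end is only a hypothetical illustration of a character-count/byte-count mix-up, not a known description of what the Mac tool does.

```python
import unicodedata

name = "\u00f1"  # n-with-tilde, precomposed (assumed to be the character meant)

nfd = unicodedata.normalize("NFD", name)  # 'n' followed by U+0303 COMBINING TILDE
nfd_bytes = nfd.encode("utf-8")
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")

print(nfd_bytes.hex(" "))  # 6e cc 83
print(nfc_bytes.hex(" "))  # c3 b1

# Hypothetical bug: the number of characters (2) is used as the number of bytes,
# so the last byte of the 3-byte sequence is dropped.
truncated = nfd_bytes[:len(nfd)]
print(truncated.hex(" "))  # 6e cc
```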

>6E is the letter "n" in Unicode, so there is obviously some
>decomposition going on there. But 0xCC in Unicode maps to
>Ì (LATIN CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>zip file is using.

A single 0xCC byte doesn't map to anything in any Unicode encoding form.
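
A small Python sketch of the distinction between the byte 0xCC and the code point U+00CC, shown for UTF-8 only. The variable names are just for illustration.

```python
# In UTF-8, byte 0xCC is a lead byte and must be followed by a continuation byte.
try:
    b"\xcc".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(lone_byte_ok)  # False

# The code point U+00CC (LATIN CAPITAL LETTER I WITH GRAVE) is a different thing;
# in UTF-8 it takes two bytes.
i_grave_utf8 = "\u00cc".encode("utf-8")
print(i_grave_utf8.hex(" "))  # c3 8c

# Byte 0xCC maps to U+00CC only in a legacy single-byte encoding such as ISO-8859-1.
latin1_char = b"\xcc".decode("latin-1")
print(latin1_char == "\u00cc")  # True
```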

>The reason I ask is because I'm not sure what to put into the widget
>spec in regards to recommending the use of canonical decomposition for
>unicode file names. Or even if that is a good idea!?
>
>Should I put the following into the spec?: "It is recommended that 
>the file name field be encoded using [UTF-8] in fully decomposed canonical 
>form."

No. Although the Mac file system(s?) use (a variant of) NFD
for file names, other operating systems (Windows, Linux, ...) don't.
If you want to specify a normalization form, NFC is closer to what
the majority does.

>OR just:
>"It is recommended that the file name field be encoded using [UTF-8]."

Realistically, that's about what you can ask for. And that should
be enough if the main concern is to match file names from the same
source. If you need to ensure that file names from different
sources can be matched, then prescribing NFC is the best thing
to do, but you may have difficulty getting your developers
to follow your spec.
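
As a rough sketch of what matching names from different sources could look like on the consuming side (Python, standard unicodedata module; `same_name` is a hypothetical helper, not something from the widget spec):

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

mac_name = "n\u0303.txt"    # decomposed, as a Mac tool might store it
other_name = "\u00f1.txt"   # precomposed, as most other platforms would

print(mac_name == other_name)           # False
print(same_name(mac_name, other_name))  # True
```

Normalizing at comparison time means the packager does not have to get normalization right, which may be easier than requiring NFC from every zip tool.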

>This seems important for when I go from MacOS to any other platform as
>file names get all mangled when files are extracted on any other
>platform. We obviously don't want that to happen so widget engines
>need to be prepared to deal with these encoding issues.
>
>I looked at the Zip spec [2], but I don't see any real guidance with
>regards to this. However, for those who know more about encoding, it
>would be helpful if you could also take a look at the Zip spec.

It looks to me as if you should say that general purpose bit 11 should be set and
UTF-8 should be used for file name and comment, unless there are
a significant number of zip toolkits that don't allow that.
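
For what it's worth, here is a sketch of how one toolkit handles this, using Python's standard zipfile module as the example. As far as I know, CPython's zipfile sets bit 11 by itself when a name is not pure ASCII, but treat that as an observation about one implementation, not as something the Zip spec guarantees. `UTF8_FLAG` is just a local name for the mask.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # mask for general purpose bit 11

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("\u00f1.txt", b"example")  # non-ASCII name (NFC form)
    zf.writestr("plain.txt", b"example")   # ASCII-only name

raw = buf.getvalue()

# Read the archive back and look at the flag bits of each entry.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}

print(flags["\u00f1.txt"])       # True: bit 11 set, name stored as UTF-8
print(b"\xc3\xb1.txt" in raw)    # True: the UTF-8 bytes of the name appear in the archive
```

Note that zipfile stores the name as given; it does not apply NFC or NFD on its own, so normalization remains the packager's job.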

The spec contains the following:

>>>>
The 0x0008 Extra Field storage may be used with either setting for general 
purpose bit 11.  Examples of the intended usage for this field is to store 
whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.
>>>>

modified-UTF-8 means that surrogates are directly converted into
3-byte UTF-8(-like) sequences instead of converting surrogate pairs
into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
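
To show the difference for a character outside the BMP, a small Python sketch follows. The helper `surrogates_as_3_byte_sequences` is hypothetical and covers only the surrogate behaviour described above, not any other detail of Java's modified UTF-8.

```python
def surrogates_as_3_byte_sequences(text: str) -> bytes:
    """Encode each UTF-16 code unit on its own, UTF-8 style (hypothetical helper)."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' lets Python encode a lone surrogate as a 3-byte sequence
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U0001F600"  # a character outside the BMP (surrogate pair D83D DE00 in UTF-16)

standard = ch.encode("utf-8")
modified = surrogates_as_3_byte_sequences(ch)

print(standard.hex(" "))  # f0 9f 98 80
print(modified.hex(" "))  # ed a0 bd ed b8 80
```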

The specification of the 0x0008 extra field... is extremely vague,
not useful at all.

Regards,    Martin.

>Any help would be greatly appreciated,
>Marcos
>
>[1] http://dev.w3.org/2006/waf/widgets/#zip-relative
>[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>--
>Marcos Caceres http://datadriven.com.au


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Sunday, 7 December 2008 07:57:47 UTC