Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?) from Marcos Caceres on 2008-12-07 (public-i18n-core@w3.org from October to December 2008)

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Sun, 7 Dec 2008 19:18:19 +0000
To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
Cc: public-webapps <public-webapps@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <b21a10670812071118i1fad0705rdc12fd72f234f327@mail.gmail.com>

Hi Martin,

On Sun, Dec 7, 2008 at 7:56 AM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 09:31 08/12/06, Marcos Caceres wrote:
>>Hi, I'm trying to put the final touches on the zip section of the widget
>>packaging spec [1] before we go to LC by the 10th and I've run into an i18n
>>problem related to character encodings. I'm wondering if anyone would be
>>kind enough to give me some guidance as to what is going on, encoding-wise,
>>within MacOS with regard to the encoding of file names in Zip
>>files?
>>
>>When I create a zip file with one file entry called "ñ", inside the
>>zip file, the file name gets decomposed to the following (hex) byte
>>sequence:
>>
>>ñ -> 0x6E 0xCC
>
> My mailer has problems with UTF-8, but my guess is that you are
> using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
> so one explanation is that some data was dropped (and one way to
> explain that would be that the implementation was confused about
> characters vs. bytes).

Apologies, I made a mistake. I had another look and no data was
dropped by Apple's zip implementation. The byte sequence for
n-with-tilde is as you said:

0x6E 0xCC 0x83
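
A minimal Python sketch of the byte sequences discussed above, assuming the
file name is the single character n-with-tilde. The helper name hex_bytes is
made up for illustration; unicodedata is Python's standard library module.

```python
import unicodedata

composed = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE (precomposed, NFC)
decomposed = unicodedata.normalize("NFD", composed)  # "n" + U+0303 COMBINING TILDE

nfc_bytes = composed.encode("utf-8")    # two bytes: 0xC3 0xB1
nfd_bytes = decomposed.encode("utf-8")  # three bytes: 0x6E 0xCC 0x83


def hex_bytes(data: bytes) -> str:
    """Format bytes the way they are written in this thread (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in data)


print(hex_bytes(nfc_bytes))
print(hex_bytes(nfd_bytes))

# The sequence I first reported (0x6E 0xCC) is the NFD form with its last
# byte missing, and it is not valid UTF-8 on its own.
truncated = nfd_bytes[:2]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False
```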

However, I was reading [1] and it turns out that MacOS might actually
be using their own decomposition that resembles FCD.

>>6E is the letter "n" in Unicode, so there is obviously some
>>decomposition going on there. But 0xCC in Unicode maps to
>>Ì (LATIN CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>>zip file is using.
>
> A single 0xCC byte doesn't map to anything in any Unicode encoding form.
>
>>The reason I ask is because I'm not sure what to put into the widget
>>spec in regards to recommending the use of canonical decomposition for
>>Unicode file names. Or even if that is a good idea!?
>>
>>Should I put the following into the spec?: "It is recommended that
>>the file name field be encoded using [UTF-8] in fully decomposed canonical
>>form."
>
> No. Although the Mac file system(s?) use (a variant of) NFD
> for file names, other operating systems (Windows, Linux,...) don't.
> If you want to specify a normalization form, NFC is closer to what
> the majority does.
>

>>OR just:
>>"It is recommended that the file name field be encoded using [UTF-8]."
>
> Realistically, that's about what you can ask for. And that should
> be enough if the main concern is to match file names from the same
> source. If you need to assure that file names from different
> sources can be matched, then prescribing NFC is the best thing
> to do, but you may have difficulty getting your developers
> to follow your spec.
>

Unfortunately, the concern is matching file names from different
sources :( If this is a lost cause, then I will stick with "It is
recommended that the file name field be encoded using [UTF-8]."
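
For what it's worth, here is a rough sketch of what matching across sources
could look like if a widget engine normalized both names to NFC before
comparing them. This is only an illustration, not something the spec
currently requires; the function name same_file_name and the example file
names are made up.

```python
import unicodedata


def same_file_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The same visible name as it might arrive from two different sources.
from_mac = "man\u0303ana.html"    # decomposed: "n" followed by U+0303
from_other = "ma\u00f1ana.html"   # precomposed: U+00F1

print(from_mac == from_other)                # a plain comparison does not match
print(same_file_name(from_mac, from_other))  # the normalized comparison does
```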

>>This seems important for when I go from MacOS to any other platform as
>>file names get all mangled when files are extracted on any other
>>platform. We obviously don't want that to happen so widget engines
>>need to be prepared to deal with these encoding issues.
>>
>>I looked at the Zip spec [2], but I don't see any real guidance with
>>regards to this. However, for those who know more about encoding, it
>>would be helpful if you could also take a look at the Zip spec.
>
> It looks to me that you should say that bit 11 should be set and
> UTF-8 should be used for file name and comment, unless there are
> a significant number of zip toolkits that don't allow that.
>

I have this in the spec already, but I've been unable to determine if
any implementation actually sets general purpose bit 11.
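
One way to check a given archive is to read the general purpose flags back
out of it. The sketch below uses Python's standard zipfile module, which
exposes the flags as ZipInfo.flag_bits; the constant UTF8_FLAG and the helper
utf8_flags are names I made up. As far as I can tell, that module sets the
bit itself when a name cannot be written as ASCII, which says nothing about
what other zip tools do.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11 (language encoding flag)


def utf8_flags(zip_bytes: bytes) -> dict:
    """Map each entry name to whether bit 11 is set (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in archive.infolist()
        }


# Build a small archive in memory so the sketch needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", b"ascii name")
    archive.writestr("ma\u00f1ana.txt", b"non-ascii name")

flags = utf8_flags(buffer.getvalue())
for name, is_set in flags.items():
    print(name, is_set)
```

To inspect a zip produced by another tool, pass its bytes to utf8_flags
instead of the in-memory archive.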

> The spec contains the following:
>
>>>>>
> The 0x0008 Extra Field storage may be used with either setting for general
> purpose bit 11.  Examples of the intended usage for this field is to store
> whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.
>>>>>
>
> modified-UTF-8 means that surrogates are directly converted into
> 3-byte UTF-8(-like) sequences instead of converting surrogate pairs
> into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
>
> The specification of the 0x0008 extra field... is extremely vague,
> not useful at all.

Yeah :(
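
To make the modified-UTF-8 point concrete, here is a small Python sketch
comparing the two encodings of one character outside the BMP. It only
imitates the surrogate behaviour Martin describes by encoding each surrogate
separately with Python's "surrogatepass" error handler; it is not Java's
actual encoder, and it ignores the separate special case for U+0000. The
helper name surrogate_pair and the choice of U+1F600 are mine.

```python
def surrogate_pair(code_point: int) -> str:
    """Split a supplementary code point into its UTF-16 surrogates (hypothetical helper)."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return chr(high) + chr(low)


code_point = 0x1F600  # any code point above U+FFFF would do

# Standard UTF-8: one 4-byte sequence for the whole code point.
standard = chr(code_point).encode("utf-8")

# Surrogate-by-surrogate encoding: two 3-byte sequences, 6 bytes in total.
surrogate_style = surrogate_pair(code_point).encode("utf-8", "surrogatepass")

print(standard.hex(" "))
print(surrogate_style.hex(" "))
```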

Thank you for your help!

Kind regards,
Marcos


-- 
Marcos Caceres
http://datadriven.com.au

Received on Sunday, 7 December 2008 19:19:04 UTC