- From: Marcos Caceres <marcosscaceres@gmail.com>
- Date: Sun, 7 Dec 2008 19:18:19 +0000
- To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
- Cc: public-webapps <public-webapps@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Hi Martin,

On Sun, Dec 7, 2008 at 7:56 AM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 09:31 08/12/06, Marcos Caceres wrote:
>>Hi, I'm trying to put the final touches on the zip section of the widget
>>packaging spec [1] before we go to LC by the 10th and I've run into an i18n
>>problem related to character encodings. I'm wondering if anyone would be
>>kind enough to give me some guidance as to what is going on, encoding-wise,
>>in MacOS with regard to the encoding of file names in Zip files?
>>
>>When I create a zip file with one file entry called "ñ", inside the
>>zip file, the file name gets decomposed to the following (hex) byte
>>sequence:
>>
>>ñ -> 0x6E 0xCC
>
> My mailer has problems with UTF-8, but my guess is that you are
> using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
> so one explanation is that some data was dropped (and one way to
> explain that would be that the implementation was confused about
> characters vs. bytes).

Apologies, I made a mistake. I had another look and no data was dropped
by Apple's zip implementation. The byte sequence for n-with-tilde is as
you said:

0x6E 0xCC 0x83

However, I was reading [1] and it turns out that MacOS might actually be
using its own decomposition that resembles FCD.

>>6E is the letter "n" in Unicode, so there is obviously some
>>decomposition going on there. But 0xCC in Unicode maps to
>>Ì (LATIN CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>>zip file is using.
>
> A single 0xCC byte doesn't map to anything in any Unicode encoding form.
>
>>The reason I ask is because I'm not sure what to put into the widget
>>spec in regards to recommending the use of canonical decomposition for
>>Unicode file names. Or even if that is a good idea!?
>>
>>Should I put the following into the spec?: "It is recommended that
>>the file name field be encoded using [UTF-8] in fully decomposed canonical
>>form."
>
> No. Although the Mac file system(s?) use (a variant of) NFD
> for file names, other operating systems (Windows, Linux,...) don't.
> If you want to specify a normalization form, NFC is closer to what
> the majority does.
>
>>OR just:
>>"It is recommended that the file name field be encoded using [UTF-8]."
>
> Realistically, that's about what you can ask for. And that should
> be enough if the main concern is to match file names from the same
> source. If you need to ensure that file names from different
> sources can be matched, then prescribing NFC is the best thing
> to do, but you may have difficulty getting your developers
> to follow your spec.
>

Unfortunately, the concern is matching file names from different
sources :( If this is a lost cause, then I will stick with "It is
recommended that the file name field be encoded using [UTF-8]."

>>This seems important for when I go from MacOS to any other platform as
>>file names get all mangled when files are extracted on any other
>>platform. We obviously don't want that to happen so widget engines
>>need to be prepared to deal with these encoding issues.
>>
>>I looked at the Zip spec [2], but I don't see any real guidance with
>>regards to this. However, for those who know more about encoding, it
>>would be helpful if you could also take a look at the Zip spec.
>
> It looks to me that you should say that bit 11 should be set and
> UTF-8 should be used for file name and comment, unless there are
> a significant number of zip toolkits that don't allow that.
>

I have this in the spec already, but I've been unable to determine if
any implementation actually sets general purpose bit 11.

> The spec contains the following:
>
> >>>>>
> The 0x0008 Extra Field storage may be used with either setting for general
> purpose bit 11. Examples of the intended usage for this field is to store
> whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.
> >>>>>
>
> modified-UTF-8 means that surrogates are directly converted into
> 3-byte UTF-8(-like) sequences instead of converting surrogate pairs
> into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
>
> The specification of the 0x0008 extra field... is extremely vague,
> not useful at all.

Yeah :(

Thank you for your help!

Kind regards,
Marcos

-- 
Marcos Caceres
http://datadriven.com.au
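
P.S. To make the byte sequences and the bit 11 question above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not text for the widget spec: it assumes Python's standard unicodedata and zipfile modules, and the helper names utf8_hex and same_name, as well as the entry name, are made up for this example. It shows the UTF-8 bytes of n-with-tilde in NFC and NFD, one possible way to match names from different sources by normalizing both sides to NFC, and how one might check whether a given zip writer (here, Python's own) sets general purpose bit 11.

```python
import io
import unicodedata
import zipfile

# General purpose bit 11 of the Zip header (UTF-8 file name/comment flag).
UTF8_FLAG = 0x800


def utf8_hex(text):
    """Return the UTF-8 bytes of text as a string like '0x6E 0xCC 0x83'."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


def same_name(a, b):
    """Hypothetical matcher: compare two file names after normalizing to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# n-with-tilde, precomposed (NFC) and decomposed (NFD).
nfc = unicodedata.normalize("NFC", "n\u0303")
nfd = unicodedata.normalize("NFD", "\u00f1")

print(utf8_hex(nfc))  # 0xC3 0xB1
print(utf8_hex(nfd))  # 0x6E 0xCC 0x83

# Byte-wise the two names differ, but they match after normalization.
print(nfc == nfd)           # False
print(same_name(nfc, nfd))  # True

# Write an in-memory zip whose entry name is in NFD (as a Mac might produce),
# then read it back and inspect the flags and the stored name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(nfd + ".txt", b"hello")

with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    bit11_set = bool(info.flag_bits & UTF8_FLAG)
    stored_name = info.filename

print(bit11_set)                            # whether this writer set bit 11
print(same_name(stored_name, nfc + ".txt"))  # True
```

Note that this only tells us what Python's zipfile does for a non-ASCII name; other zip implementations would need to be checked separately by inspecting the same flag on archives they produce.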
Received on Sunday, 7 December 2008 19:19:08 UTC