- From: Marcos Caceres <marcosscaceres@gmail.com>
- Date: Fri, 30 Nov 2007 20:51:27 +1000
- To: "Richard Ishida" <ishida@w3.org>, www-international@w3.org
- Cc: "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, "public-appformats@w3.org" <public-appformats@w3.org>, "Thomas Roessler" <tlr@w3.org>
Hi All,

The WAF Working Group is seeking assistance with an i18n problem we are having with our Widgets 1.0 specification [1]. The issue concerns determining the encoding of file names within Zip archives. Here is an overview of the problem:

The Zip specification [2] mandates that file names in Zip files be encoded using IBM Code Page 437 (cp437), unless indicated otherwise by something called General Purpose Bit 11 (GPB11). According to Zip, when GPB11 is "0", assume cp437; and when it is "1", assume UTF-8. In practice, however, some Zip implementations (such as Mac OS's) do not set GPB11 and yet encode file names in UTF-8. So, one problem is that if you adhere to the Zip spec by assuming cp437, and try to extract files from a Zip archive created on a Mac whose file names contain characters beyond the ASCII range, you get garbled text (as, obviously, cp437 does not map to UTF-8).

The main problem is that there is no way, AFAIK, to determine the encoding of a file name inside a Zip archive when you hit any bytes that are beyond the ASCII range (it could be either cp437 or UTF-8).

So, for our spec [3], we currently mandate that:

1. If GPB11 is 0, then you can only use US-ASCII (a-zA-Z, decimals, and some other "safe" symbols). If you encounter anything beyond 0x7F, then the Zip archive is in error and processing must stop (as the file names will be garbled).

2. If GPB11 is 1, then assume UTF-8.

Obviously, 1 is a bit harsh because it stops processing of the Zip archive as soon as it encounters anything beyond 0x7F (thus, any non-English speaker using characters above 0xFF, or from Windows-1252, will be affected... which is bad). However, it should be quite legal for Zip implementations to rely on cp437 when encoding, for example, the word "Mañana". As you can imagine, once file names contain anything beyond 0xFF and cp437 is assumed, you really start getting significant problems.
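To make rules 1 and 2 above concrete, here is a minimal Python sketch of how a widget engine might apply them to the raw file-name bytes of a Zip entry. It is only an illustration, not text from the spec: the names `GPB11_MASK`, `ZipNameError` and `decode_file_name` are made up for this example, and it assumes the caller has already read the raw name bytes and the general purpose bit flag from the entry's header.

```python
# Hypothetical sketch of the rule currently in the Widgets draft.
# Assumption: `raw` holds the undecoded file-name bytes and `flags` holds
# the entry's general purpose bit flag field.

GPB11_MASK = 0x0800  # bit 11 of the general purpose bit flag


class ZipNameError(ValueError):
    """Raised when a file name is not acceptable under rule 1."""


def decode_file_name(raw: bytes, flags: int) -> str:
    if flags & GPB11_MASK:
        # Rule 2: GPB11 is 1, so assume UTF-8.
        # Invalid UTF-8 raises UnicodeDecodeError here.
        return raw.decode("utf-8")
    # Rule 1: GPB11 is 0, so only US-ASCII is allowed. Any byte beyond
    # 0x7F could be cp437 or unflagged UTF-8, so processing stops.
    # (The draft's narrower set of "safe" symbols is not checked here.)
    if any(byte > 0x7F for byte in raw):
        raise ZipNameError("byte beyond 0x7F in file name while GPB11 is 0")
    return raw.decode("ascii")


# The same name yields different bytes depending on the encoder, and with
# GPB11 unset there is nothing in the archive to say which one was used.
name_as_cp437 = "Mañana.txt".encode("cp437")
name_as_utf8 = "Mañana.txt".encode("utf-8")
```

With GPB11 unset, both `name_as_cp437` and `name_as_utf8` are rejected by this sketch, which is exactly the harshness described above; with GPB11 set, only the UTF-8 bytes decode.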
I guess the question is: is there any way, without changing current Zip implementations or checking GPB11, to determine whether a file name is encoded using UTF-8 or cp437? If not, is it too restrictive to then mandate that only US-ASCII be used when GPB11 is 0?

I should add that the use of GPB11 in Zip was only added very recently (as of Zip version 6.3, 09/29/2006), so it is not widely implemented and probably won't be for a while.

Here is a simple example of the problem:

I have "Mañana.txt". I zip it up using Windows Compressed Folders and extract it on a UTF-8 file system (Mac OS). I get: Ma§ana.txt

I zip "Mañana.txt" on Mac OS and unzip it on Windows. I get: ManÌana.txt

Kind regards,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/
[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
[3] http://dev.w3.org/2006/waf/widgets/#zip-archive

--
Marcos Caceres
http://datadriven.com.au
Received on Friday, 30 November 2007 10:51:48 UTC