Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments] from Marcos Caceres on 2007-11-30 (public-i18n-core@w3.org from October to December 2007)

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Fri, 30 Nov 2007 20:51:27 +1000
To: "Richard Ishida" <ishida@w3.org>, www-international@w3.org
Cc: "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, "public-appformats@w3.org" <public-appformats@w3.org>, "Thomas Roessler" <tlr@w3.org>
Message-ID: <b21a10670711300251v78ec50d2j1ba364398f0894da@mail.gmail.com>

Hi All,
The WAF Working group is seeking assistance with an i18n problem we
are having with our Widgets 1.0 specification [1]. The issue we are
having is to do with determining the encoding of file names within Zip
archives. Here is an overview of the problem:

The Zip specification [2] mandates that file names in Zip files be
encoded using IBM Code Page 437 (cp437), unless indicated otherwise
by bit 11 of the general purpose bit flag (GPB11). According to
Zip, when GPB11 is "0", assume cp437; and when it is "1", assume
UTF-8. In practice, however, some Zip implementations (such as MacOS's)
do not set GPB11 yet still encode file names in UTF-8. So, one
problem is that if you adhere to the Zip spec by assuming cp437, and
try to extract a Zip archive created on a Mac whose file names contain
characters beyond the ASCII range, you get garbled text (because the
same bytes mean different characters in cp437 and in UTF-8).
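
To make the GPB11 check concrete, here is a minimal sketch in Python
using the standard zipfile module. The helper names (name_encodings,
build_sample) are made up for illustration. It only reports what the
flag declares for each entry, which, as described above, is not
necessarily how the name was really encoded.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # bit 11 of the general purpose bit flag (GPB11)


def name_encodings(zip_bytes):
    """Map each entry name to the encoding that its GPB11 value declares."""
    declared = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                declared[info.filename] = "utf-8"
            else:
                declared[info.filename] = "cp437"
    return declared


def build_sample():
    """Build a small in-memory archive (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("plain.txt", b"a")    # ASCII name: Python leaves bit 11 clear
        zf.writestr("Mañana.txt", b"b")   # non-ASCII name: Python sets bit 11
    return buf.getvalue()
```

Python's own writer sets the flag whenever a name is not pure ASCII, so
this sample behaves as the Zip spec intends; an archive from a tool
that writes UTF-8 without setting the flag would be reported as cp437.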

The main problem is that there is no way, AFAIK, to determine the
encoding of a file name inside a Zip archive when you hit any bytes
that are beyond the ASCII range (it could be either cp437 or UTF-8).

So, for our spec [3], we currently mandate that:

1. if GPB11 is 0, then you can only use US-ASCII (a-zA-Z, digits,
and some other "safe" symbols). And, if you encounter anything beyond
0x7F, then the Zip archive is in error and processing must stop (as
the file names will be garbled).

2. if GPB11 is 1, then assume UTF-8.
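
As a rough sketch of the two rules above (in Python; the function and
exception names are invented for illustration, not taken from the
spec), a decoder for a raw file name could look like the following. It
only checks the 0x7F boundary and does not attempt to enforce the
narrower set of "safe" symbols.

```python
GPB11 = 0x0800  # bit 11 of the general purpose bit flag


class InvalidFileName(ValueError):
    """Raised when a file name breaks rule 1 (hypothetical error type)."""


def decode_file_name(raw: bytes, flags: int) -> str:
    """Decode a raw Zip file name according to rules 1 and 2."""
    if flags & GPB11:
        # Rule 2: GPB11 is 1, so assume UTF-8.
        # Malformed input raises UnicodeDecodeError.
        return raw.decode("utf-8")
    # Rule 1: GPB11 is 0, so only US-ASCII is allowed.
    if any(byte > 0x7F for byte in raw):
        raise InvalidFileName("byte beyond 0x7F while GPB11 is 0")
    return raw.decode("ascii")
```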

Obviously, rule 1 is a bit harsh because it stops processing of the Zip
archive as soon as it encounters anything beyond 0x7F (thus, any
non-English speaker using any chars above 0xFF or Windows-1252 will be
affected... which is bad). However, it should be quite legal for Zip
implementations to rely on cp437 when encoding, for example, the word
"Mañana".... As you can imagine, if you encode anything beyond 0xFF and
assume cp437, then you really start getting significant problems.

I guess the question is: is there any way, without changing current
Zip implementations or checking the GPB11, to determine if a file name
is encoded using UTF-8 or cp437? If not, is it too restrictive to then
mandate that only US-ASCII be used when GPB11 is 0?
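
For illustration only, one possible heuristic (an assumption on my
part, not something the Zip spec defines) is to test whether the bytes
form well-formed UTF-8 and fall back to cp437 otherwise. It is a guess,
not a determination: some cp437 byte sequences are also valid UTF-8,
so it can be wrong. The function name is hypothetical.

```python
def guess_name_encoding(raw: bytes) -> str:
    """Guess the encoding of a raw Zip file name (heuristic, can be wrong)."""
    if all(byte < 0x80 for byte in raw):
        # Pure ASCII reads the same under cp437 and UTF-8.
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so fall back to the Zip default.
        return "cp437"
    return "utf-8"
```

With this sketch, the cp437 bytes for "Mañana.txt" are reported as
cp437 (0xA4 on its own is not valid UTF-8), and the UTF-8 bytes are
reported as UTF-8.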

I should add that the use of GPB11 in Zip was only added very recently
(as of Zip version 6.3, 09/29/2006), so it is not widely implemented
and probably won't be for a while.

Here is a simple example of the problem: I have "Mañana.txt".... I Zip
it up using Windows Compressed Folders, I extract it on a UTF-8 file
system (MacOS)....
I get:
Ma§ana.txt

I Zip "Mañana.txt" in MacOS, I unzip in Windows, I get:
ManÌƒana.txt
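
Both garbled names can be reproduced with a short Python sketch. The
code pages are my assumptions, chosen because they match the output
above: cp437 bytes read back as MacRoman in the first case, and
decomposed (NFD) UTF-8 bytes read back as Windows-1252 in the second.
The function names are made up for illustration.

```python
import unicodedata


def windows_to_mac(name: str) -> str:
    # Assumption: the name is stored as cp437 bytes and the
    # extracting side interprets those bytes as MacRoman.
    return name.encode("cp437").decode("mac_roman")


def mac_to_windows(name: str) -> str:
    # Assumption: the name is stored as decomposed (NFD) UTF-8,
    # i.e. "n" followed by a combining tilde, and the extracting
    # side interprets those bytes as Windows-1252.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("utf-8").decode("cp1252")


print(windows_to_mac("Mañana.txt"))  # Ma§ana.txt
print(mac_to_windows("Mañana.txt"))  # ManÌƒana.txt
```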

Kind regards,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/
[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
[3] http://dev.w3.org/2006/waf/widgets/#zip-archive

-- 
Marcos Caceres
http://datadriven.com.au

Received on Friday, 30 November 2007 10:51:38 UTC