RE: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments] from Tex Texin on 2007-11-30 (www-international@w3.org from October to December 2007)

From: Tex Texin <tex@yahoo-inc.com>
Date: Fri, 30 Nov 2007 03:28:46 -0800
To: "Marcos Caceres" <marcosscaceres@gmail.com>, "Richard Ishida" <ishida@w3.org>, <www-international@w3.org>
Cc: "Arthur Barstow" <art.barstow@nokia.com>, <public-i18n-core@w3.org>, <public-appformats@w3.org>, "Thomas Roessler" <tlr@w3.org>
Message-ID: <012AB2B223CB3F4BB846962876F47217712DD6@SNV-EXVS08.ds.corp.yahoo.com>

Marcos,

In #1 you refer to 0x7F, which is correct for the definition of ASCII.
Later in your text you refer to 0xFF, which is confusing.

One improvement you can make is that if you have non-ASCII characters, you can assume UTF-8, but check that it is valid UTF-8.
Most text in CP437 won't satisfy UTF-8 encoding rules.
If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding principles, then you can assume it is CP437.

Martin Duerst published a nice Perl expression for checking UTF-8:

http://www.w3.org/International/questions/qa-forms-utf-8.en.php

So in the case where the BPG11 (Bit Purpose General?) is 0, then if the name is all ASCII, treat it as either CP437 or UTF-8.
If it contains bytes > 0x7F, check whether it satisfies UTF-8. If so, use UTF-8. If not, it's CP437.
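
A minimal sketch of that fallback in Python, for illustration only. The function name decode_zip_name and its parameters are hypothetical, and Python's strict UTF-8 decoder stands in for the Perl expression linked above; the assumption is that it rejects the same malformed sequences.

```python
def decode_zip_name(raw: bytes, gpb11_set: bool) -> str:
    """Guess the encoding of a Zip entry name (hypothetical helper)."""
    if gpb11_set:
        # The flag says UTF-8; a malformed name raises UnicodeDecodeError.
        return raw.decode("utf-8")
    try:
        # All-ASCII names decode identically here and under CP437.
        # Non-ASCII names must form valid UTF-8 sequences to pass.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the Zip default of CP437.
        return raw.decode("cp437")


if __name__ == "__main__":
    print(decode_zip_name(b"Ma\xa4ana.txt", False))      # CP437 bytes
    print(decode_zip_name(b"Ma\xc3\xb1ana.txt", False))  # UTF-8 bytes
```

The heuristic can still misfire on a short CP437 name whose high bytes happen to form a valid UTF-8 sequence, which is why it is a guess rather than a detection.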

tex
 

-----Original Message-----
From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Marcos Caceres
Sent: Friday, November 30, 2007 2:51 AM
To: Richard Ishida; www-international@w3.org
Cc: Arthur Barstow; public-i18n-core@w3.org; public-appformats@w3.org; Thomas Roessler
Subject: Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]


Hi All,
The WAF Working group is seeking assistance with an i18n problem we are having with our Widgets 1.0 specification [1]. The issue we are having is to do with determining the encoding of file names within Zip archives. Here is an overview of the problem:

The Zip specification [2] mandates that file names in Zip files be encoded using IBM Code Page 437 (cp437), unless explicitly indicated by something called the General Purpose Bit 11 (BPG11). According to Zip, when GPB11 is "0", assume cp437; and when it is "1", assume UTF-8. In practice, however, some Zip implementations (such as Mac OS's) do not set GPB11 and still encode file names in UTF-8. So, one problem is that if you adhere to the Zip spec by assuming cp437, and try to extract files from a Zip archive created on a Mac that contains characters beyond the ASCII range, you get garbled text (as, obviously, cp437 does not map to UTF-8).

The main problem is that there is no way, AFAIK, to determine the encoding of a file name inside a Zip archive when you hit bytes that are beyond the ASCII range (it could be either cp437 or UTF-8).

So, for our spec [3], we currently mandate that:

1. if BPG11 is 0, then you can only use US-ASCII (a-zA-Z, decimals, and some other "safe" symbols). And, if you encounter anything beyond 0x7F, then the Zip archive is in error and processing must stop (as the file names will be garbled).

2. if BPG11 is 1, then assume UTF-8.

Obviously, 1 is a bit harsh because it stops processing of the Zip archive as soon as it encounters anything beyond 0x7F (thus, any non-English speaker using any chars above 0xFF or Windows-1252 will be affected... which is bad). However, it should be quite legal for Zip implementations to rely on cp437 when encoding, for example, the word "Mañana".... As you can imagine, if you encode anything beyond 0xFF and assume cp437, you really start getting significant problems.

I guess the question is: is there any way, without changing current Zip implementations or checking the GPB11, to determine if a file name is encoded using UTF-8 or cp437? If not, is it too restrictive to then mandate that only US-ASCII be used when GPB11 is 0?

I should add that the use of GPB11 in Zip was only added very recently (as of Zip version 6.3, 09/29/2006), so it is not widely implemented and probably won't be for a while.

Here is a simple example of the problem: I have "Mañana.txt".... I Zip it up using Windows Compressed Folders, I extract it on a UTF-8 file system (MacOS)....
I get:
Ma§ana.txt

I Zip "Mañana.txt" in Mac OS, I unzip in Windows, and I get:
ManÌƒana.txt
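
Both garbled names can be reproduced with a short Python sketch. It rests on assumptions not stated above: that the Mac tool read the cp437 bytes as MacRoman, that the Mac stored the name as decomposed (NFD) UTF-8, and that the Windows tool read those bytes as Windows-1252.

```python
import unicodedata

name = "Ma\u00f1ana.txt"  # "Mañana.txt" with a precomposed n-tilde

# Case 1: zipped on Windows as cp437, bytes read back as MacRoman (assumed).
cp437_bytes = name.encode("cp437")
garbled_on_mac = cp437_bytes.decode("mac_roman")

# Case 2: zipped on Mac OS as decomposed UTF-8 (assumed), read back as
# Windows-1252 (assumed). NFD splits the n-tilde into "n" plus U+0303.
nfd_utf8_bytes = unicodedata.normalize("NFD", name).encode("utf-8")
garbled_on_windows = nfd_utf8_bytes.decode("cp1252")

if __name__ == "__main__":
    print(cp437_bytes, garbled_on_mac)
    print(nfd_utf8_bytes, garbled_on_windows)
```

In the first case the single byte 0xA4 is read under the wrong code page; in the second the two-byte UTF-8 sequence 0xCC 0x83 for the combining tilde is shown as two separate characters after the plain "n".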

Kind regards,
Marcos

[1] http://dev.w3.org/2006/waf/widgets/
[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
[3] http://dev.w3.org/2006/waf/widgets/#zip-archive

--
Marcos Caceres
http://datadriven.com.au

Received on Friday, 30 November 2007 11:31:48 UTC