- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Fri, 30 Nov 2007 03:56:45 -0800
- To: Marcos Caceres <marcosscaceres@gmail.com>
- CC: Richard Ishida <ishida@w3.org>, www-international@w3.org, Arthur Barstow <art.barstow@nokia.com>, public-i18n-core@w3.org, "public-appformats@w3.org" <public-appformats@w3.org>, Thomas Roessler <tlr@w3.org>
One thing to realize is that in UTF-8 you can never have a single non-ASCII byte. It's only ever two or more in sequence. However, most European languages that use non-ASCII characters typically do so with single non-ASCII characters. (Non-European languages normally can't be represented in CP437, so we don't worry about them.) Your example, which I've copied below (I hope it comes through on the repost), shows this effect quite nicely.

In addition, the valid ranges of first and following bytes in multi-byte UTF-8 sequences are restricted, making it even harder for a random byte pattern to be valid UTF-8.

As a result, if a filename is legal UTF-8, it is highly unlikely that it could have a reasonable alternative interpretation in CP437.

Therefore, treating the archive as corrupt if it contains non-ASCII bytes would seem a bit draconian for many types of uses. (Your case may be special.)

If, on the other hand, there's ever any doubt as to which *single-byte* character set you are dealing with, i.e. if you find systems that use neither CP437 nor UTF-8 but some third encoding, then I'd recommend your approach, because discriminating between different single-byte character sets is something that ranges from impractical to impossible.

A./

> Here is a simple example of the problem: I have "Mañana.txt".... I Zip
> it up using Windows Compressed Folders, I extract it on a UTF-8 file
> system (MacOS)....
> I get:
> Ma§ana.txt
>
> I Zip "Mañana.txt" in MacOs, I unzip in Windows i get:
> ManĖana.txt
>
> Kind regards,
> Marcos
>
> [1] http://dev.w3.org/2006/waf/widgets/
> [2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> [3] http://dev.w3.org/2006/waf/widgets/#zip-archive
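
The heuristic argued for in the reply above (accept a filename as UTF-8 when its bytes form legal UTF-8, otherwise fall back to CP437) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `decode_zip_name` is hypothetical, and neither the ZIP format nor the widgets draft mandates this behaviour. It relies on Python's built-in `utf-8` and `cp437` codecs.

```python
from typing import Tuple


def decode_zip_name(raw: bytes) -> Tuple[str, str]:
    """Guess the encoding of a raw ZIP filename (hypothetical helper).

    Returns (decoded_text, encoding_label).
    """
    # Pure ASCII reads the same in both encodings, so there is no ambiguity.
    if raw.isascii():
        return raw.decode("ascii"), "ascii"
    try:
        # Strict decoding rejects lone non-ASCII bytes and any
        # out-of-range lead or continuation bytes.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not legal UTF-8: assume the legacy CP437 interpretation.
        return raw.decode("cp437"), "cp437"


# "ñ" is the single byte 0xA4 in CP437, which can never be valid UTF-8 alone.
cp437_name = "Mañana.txt".encode("cp437")
# In UTF-8 the same character is the two-byte sequence 0xC3 0xB1.
utf8_name = "Mañana.txt".encode("utf-8")

print(decode_zip_name(cp437_name))
print(decode_zip_name(utf8_name))
```

The sketch does not help when a third single-byte encoding is in play; every byte string decodes "successfully" as CP437, so the fallback branch cannot detect a wrong guess.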
Received on Friday, 30 November 2007 11:57:33 UTC