Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

One thing to realize is that UTF-8 never contains a single, isolated
non-ASCII byte: non-ASCII bytes only ever occur in sequences of two or
more. Most European languages that use non-ASCII characters, however,
typically use them one at a time between ASCII letters, so in a
single-byte encoding such as CP437 they show up as isolated non-ASCII
bytes.
(Non-European languages normally can't be represented in CP437, so we
don't need to worry about them.)
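
To make the byte-level difference concrete, here is a minimal Python
sketch. The file name and the helper is_valid_utf8 are illustrative
choices, and the sketch assumes Python's built-in "cp437" codec matches
the code page used by the zip tool in question.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "Ma\u00f1ana.txt"  # illustrative name containing one n-with-tilde

cp437_bytes = name.encode("cp437")  # the tilde-n becomes one byte, 0xA4
utf8_bytes = name.encode("utf-8")   # the tilde-n becomes two bytes, 0xC3 0xB1

# Collect only the non-ASCII bytes of each encoding.
cp437_high = [b for b in cp437_bytes if b >= 0x80]
utf8_high = [b for b in utf8_bytes if b >= 0x80]

print([hex(b) for b in cp437_high])  # a single isolated non-ASCII byte
print([hex(b) for b in utf8_high])   # a two-byte sequence
print(is_valid_utf8(cp437_bytes))    # False: 0xA4 cannot stand alone
print(is_valid_utf8(utf8_bytes))     # True
```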

Your example, which I've copied below (I hope it comes through on the
repost), shows this effect quite nicely.

In addition, the valid ranges of lead bytes and continuation bytes in
multi-byte UTF-8 sequences are restricted, making it even less likely
that a random byte pattern is valid UTF-8.
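
As a rough illustration of how restrictive those ranges are, the sketch
below counts how many of the possible pairs of non-ASCII bytes form a
valid UTF-8 sequence. It relies on Python's strict UTF-8 decoder as the
judge; the helper name is hypothetical. Only lead bytes 0xC2-0xDF
followed by a continuation byte 0x80-0xBF qualify.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


HIGH = range(0x80, 0x100)  # the 128 non-ASCII byte values

# Try every pair of non-ASCII bytes and count the ones that decode.
valid_pairs = sum(is_valid_utf8(bytes([a, b])) for a in HIGH for b in HIGH)
total_pairs = len(HIGH) ** 2

print(valid_pairs, total_pairs)             # 1920 16384
print(f"{valid_pairs / total_pairs:.1%}")   # 11.7%
```

So under this (simplistic) model fewer than one in eight random pairs of
high bytes is valid UTF-8, and every additional non-ASCII character in a
name has to pass the same test.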

As a result, if a filename is legal UTF-8, it is highly unlikely that it
could have a reasonable alternative interpretation in CP437. Therefore,
treating the archive as corrupt if it contains non-ASCII bytes would
seem a bit draconian for many types of use. (Your case may be special.)
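
One way to act on this is a "try UTF-8 first, fall back to CP437"
heuristic. The sketch below is only an assumption about how such a
fallback could look; decode_zip_name is a hypothetical helper, not
something defined by the Widgets spec or the ZIP format.

```python
def decode_zip_name(raw: bytes) -> str:
    """Guess the encoding of a raw file name (hypothetical helper).

    Strict UTF-8 is tried first; anything that fails is read as CP437,
    which assigns a character to every byte and so never fails.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp437")


from_utf8 = decode_zip_name("Ma\u00f1ana.txt".encode("utf-8"))
from_cp437 = decode_zip_name("Ma\u00f1ana.txt".encode("cp437"))
print(from_utf8 == from_cp437)  # True: both come back as the same name

# The unlikely misfire: two CP437 box-drawing/shade characters whose
# bytes (0xC3 0xB1) happen to form a valid UTF-8 sequence.
misread = decode_zip_name("\u251c\u2592.txt".encode("cp437"))
print(misread == "\u00f1.txt")  # True: read as UTF-8 instead of CP437
```

The misfire needs a name that is far less plausible than the UTF-8
reading, which is the sense in which the alternative interpretation is
not "reasonable".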

If, on the other hand, there's ever any doubt as to which *single-byte*
character set you are dealing with, i.e. if you find systems that use
neither CP437 nor UTF-8 but some third encoding, then I'd recommend your
approach, because discriminating between different single-byte character
sets ranges from impractical to impossible.
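
The difficulty is easy to demonstrate: a single-byte decoder accepts the
same bytes whichever code page you pick, so nothing in the data tells
you which one was meant. A small sketch, where the choice of code pages
is purely illustrative:

```python
raw = b"Ma\xa4ana.txt"  # one non-ASCII byte, 0xA4

# Illustrative selection of single-byte encodings known to Python.
CODECS = ("cp437", "cp1252", "iso8859_15", "mac_roman")

# Every decode succeeds; only the resulting character differs.
decodings = {codec: raw.decode(codec) for codec in CODECS}

for codec, text in decodings.items():
    print(f"{codec:11} {text}")
```

Each decoding is "valid"; only a human (or a dictionary) can say that
the CP437 reading is the intended one.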

A./

> Here is a simple example of the problem: I have "Mañana.txt".... I Zip
> it up using Windows Compressed Folders, I extract it on a UTF-8 file
> system (MacOS)....
> I get:
> Ma§ana.txt
>
> I Zip "Mañana.txt" in MacOs, I unzip in Windows i get:
> ManÌƒana.txt
>
> Kind regards,
> Marcos

Received on Friday, 30 November 2007 11:57:32 UTC