Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments]

On Nov 30, 2007 9:56 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> One thing to realize is that valid UTF-8 never contains an isolated
> non-ASCII byte. Non-ASCII bytes only ever occur in sequences of two or
> more. However, most European languages that use non-ASCII characters
> typically use them one at a time, between ASCII letters, and in CP437
> each such character is a single byte.
> (non-European languages normally can't be represented in CP437, so we
> don't worry about them).
>
> Your example, which I've copied below (I hope it comes through on the
> repost) shows this effect quite nicely.
>
> In addition, the valid ranges of first and following bytes in multi-byte
> sequences of non-ASCII UTF-8 bytes are restricted, making it even harder
> for a random pattern to be valid UTF-8.
>
> As a result, if a filename is legal UTF-8, it is highly unlikely that it
> could have a reasonable alternative interpretation in CP437. Therefore,
> treating the archive as corrupt if it contains non-ASCII bytes would
> seem a bit draconian for many types of uses. (Your case may be special.)
>
> If, on the other hand, there's ever any doubt as to which *single-byte*
> character set you are dealing with, i.e. if you find systems that use
> neither CP437 nor UTF-8 but some third encoding, then I'd recommend your
> approach, because discriminating between different single-byte character
> sets ranges from impractical to impossible.

Thanks for this info, it's been very useful.
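
For reference, here is a minimal sketch of the heuristic described above,
assuming the raw filename bytes are already in hand. The function name
decode_zip_filename is hypothetical and this is not something the Widgets
spec defines. It relies only on Python's strict "utf-8" codec, which
rejects isolated non-ASCII bytes and out-of-range lead or continuation
bytes, and on the "cp437" codec as the fallback.

```python
from typing import Tuple


def decode_zip_filename(raw: bytes) -> Tuple[str, str]:
    """Guess the encoding of a raw filename (hypothetical helper).

    Returns (decoded_name, encoding_label).
    """
    # Pure ASCII reads the same in ASCII, UTF-8 and CP437.
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii"), "ascii"
    try:
        # Strict decoding fails on any byte sequence that is not
        # well-formed UTF-8, e.g. a lone non-ASCII byte.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is CP437. This step cannot tell
        # CP437 apart from other single-byte character sets.
        return raw.decode("cp437"), "cp437"


if __name__ == "__main__":
    for name in ("readme.txt", "résumé.txt"):
        for enc in ("utf-8", "cp437"):
            print(enc, decode_zip_filename(name.encode(enc)))
```

The CP437 fallback is the weak point: a name written in some other
single-byte character set would still decode without error, just to the
wrong characters, which is the case where rejecting the archive looks
safer.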

Kind regards,
-- 
Marcos Caceres
http://datadriven.com.au

Received on Monday, 3 December 2007 01:08:47 UTC