Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments] from Marcos Caceres on 2007-12-03 (www-international@w3.org from October to December 2007)

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Mon, 3 Dec 2007 11:08:29 +1000
To: "Asmus Freytag" <asmusf@ix.netcom.com>
Cc: "Richard Ishida" <ishida@w3.org>, www-international@w3.org, "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, "public-appformats@w3.org" <public-appformats@w3.org>, "Thomas Roessler" <tlr@w3.org>
Message-ID: <b21a10670712021708k1849b197y5a13422ecb17cd06@mail.gmail.com>

On Nov 30, 2007 9:56 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> One thing to realize is that in UTF-8 you can never have a single
> non-ASCII byte. It's only ever two or more in sequence. However, most
> European languages that use non-ASCII characters typically do so with
> single non-ASCII characters.
> (non-European languages normally can't be represented in CP437, so we
> don't worry about them).
>
> Your example, which I've copied below (I hope it comes through on the
> repost) shows this effect quite nicely.
>
> In addition, the valid ranges of first and following bytes in multi-byte
> sequences of non-ASCII UTF-8 bytes are restricted, making it even harder
> for a random pattern to be valid UTF-8.
>
> As a result, if a filename is legal UTF-8, it is highly unlikely that it
> could have a reasonable alternative interpretation in CP437. Therefore,
> treating the archive as corrupt if it contains non-ASCII bytes would
> seem a bit draconian for many types of uses. (Your case may be special).
>
> If, on the other hand, there's ever any doubt as to which *single-byte*
> character set you are dealing with, i.e. if you find systems that use
> neither CP437 nor UTF-8 but some third encoding, then I'd recommend your
> approach, because discriminating between different single-byte character
> sets is something that ranges from impractical to impossible.

Thanks for this info, it's been very useful.
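
For the record, the heuristic described above could be sketched roughly as
follows in Python. This is only an illustration under stated assumptions,
not text from the Widgets spec: the function name decode_zip_filename is
hypothetical, and it assumes the only candidate encodings are ASCII, UTF-8
and CP437. As noted above, it cannot tell CP437 apart from any other
single-byte character set.

```python
def decode_zip_filename(raw: bytes) -> tuple:
    """Guess the encoding of a raw file name and return (name, encoding).

    Hypothetical helper: assumes the bytes are ASCII, UTF-8 or CP437.
    """
    # Pure ASCII decodes identically in all three encodings.
    if raw.isascii():
        return raw.decode("ascii"), "ascii"
    try:
        # Strict decoding rejects lone non-ASCII bytes and invalid
        # lead/continuation byte combinations, so text written with
        # isolated accented characters in CP437 rarely passes.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Python's cp437 codec maps every byte value, so this cannot fail.
        return raw.decode("cp437"), "cp437"
```

For example, "café" encoded as CP437 ends in the single byte 0x82, which is
not valid UTF-8 on its own, so the sketch falls back to CP437. The same
name encoded as UTF-8 ends in the two-byte sequence 0xC3 0xA9 and is
accepted as UTF-8.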

Kind regards,
-- 
Marcos Caceres
http://datadriven.com.au

Received on Monday, 3 December 2007 01:08:51 UTC