Re: I18N issues for Widgets Spec [Was: Re: [Widgets] ASCII File names - request for comments] from Marcos Caceres on 2007-11-30 (public-appformats@w3.org from November 2007)

From: Marcos Caceres <marcosscaceres@gmail.com>
Date: Fri, 30 Nov 2007 21:49:41 +1000
To: "Tex Texin" <tex@yahoo-inc.com>
Cc: "Richard Ishida" <ishida@w3.org>, www-international@w3.org, "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, public-appformats@w3.org, "Thomas Roessler" <tlr@w3.org>
Message-ID: <b21a10670711300349p40d8dfe3kd90372d7d614ba97@mail.gmail.com>

Hi Tex,

On Nov 30, 2007 9:28 PM, Tex Texin <tex@yahoo-inc.com> wrote:
> Marcos,
>
> In #1 you refer to 0x7F. (Which is correct for the definition of ASCII).
> In your text later, you refer to 0xFF, which is confusing.

Sorry, I meant that one can encode multi-byte characters, which
confuses an implementation trying to read the name as CP437.

> One improvement you can make is that if you have non-ASCII characters, you can assume UTF-8, but check that it is valid UTF-8.
> Most text in CP437 won't satisfy UTF-8 encoding rules.
> If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding principles, then you can assume it is CP437.
>
> Martin Duerst published a nice Perl expression for checking UTF-8
>
> http://www.w3.org/International/questions/qa-forms-utf-8.en.php

Thanks, I'll see if that does the trick. If it does, I'll spec it up
as a possible solution.
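
For reference, the kind of check being discussed can be sketched in Python as a byte-level pattern for well-formed UTF-8. This is only an illustration in the spirit of the Perl expression linked above, not a copy of it: the name `is_valid_utf8` is made up here, and this version accepts the whole ASCII range 0x00-0x7F, which may be looser than what a form-input check would allow.

```python
import re

# Byte ranges for well-formed UTF-8 sequences (no overlong forms,
# no UTF-16 surrogates, nothing above U+10FFFF).
_UTF8_PATTERN = re.compile(
    rb"""\A(?:
        [\x00-\x7F]                        # ASCII (assumption: all of it)
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, excluding overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if every byte of `raw` belongs to a well-formed UTF-8 sequence."""
    return _UTF8_PATTERN.match(raw) is not None
```

A lone high byte such as `b"\x82"` fails this check, which is why most CP437 names with non-ASCII characters would be rejected as UTF-8.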

> So in the case where the BPG11 (Bit Purpose General?) is 0, then if the name is all ASCII, treat it as either CP437 or UTF-8.
> If it contains bytes >0x7F, check if it satisfies UTF-8. If so, then use UTF-8. If not, it's CP437.

Apologies, BPG11 = GPB11 (general purpose bit 11) :P

Yes, that's kind of what I was thinking too.... I guess it's the range
0x80-0xFF that is worrying me, as that is the part that is incompatible
with UTF-8; but if Martin's script solves the problem, then I might not
have to worry about it too much.
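
Put together, the decision procedure described in this exchange might look roughly like the Python below. It is a sketch under assumptions, not spec text: the function name `decode_zip_name` and its return shape are hypothetical, it relies on Python's strict UTF-8 decoder for the validity check, and it assumes a name with general purpose bit 11 set is to be read as UTF-8.

```python
from typing import Tuple


def decode_zip_name(raw: bytes, gpb11_set: bool) -> Tuple[str, str]:
    """Guess the encoding of a raw ZIP file name.

    Returns (decoded name, encoding label). Hypothetical helper for
    illustration only.
    """
    if gpb11_set:
        # Assumption: bit 11 set means the name is declared to be UTF-8.
        return raw.decode("utf-8"), "utf-8"

    if all(byte < 0x80 for byte in raw):
        # Pure ASCII reads the same under CP437 and UTF-8.
        return raw.decode("ascii"), "ascii"

    try:
        # Bytes in 0x80-0xFF are present: accept only well-formed UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so fall back to CP437.
        return raw.decode("cp437"), "cp437"
```

For example, `decode_zip_name(b"\x82.txt", False)` falls through to CP437, while the UTF-8 bytes for the same accented name are accepted as UTF-8. The heuristic is not airtight: a CP437 name whose high bytes happen to form a well-formed UTF-8 sequence would be read as UTF-8.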

Kind regards,
Marcos
-- 
Marcos Caceres
http://datadriven.com.au

Received on Friday, 30 November 2007 11:50:04 UTC