- From: Marcos Caceres <marcosscaceres@gmail.com>
- Date: Fri, 30 Nov 2007 21:49:41 +1000
- To: "Tex Texin" <tex@yahoo-inc.com>
- Cc: "Richard Ishida" <ishida@w3.org>, www-international@w3.org, "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, public-appformats@w3.org, "Thomas Roessler" <tlr@w3.org>
Hi Tex,

On Nov 30, 2007 9:28 PM, Tex Texin <tex@yahoo-inc.com> wrote:
> Marcos,
>
> In #1 you refer to 0x7F. (Which is correct for the definition of ASCII).
> In your text later, you refer to 0xFF, which is confusing.

Sorry, I meant that one can encode multi-byte chars... which confuses an implementation trying to read cp437.

> One improvement you can make is that if you have non-ASCII characters, you can assume UTF-8, but check that it is valid UTF-8.
> Most text in CP437 won't satisfy UTF-8 encoding rules.
> If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding principles, then you can assume it is CP437.
>
> Martin Duerst published a nice Perl expression for checking UTF-8
>
> http://www.w3.org/International/questions/qa-forms-utf-8.en.php

Thanks, I'll see if that does the trick. If it does, I'll spec it up as a possible solution.

> So in the case where the BPG11 (Bit Purpose General?) is 0, then if the name is all ASCII, treat it as either CP437 or UTF-8.
> If it contains bytes >0x7F check if it satisfies UTF-8. If so, then use UTF-8. If not, it's CP437.

Apologies, BPG11 = GPB11 (general purpose bit 11) :P

Yes, that's kind of what I was thinking too... I guess it's the range 0x80-0xFF that is worrying me, as that is the part that is incompatible with UTF-8; but if Martin's script solves the problem, then I might not have to worry about it too much.

Kind regards,
Marcos
-- 
Marcos Caceres
http://datadriven.com.au
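
For reference, the fallback discussed in this message could be sketched roughly as follows. This is only an illustration, not spec text: the function name `decode_zip_name` and its parameters are hypothetical, and Python's strict UTF-8 decoder stands in for Martin Duerst's Perl expression as the validity check.

```python
def decode_zip_name(raw: bytes, gpb11_set: bool = False) -> str:
    """Guess the encoding of a Zip entry name (hypothetical helper).

    Assumed rules, taken from the discussion above:
      1. GPB11 set             -> the name is UTF-8.
      2. GPB11 clear, all ASCII -> CP437 and UTF-8 agree, so either works.
      3. GPB11 clear, bytes > 0x7F, valid UTF-8 -> treat as UTF-8.
      4. Otherwise             -> treat as CP437.
    """
    if gpb11_set:
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        return raw.decode("utf-8")

    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences and overlong forms. Pure ASCII names also pass here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is defined in CP437, so this cannot fail.
        return raw.decode("cp437")


if __name__ == "__main__":
    print(decode_zip_name(b"readme.txt"))             # all ASCII
    print(decode_zip_name("é.txt".encode("utf-8")))   # valid UTF-8
    print(decode_zip_name(b"\x82.txt"))               # not UTF-8, falls back to CP437
```

One caveat with this kind of heuristic: a CP437 name whose high bytes happen to form a valid UTF-8 sequence (for example the two bytes 0xC3 0xA9) would be read as UTF-8. Such names should be rare in practice, but the check is a guess rather than a guarantee.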
Received on Friday, 30 November 2007 11:49:49 UTC