- From: Marcos Caceres <marcosscaceres@gmail.com>
- Date: Mon, 3 Dec 2007 11:08:29 +1000
- To: "Asmus Freytag" <asmusf@ix.netcom.com>
- Cc: "Richard Ishida" <ishida@w3.org>, www-international@w3.org, "Arthur Barstow" <art.barstow@nokia.com>, public-i18n-core@w3.org, "public-appformats@w3.org" <public-appformats@w3.org>, "Thomas Roessler" <tlr@w3.org>
On Nov 30, 2007 9:56 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> One thing to realize is that in UTF-8 you can never have a single
> non-ASCII byte. It's only ever two or more in sequence. However, most
> European languages that use non-ASCII characters, typically do so with
> single non-ASCII characters.
> (non-European languages normally can't be represented in CP437, so we
> don't worry about them).
>
> Your example, which I've copied below (I hope it comes through on the
> repost) shows this effect quite nicely.
>
> In addition, the valid ranges of first and following bytes in multi-byte
> sequences of non-ASCII UTF-8 bytes are restricted, making it even harder
> for a random pattern to be valid UTF-8.
>
> As a result, if a filename is legal UTF-8, it is highly unlikely that it
> could have a reasonable alternative interpretation in CP437. Therefore,
> treating the archive as corrupt if it contains non-ASCII bytes, would
> seem a bit draconian for many types of uses. (Your case may be special).
>
> If, on the other hand, there's ever any doubt as to which *single-byte*
> character set you are dealing with, i.e. if you find systems that use
> non-CP437 and non-UTF-8, but something third, then I'd recommend your
> approach, because discriminating between different single-byte character
> sets is something that ranges from impractical to impossible.

Thanks for this info, it's been very useful.

Kind regards,
--
Marcos Caceres
http://datadriven.com.au
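
The heuristic described in the quoted message could be sketched roughly as follows: try a strict UTF-8 decode of the raw filename bytes first, and fall back to CP437 only if that fails. This is a minimal illustration, not anything taken from the thread or from a specification; the function name `decode_archive_filename` is hypothetical, and it assumes the only two candidate encodings are UTF-8 and CP437.

```python
from typing import Tuple


def decode_archive_filename(raw: bytes) -> Tuple[str, str]:
    """Return (decoded_name, encoding_used) for raw filename bytes.

    Hypothetical helper: assumes the name is either UTF-8 or CP437.
    """
    try:
        # A strict decode rejects a lone non-ASCII byte, an invalid lead
        # byte, or a missing/invalid continuation byte.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Python's cp437 codec maps every byte value, so this cannot fail.
        return raw.decode("cp437"), "cp437"


if __name__ == "__main__":
    # "café" with a single CP437 byte (0x82) for the accented letter.
    print(decode_archive_filename(b"caf\x82"))
    # "café" encoded as UTF-8 (two bytes, 0xC3 0xA9, for the accented letter).
    print(decode_archive_filename("café".encode("utf-8")))
```

Pure ASCII names decode identically under both encodings, so reporting them as UTF-8 here is harmless. As the quoted message notes, this approach says nothing about telling two different single-byte character sets apart.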
Received on Monday, 3 December 2007 01:08:45 UTC