- From: Øistein E. Andersen <liszt@coq.no>
- Date: Sun, 12 Apr 2009 11:08:34 +0100
On 2 Sep 2008, at 06:06, Ian Hickson wrote:

> On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
>>
>> 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
>> IE7, on the other hand, simply ignores the high bit (as it does for
>> a few other 7-bit encodings, by the way). Perhaps this alias could
>> be dropped from the other browsers.
>
> Ignoring the high bit seems like a dangerous security bug; dropping any
> character with a high bit as U+FFFD seems unnecessarily drastic.

According to a test I did using browsershots.org, IE8 actually seems to do the latter (8-bit characters are rendered as squares), which looks like an argument in favour of the more `drastic' option.

> I've made the spec go with the O/F/S behaviour here.

This has the advantage of not adding ASCII as a separate encoding, and Windows-1252 is presumably one of the encodings most often mislabelled as ASCII. However, IE has ignored the high bit at least since 5.01 (IE4 via browsershots.org treats it as CP1252, but this could well be locale-dependent), so there may not be that many mislabelled pages. Has anyone got a list of pages which are labelled as ASCII and contain 8-bit characters?

This is probably not very important. U+FFFD is `purer', whereas Windows-1252 has the potential to rescue a few pages. It is, however, essential that 8-bit characters be considered non-conforming, since they do not in fact work (as Windows-1252 bytes) in IE5-IE8. This is currently the case, but I think Henri Sivonen has argued that `misinterpretation for compatibility' should not be considered a conformance error (which would probably be fairly harmless for other mappings).

>> 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite
>> inconsistently; [...]
>
> I think the HTML5 spec does what is necessary here, but it may be that
> the encodings specs are vague still.

[For the record, HTML5 currently requires delete and C1 characters (as well as C0 save white space) to be replaced by U+FFFD during `pre-processing of the input stream', which effectively circumvents the problem that character encoding specifications treat this range in a vague and inconsistent manner.]

-- 
Øistein E. Andersen
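
As a rough illustration of the options discussed in this message, the Python sketch below decodes the same US-ASCII-labelled bytes in three ways (as Windows-1252, with the high bit ignored, and with every 8-bit byte turned into U+FFFD) and then applies a control-character replacement of the kind described in the bracketed note. All function names are hypothetical, Python's `cp1252` codec is only an approximation of what any browser does (it leaves a few bytes undefined), and the set of C0 characters counted as white space here is an assumption, not taken from the spec.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_as_windows_1252(data: bytes) -> str:
    """Treat US-ASCII as an alias of Windows-1252 (the O/F/S behaviour).

    Python's cp1252 codec leaves a handful of bytes undefined; they are
    replaced by U+FFFD here, which need not match browser behaviour.
    """
    return data.decode("cp1252", errors="replace")


def decode_ignoring_high_bit(data: bytes) -> str:
    """Clear bit 7 of every byte, as IE5-IE7 are reported to do."""
    return bytes(b & 0x7F for b in data).decode("ascii")


def decode_strict_ascii(data: bytes) -> str:
    """Map every byte above 0x7F to U+FFFD (the 'drastic' option)."""
    return data.decode("ascii", errors="replace")


# Assumption: the C0 characters kept as white space are tab, LF, FF and CR.
ASSUMED_WHITE_SPACE = "\t\n\f\r"


def replace_controls(text: str) -> str:
    """Replace C0 (except assumed white space), delete and C1 by U+FFFD."""
    def is_bad(ch: str) -> bool:
        cp = ord(ch)
        if ch in ASSUMED_WHITE_SPACE:
            return False
        return cp < 0x20 or 0x7F <= cp <= 0x9F

    return "".join(REPLACEMENT if is_bad(ch) else ch for ch in text)


if __name__ == "__main__":
    sample = b"caf\xe9 \xbcb\xbe"
    for decode in (decode_as_windows_1252,
                   decode_ignoring_high_bit,
                   decode_strict_ascii):
        print(decode.__name__, repr(decode(sample)))
```

With the sample bytes above, clearing the high bit turns 0xBC and 0xBE into `<` and `>`, which shows why that behaviour was called a security risk: markup can appear where the author's bytes contained none.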
Received on Sunday, 12 April 2009 03:08:34 UTC