- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 2 Sep 2008 05:06:22 +0000 (UTC)
On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
>
> The current table seems to cover the mappings between different common
> compatible 8-bit encodings as implemented in IE7, yes. The table at
> <http://coq.no/character-tables/mime/en> gives a bit more detail, most
> of which is better kept outside HTML5 itself. However, the following
> observations can be made:
>
> 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
> IE7, on the other hand, simply ignores the high bit (as it does for
> a few other 7-bit encodings, by the way). Perhaps this
> alias could be dropped from the other browsers.

Ignoring the high bit seems like a dangerous security bug; dropping any
character with a high bit as U+FFFD seems unnecessarily drastic. I've made
the spec go with the O/F/S behaviour here.

> 2. Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5),
> whereas Safari seems to do the same for text/plain; charset=ISO-8859-11
> instead [Version 3.1.2 (5525.20.1)]. Bug?

I believe so.

> 3. For certain character sets, different browsers map to different, but visually
> similar Unicode characters. Sometimes, one mapping is old/outdated,
> but this is not always the case.

Not sure what I can do about that.

> 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently;
> different browsers do different things for the same encoding, and the same
> browser gives analogous encodings different treatment.
>
> (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
> which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really
> seem to regard this feature as an essential part of the character set:
>
>     the charset is often coded with both
>     graphical and control character sets.
>     If the coded character set is
>     a 96-character set, it is tabled with the relevant GL set (normally
>     ISO-IR-6) and with ISO 6429 as C0 and C1
>
> As for the Windows-* encodings, Microsoft documentation treats bytes
> in this range as unassigned unless they are mapped to graphical characters,
> whereas Microsoft products return the underlying byte value in this case.)

I think the HTML5 spec does what is necessary here, but it may be that the
encodings specs are vague still.

> 5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
> is probably more reasonable (assuming that letters are more important than
> line-drawing characters), but neither is actually correct given that the encodings
> are, strictly speaking, incompatible. This issue will of course look a bit different
> if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU)
> are frequently mislabelled as KOI8-U.

I guess we'll see what feedback we get on this when testing begins.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 1 September 2008 22:06:22 UTC
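
As an illustration of point 1 in the message, the three treatments of high-bit bytes under a US-ASCII label can be compared with a minimal Python sketch. The function name `decode_labelled_us_ascii` and the policy names are hypothetical, chosen only for this example. Python's `cp1252` codec stands in for Windows-1252 and leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so it may not match any particular browser or the spec's table exactly.

```python
def decode_labelled_us_ascii(data: bytes, policy: str) -> str:
    """Decode bytes labelled US-ASCII under one of three illustrative policies.

    The policy names are made up for this sketch; they are not spec terms.
    """
    if policy == "as-windows-1252":
        # Behaviour the message attributes to Opera/Firefox/Safari:
        # treat the US-ASCII label as an alias for Windows-1252.
        # Python's cp1252 has a few undefined bytes; replace those with U+FFFD.
        return data.decode("cp1252", errors="replace")
    if policy == "ignore-high-bit":
        # Behaviour the message attributes to IE7: clear bit 7 of every byte.
        return bytes(b & 0x7F for b in data).decode("ascii")
    if policy == "replace-high-bit":
        # The alternative the message calls drastic: every byte with the
        # high bit set becomes U+FFFD.
        return data.decode("ascii", errors="replace")
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # 0xBC and 0xBE become '<' and '>' once the high bit is cleared, which is
    # one way the ignore-high-bit policy can let markup past a filter that
    # only inspected the raw bytes.
    sample = b"\xbcscript\xbe"
    for policy in ("as-windows-1252", "ignore-high-bit", "replace-high-bit"):
        print(policy, repr(decode_labelled_us_ascii(sample, policy)))
```

With the input `b"\xbcscript\xbe"`, the ignore-high-bit policy yields `<script>`, whereas the Windows-1252 reading yields `¼script¾`, which is consistent with the security concern raised in the reply.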