- From: Maciej Stachowiak <mjs@apple.com>
- Date: Tue, 5 Jun 2007 09:59:49 -0700
On Jun 5, 2007, at 12:18 AM, Henri Sivonen wrote:

> On May 29, 2007, at 13:13, Henri Sivonen wrote:
>
>> To avoid stepping on the toes of Charmod more than is necessary, I
>> suggest making it non-conforming for a document to have bytes in
>> the 0x80–0x9F range when the character encoding is declared to be
>> one of the ISO-8859 family encodings.
>
> I've been thinking about this. I have a proposal on how to spec
> this *conceptually* and how to implement this with error reporting.
> I am assuming here that 1) No one ever intends C1 code points to be
> present in the decoded stream and 2) we want, as a Charmod
> correctness fig leaf, to make the C1 bytes non-conforming when
> ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or
> Windows-874 decoding is needed.
>
> Based on the behavior of Minefield and Opera 9.20, the following
> seems to be the least Charmod violating and least quirky approach
> that could possibly work:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code
> point range, this is a document conformance violation.
> 2a) If the declared encoding was ISO-8859-1, replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-1252.
> 2b) If the declared encoding was ISO-8859-11, replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-874.
>
>
> [
> The *simplest* and most robust (and maximally Charmod-violating)
> thing would be:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code
> point range, this is a document conformance violation. Replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-1252.
>
> But this isn't what Minefield, Opera 9.20 and WebKit nightlies do.
> ]

What we actually do in WebKit is always use a windows-1252 decoder
when ISO-8859-1 is requested.

I don't think it's very helpful to make all documents that declare an
ISO-8859-1 encoding and use characters in the C1 range nonconforming.
It's true that they are counting on nonstandard processing of the
nominally declared encoding, but I don't think that causes a problem
in practice, as long as the rule is well known. It seems simpler to
just make latin1 an alias for winlatin1.

Regards,
Maciej
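
For illustration only, the two approaches discussed above can be sketched in Python. This is a rough sketch, not what any browser actually ships: the function names `decode_with_c1_fixup` and `decode_latin1_as_alias` and the `C1_FIXUP` table are hypothetical, and the codecs are Python's standard `iso-8859-1`, `iso-8859-11`, `cp1252` and `cp874`. Python's `cp1252` leaves a few bytes (such as 0x81) unmapped, so what to do with them is an assumption marked in the comments.

```python
# Hypothetical table: declared ISO-8859 label -> Windows codec used for C1 fixup.
C1_FIXUP = {"iso-8859-1": "cp1252", "iso-8859-11": "cp874"}


def decode_with_c1_fixup(data: bytes, declared: str):
    """Quoted proposal: decode as declared, then flag and remap C1 code points.

    Returns (text, violations), where violations lists the character
    indexes at which a C1 code point (U+0080..U+009F) was decoded.
    """
    fallback = C1_FIXUP.get(declared.lower())
    out = []
    violations = []
    for index, ch in enumerate(data.decode(declared)):
        if 0x80 <= ord(ch) <= 0x9F:
            violations.append(index)  # step 2: conformance violation
            if fallback is not None:
                try:
                    # steps 2a/2b: cast the code point to a byte, re-decode it
                    ch = bytes([ord(ch)]).decode(fallback)
                except UnicodeDecodeError:
                    # Assumption: keep the C1 character when the Windows
                    # codec has no mapping for this byte.
                    pass
        out.append(ch)
    return "".join(out), violations


def decode_latin1_as_alias(data: bytes) -> str:
    """Reply's suggestion: treat ISO-8859-1 as an alias for windows-1252.

    Assumption: bytes that Python's cp1252 leaves unmapped become U+FFFD.
    """
    return data.decode("cp1252", errors="replace")
```

For bytes that Windows-1252 maps, both functions produce the same text for a document declared as ISO-8859-1. The difference is that the first also reports each C1 occurrence as a violation, while the second reports nothing.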
Received on Tuesday, 5 June 2007 09:59:49 UTC