- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Tue, 02 Jul 2013 10:05:11 +0300
- To: whatwg@lists.whatwg.org
2013-07-02 2:16, Ian Hickson wrote:

> The reason that ISO-8859-1 is currently non-conforming is that the label
> no longer means "ISO-8859-1", as defined by the ISO. It actually means
> "Windows-1252".

Declaring ISO-8859-1 causes no problems when the document does not contain bytes in the range 0x80...0x9F, as it should not. There is a huge number of existing pages to which this applies, and they are valid by HTML 4.01 (or, as the case may be, XHTML 1.0) rules. Declaring all of them as non-conforming and issuing an error message about them does not seem to be useful.

You might say that such pages are risky and the risk should be announced, because if the page is later changed so that it contains a byte in that range, it will be interpreted not by ISO-8859-1 but by windows-1252.

From the perspective of tradition and practice, this is just about error handling. By HTML 4.01, those bytes should be interpreted as control characters according to ISO-8859-1, and this would make the document invalid, since those control characters are disallowed in HTML 4.01. Thus, whatever browsers do with the document then is error processing, and nowadays probably all browsers have chosen to interpret them by windows-1252.

Admittedly, in XHTML syntax it’s different, since those control characters are not forbidden but (mostly) “just” discouraged.

I think the simplest approach would be to declare U+0080...U+009F as forbidden in both serializations. Then the issue could be defined purely in terms of error handling. If you declare ISO-8859-1 and do not have bytes 0x80...0x9F, fine. If you do have such a byte, we should still treat the encoding declaration as conforming as such, but validators should report the characters as errors and browsers should handle this error by interpreting the document as if the declared encoding were windows-1252.
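To make the proposed browser-side rule concrete, here is a minimal sketch in Python. The function name is hypothetical and not taken from any specification. One labeled assumption: Python's `cp1252` codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch replaces them with U+FFFD, which may differ from what browsers actually do with those bytes.

```python
def decode_declared_latin1(data: bytes):
    """Decode a document whose declared encoding is ISO-8859-1.

    Returns (text, has_c1_bytes). Hypothetical helper for illustration.
    """
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in data)
    if not has_c1_bytes:
        # No bytes in 0x80..0x9F: plain ISO-8859-1 decoding applies.
        return data.decode("latin-1"), False
    # Error handling: act as if windows-1252 had been declared.
    # Assumption: bytes undefined in Python's cp1252 codec become U+FFFD.
    return data.decode("cp1252", errors="replace"), True
```

The returned flag is what a validator would turn into a diagnostic; the declaration itself is never rejected.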
> It seems bad, and maybe rather full of hubris, to make it conforming to
> use a label that we know will be interpreted in a manner that is a willful
> violation of its spec (that is, the ISO spec).

In most cases, there is no violation of the ISO standard. Or, to put it another way, taking ISO-8859-1 as a synonym for windows-1252 is fully compatible with the ISO 8859-1 standard as long as the document does not contain data that would be interpreted by ISO 8859-1 as C1 Controls (U+0080...U+009F), which it should not contain.

> I would rather go back to having the conflicts be caught by validators
> than just throw the ISO spec under the bus, but it's really up to you
> (Henri, and whoever else is implementing a validator).

Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has done for years, and remains happy, until he tries to validate his page as HTML5. Is it useful that he gets an error message (and gets confused), even though his data is all ISO-8859-1 (without C1 Controls)?

Suppose then that he accidentally enters, say, the euro sign “€” because his text editor or other authoring tool lets him do so, and stores it as windows-1252 encoded. Even then, no practical problem arises, due to the common error handling behavior, but at this point, it might be useful to give some diagnostic if the document is being validated. I would say that even then a warning about the problem would be sufficient, but it could be treated as an error: a data error, with defined error handling. The occurrences of the offending bytes should be reported (which is what now happens when validating as HTML 4.01, even though the error messages are cryptic, like “non SGML character number 128”). The author might then decide to declare the encoding as windows-1252.
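As an illustration of the kind of per-occurrence diagnostic meant here, the following Python sketch lists each offending byte with its offset and its windows-1252 reading. The function name and the tuple layout are hypothetical, not the behavior of any existing validator. The final lines check the compatibility claim above: outside 0x80...0x9F, the two decodings agree byte for byte (in Python's codecs, at least).

```python
def report_c1_bytes(data: bytes):
    """List (offset, byte, windows-1252 reading) for bytes 0x80..0x9F."""
    findings = []
    for offset, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # Assumption: bytes undefined in Python's cp1252 codec
            # are shown as U+FFFD via errors="replace".
            reading = bytes([byte]).decode("cp1252", errors="replace")
            findings.append((offset, byte, reading))
    return findings


# Outside the C1 range, ISO-8859-1 and windows-1252 decode identically.
safe = bytes(b for b in range(256) if not 0x80 <= b <= 0x9F)
assert safe.decode("latin-1") == safe.decode("cp1252")
```

A document with an empty report is plain ISO-8859-1 data and would need no message at all.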
But even though the most common cause of such a situation is an attempt to use (mostly due to ignorance) certain characters without realizing that they do not exist in ISO-8859-1, it might be a symptom of some different problem, like malformed data unintentionally appearing in a document. It is thus useful to draw the author’s attention to the specific problems, to the incorrect data where it appears, rather than blindly taking ISO-8859-1 as windows-1252.

Yucca
Received on Tuesday, 2 July 2013 07:05:41 UTC