Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason from Jukka K. Korpela on 2013-07-02 (public-whatwg-archive@w3.org from July 2013)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 02 Jul 2013 10:05:11 +0300
To: whatwg@lists.whatwg.org
Message-ID: <51D27BA7.8060706@cs.tut.fi>
2013-07-02 2:16, Ian Hickson wrote:

> The reason that ISO-8859-1 is currently non-conforming is that the label
> no longer means "ISO-8859-1", as defined by the ISO. It actually means
> "Windows-1252".

Declaring ISO-8859-1 has no problems when the document does not contain 
bytes in the range 0x80...0x9F, as it should not. There is a huge number 
of existing pages to which this applies, and they are valid by HTML 4.01 
(or, as the case may be, XHTML 1.0) rules. Declaring all of them as 
non-conforming and issuing an error message about them does not seem to 
be useful.

You might say that such pages are risky and the risk should be 
announced, because if the page is later changed so that it contains a byte 
in that range, it will be interpreted not according to ISO-8859-1 but 
according to windows-1252. From the perspective of tradition and practice, this is 
just about error handling. By HTML 4.01, those bytes should be 
interpreted as control characters according to ISO-8859-1, and this 
would make the document invalid, since those control characters are 
disallowed in HTML 4.01. Thus, whatever browsers do with the document 
then is error processing, and nowadays probably all browsers have chosen 
to interpret them by windows-1252.

Admittedly, in XHTML syntax it’s different since those control 
characters are not forbidden but (mostly) “just” discouraged.

I think the simplest approach would be to declare U+0080...U+009F as 
forbidden in both serializations. Then the issue could be defined purely 
in terms of error handling. If you declare ISO-8859-1 and do not have 
bytes 0x80...0x9F, fine. If you do have such a byte, we should still 
treat the encoding declaration as conforming as such, but validators 
should report the characters as errors and browsers should handle this 
error by interpreting the document as if the declared encoding were 
windows-1252.
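
The handling proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of any existing validator or browser: the function names are hypothetical, and Python's "cp1252" codec is used as a stand-in for windows-1252. That codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch substitutes U+FFFD for them.

```python
# Hypothetical helper names; a sketch of the proposed error handling only.
C1_BYTES = range(0x80, 0xA0)  # bytes 0x80..0x9F inclusive


def find_c1_bytes(data):
    """Return the offsets of all bytes in the range 0x80..0x9F."""
    return [i for i, b in enumerate(data) if b in C1_BYTES]


def decode_declared_latin1(data):
    """Decode a document declared as ISO-8859-1.

    Returns (text, offsets). If offsets is empty, the declaration is
    unproblematic. Otherwise the offsets are what a validator could report
    as data errors, and the text is decoded as if windows-1252 had been
    declared.
    """
    offsets = find_c1_bytes(data)
    if not offsets:
        return data.decode("iso-8859-1"), offsets
    # Assumption: Python's cp1252 codec has no mapping for five bytes in
    # this range, so errors="replace" turns those into U+FFFD.
    return data.decode("cp1252", errors="replace"), offsets
```

For example, `decode_declared_latin1(b"caf\xe9")` yields the text with no reported offsets, whereas `decode_declared_latin1(b"5 \x80")` yields the text with a euro sign and reports offset 2 as the offending byte.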

> It seems bad, and maybe rather full of hubris, to make it conforming to
> use a label that we know will be interpreted in a manner that is a willful
> violation of its spec (that is, the ISO spec).

In most cases, there is no violation of the ISO standard. Or, to put it 
in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully 
compatible with the ISO 8859-1 standard as long as the document does not 
contain data that would be interpreted by ISO 8859-1 as C1 Controls 
(U+0080...U+009F), which it should not contain.
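
The compatibility claim is easy to check mechanically. The following sketch compares the two decodings byte by byte, again assuming Python's "iso-8859-1" and "cp1252" codecs as stand-ins for the two encodings; the function name is hypothetical.

```python
def encodings_agree_outside_c1():
    """Check that every byte outside 0x80..0x9F decodes to the same
    character under ISO-8859-1 and under windows-1252 (Python's cp1252)."""
    for b in range(256):
        if 0x80 <= b <= 0x9F:
            continue  # the only range where the two encodings differ
        raw = bytes([b])
        if raw.decode("iso-8859-1") != raw.decode("cp1252"):
            return False
    return True
```

Inside the excluded range the two differ: byte 0x80 decodes to the control character U+0080 under ISO-8859-1 but to the euro sign under windows-1252.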

> I would rather go back to having the conflicts be caught by validators
> than just throw the ISO spec under the bus, but it's really up to you
> (Henri, and whoever else is implementing a validator).

Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has 
done for years, and remains happy, until he tries to validate his page 
as HTML5. Is it useful that he gets an error message (and gets 
confused), even though his data is all ISO-8859-1 (without C1 Controls)? 
Suppose then that he accidentally enters, say, the euro sign “€” because 
his text editor or other authoring tool lets him do so – and stores it as 
windows-1252 encoded data. Even then, no practical problem arises, due to the 
common error handling behavior, but at this point, it might be useful to 
give some diagnostic if the document is being validated.

I would say that even then a warning about the problem would be 
sufficient, but it could be treated as an error – as a data error, with 
defined error handling. The occurrences of the offending bytes should be 
reported (which is what now happens when validating as HTML 4.01, even 
though the error messages are cryptic, like “non SGML character number 
128”). The author might then decide to declare the encoding as windows-1252.

But even though the most common cause of such a situation is an attempt 
to use (mostly due to ignorance) certain characters without realizing 
that they do not exist in ISO-8859-1, it might be a symptom of some 
different problem, like malformed data unintentionally appearing in a 
document. It is thus useful to draw the author’s attention to the specific 
problem, the incorrect data at the point where it appears, rather than blindly 
taking ISO-8859-1 as windows-1252.

Yucca
Received on Tuesday, 2 July 2013 07:05:41 UTC