- From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
- Date: Fri, 31 Jul 2009 20:10:16 +0200
- To: public-html-comments@w3.org
Bil Corry:
> Just so I'm clear, the problem you are defining is that HTML5 requires
> browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally
> authors that used Windows-1252 encoding, but specified ISO-8859-1 for the
> character set, were given feedback of the mismatch because the browser
> would fail to render the correct glyph. And because this feedback has been
> removed in HTML5 (the correct glyph is shown despite the charset mismatch),
> it goes unnoticed by the author and the page remains misidentified as
> ISO-8859-1 when it's really Windows-1252. Is that correct?
>

With the still open questions, I am mainly trying to find out whether it is possible to specify that an 'HTML5' document has an encoding like 'ISO-8859-1' and not 'Windows-1252'. As far as I understand the specification, this is not possible, but then the other questions become interesting, because typically what matters is what the server indicates, not what is mentioned explicitly or implicitly in the document.

If 'HTML5' changes the relevance of the encoding information (which is not necessarily bad), I of course want to know how a viewer/browser identifies a document as 'HTML5', because for other formats or versions this behaviour is simply a bug and not a feature.

This is more a problem of the missing version indication. For example, the newest XHTML variant has the attribute version="XHTML+RDFa 1.0"; the formal identification problem would be solved for 'HTML5' with, for example, version="HTML5". Then the document is at least well defined and specific rules are applicable, if the specification is known to the decoder. A proper browser of course has to search for such a version indication to know which encoding information applies before the document is presented. Even if current browsers do it differently (wrongly), that is no reason to have no rule defining a correct way.
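To make the mismatch concrete, here is a minimal Python sketch of how the same bytes decode under the two labels. It is only an illustration, not anything taken from the HTML5 draft: the function name decode_both and the sample bytes are made up for this example, and Python's own codecs are assumed to stand in for a browser's decoder.

```python
def decode_both(raw):
    """Decode the same bytes once as ISO-8859-1 and once as Windows-1252.

    Hypothetical helper for illustration only. Note that Python's
    windows-1252 codec raises an error on the five unassigned bytes
    (0x81, 0x8D, 0x8F, 0x90, 0x9D), so the samples below avoid them.
    """
    return raw.decode("iso-8859-1"), raw.decode("windows-1252")


# Bytes 0xA0-0xFF mean the same thing in both encodings.
same_latin1, same_cp1252 = decode_both(b"\xe4\xf6\xfc")
print(same_latin1 == same_cp1252)

# Bytes 0x80-0x9F are where the two differ: ISO-8859-1 maps them to C1
# control characters, Windows-1252 maps most of them to printable
# characters such as curly quotes and the euro sign.
latin1, cp1252 = decode_both(b"\x93quoted\x94 \x80")
print([hex(ord(c)) for c in latin1 if ord(c) > 0x7F])
print([hex(ord(c)) for c in cp1252 if ord(c) > 0x7F])
```

A decoder that honours the ISO-8859-1 label yields invisible control characters for the second sample, which is the feedback an author used to get; a decoder that silently treats the label as Windows-1252 yields the quotes and the euro sign.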
> It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer
> 7 & 8 all render content encoded as Windows-1252 but identified as
> ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5. And it's
> possible older versions of FF/IE and other browsers also have this
> behavior. So the lack of feedback already exists, which may explain the
> preponderance of pages misidentified as ISO-8859-1.
>
> Given the above, I'd argue that HTML5 is more compatible with existing
> behavior than it would be if it required strict rendering of ISO-8859-1.
>
> If you are curious, you can test your browser to see how it renders UTF-8
> and Windows-1252 byte streams given a particular encoding:
>
> http://www.corry.biz/charset_mismatch.lasso
>
> For fun, try MacRoman encoding in Internet Explorer, you'll see it is
> rendered as Windows-1252.
>

As already mentioned, years ago there were browsers/versions without this bug, and for typical 'ISO-8859-1' content a slightly wrong decoding is not really a problem, as already discussed. A practical problem currently only appears if a document has the wrong encoding information and a legacy browser without the bug is used to present it. Because not all legacy browsers have this bug, it misinforms authors to make them believe that everything is solved for buggy documents because all browsers compensate for this bug with yet another browser bug.

However, authors of documents with wrong encoding information are at fault anyway; this is not really a problem of a specification. It is just more difficult to teach them to write proper documents. Indifferent or ignorant authors are not necessarily a problem for a specification; they are mainly a problem for the general audience.

The questions are more about the problem of how to indicate that an 'HTML5' document really has the encoding ISO-8859-1. This can be important for long-lived documents and archival storage.
In 50 or 100 or 1000 years, one cannot rely on the behaviour of browsers of the year 2009, but it might still be possible to decode well-defined documents with completely different programs. To simplify this, one should have simple and intuitive indications, and not such a blunder as writing 'ISO-8859-1' when you mean 'Windows-1252'.

With the current draft, one can only recommend 'HTML5'+UTF-8 or another format/version like XHTML+RDFa for long-lived documents and archival storage (which is not necessarily bad either, just something interesting to know for some people).

Olaf
Received on Friday, 31 July 2009 18:19:43 UTC