Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-07-31 (public-html-comments@w3.org from July 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Fri, 31 Jul 2009 20:10:16 +0200
To: public-html-comments@w3.org
Message-Id: <200907312010.16897.Dr.O.Hoffmann@gmx.de>
Bil Corry:
> Just so I'm clear, the problem you are defining is that HTML5 requires
> browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally
> authors that used Windows-1252 encoding, but specified ISO-8859-1 for the
> character set, were given feedback of the mismatch because the browser
> would fail to render the correct glyph.  And because this feedback has been
> removed in HTML5 (the correct glyph is shown despite the charset mismatch),
> it goes unnoticed by the author and the page remains misidentified as
> ISO-8859-1 when it's really Windows-1252.  Is that correct?
>

With the questions that are still open, I am mainly trying to find out
whether it is possible to specify that an 'HTML5' document has an
encoding like 'ISO-8859-1' and not 'Windows-1252'.

As far as I understand the specification, this is not possible, but
then the other questions become interesting, because typically
what matters is what the server indicates, not what is stated
explicitly or implicitly in the document.
And if 'HTML5' changes the relevance of the encoding information
(which is not necessarily bad), I of course want to know how a
viewer/browser identifies a document as 'HTML5', because for other
formats or versions this behaviour is simply a bug and not a
feature. This is more a problem of the missing version indication.
For example, the newest XHTML variant has the attribute
version="XHTML+RDFa 1.0". The formal identification problem would
be solved for 'HTML5' with, for example, version="HTML5"; then the
document is at least well defined and specific rules are
applicable, if the specification is known to the decoder.
A proper browser of course has to search for such a version
indication to know which encoding information applies, before
the document is presented.
Even if current browsers do it differently (wrongly), that is no
reason not to have a rule that defines a correct way.


> It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer
> 7 & 8 all render content encoded as Windows-1252 but identified as
> ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5.  And it's
> possible older versions of FF/IE and other browsers also have this
> behavior.  So the lack of feedback already exists, which may explain the
> preponderance of pages misidentified as ISO-8859-1.
>
> Given the above, I'd argue that HTML5 is more compatible with existing
> behavior than it would be if it required strict rendering of ISO-8859-1.
>
> If you are curious, you can test your browser to see how it renders UTF-8
> and Windows-1252 byte streams given a particular encoding:
>
>  http://www.corry.biz/charset_mismatch.lasso
>
> For fun, try MacRoman encoding in Internet Explorer, you'll see it is
> rendered as Windows-1252.
>
>

As already mentioned, years ago there were browsers/versions without this
bug, and for typical 'ISO-8859-1' content a slightly wrong decoding is not
really a problem, as already discussed. Currently a practical problem only
appears if a document has the wrong encoding information and a legacy
browser without the bug is used to present it. Because not all legacy
browsers have this bug, it misleads authors to make them believe that
everything is solved for buggy documents because all browsers
compensate for the document bug with yet another browser bug. However,
authors of documents with wrong encoding information are at fault anyway;
this is not really a problem of a specification. It is just more difficult
to teach them to write proper documents. Indifferent or ignorant
authors are not necessarily a problem for a specification; they are
mainly a problem for the general audience.
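
To make the difference concrete, here is a minimal Python sketch of the
two decodings. The sample bytes are made up for illustration; the codec
names "iso-8859-1" and "cp1252" are the ones shipped with Python. A
decoder that takes the 'ISO-8859-1' label literally yields C1 control
characters where a Windows-1252 decoder yields curly quotes and the
euro sign:

```python
# Hypothetical sample: bytes 0x93, 0x94 and 0x80 lie in the range 0x80-0x9F,
# the only range where ISO-8859-1 and Windows-1252 disagree.
data = b"\x93quoted\x94 price: 5 \x80"

# Literal reading of the label: the bytes become C1 control characters.
as_latin1 = data.decode("iso-8859-1")

# Reading the same bytes as Windows-1252: printable punctuation and symbols.
as_cp1252 = data.decode("cp1252")

print(repr(as_latin1))  # '\x93quoted\x94 price: 5 \x80'
print(as_cp1252)        # “quoted” price: 5 €
```

Outside the range 0x80-0x9F both decodings give the same characters,
which is why the mismatch goes unnoticed for most documents.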
 
The questions are more about the problem of how to indicate
that an 'HTML5' document really has the encoding ISO-8859-1.
This can be important for long-lived documents and archival storage,
because in 50 or 100 or 1000 years one cannot rely on the behaviour
of browsers of the year 2009, but it might still be possible to decode
well-defined documents with completely different programs.
To simplify this, one should have simple and intuitive indications,
and not a blunder such as writing 'ISO-8859-1' when you mean
'Windows-1252'.
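
For archival purposes, one possible check (only a sketch, with the
hypothetical helper name c1_byte_offsets) is to look for bytes in the
range 0x80-0x9F before labelling a file. If none occur, the file decodes
identically as ISO-8859-1 and as Windows-1252, so the label is
unambiguous whichever rule a future decoder follows:

```python
def c1_byte_offsets(data: bytes) -> list[int]:
    """Return the offsets of all bytes in the range 0x80-0x9F.

    These are the only bytes that ISO-8859-1 and Windows-1252 map to
    different characters; an empty result means both decodings agree.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical samples: plain Latin-1 text versus text with curly quotes
# written as Windows-1252 bytes.
print(c1_byte_offsets(b"caf\xe9"))      # []
print(c1_byte_offsets(b"\x93hi\x94"))   # [0, 3]
```

A non-empty result does not prove the author meant Windows-1252, it only
shows that the choice of decoder changes the result for this file.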


With the current draft, one can only recommend 'HTML5'+UTF-8
or another format/version like XHTML+RDFa for long-lived
documents and archival storage (which is not necessarily bad either,
just something interesting to know for some people).


Olaf
Received on Friday, 31 July 2009 18:19:43 UTC