Re: [HTML5] 2.8 Character encodings

Bil Corry:
> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> > With the still open questions I mainly try to find out whether it is
> > possible to specify, that a 'HTML5' document has an encoding like
> > 'ISO-8859-1' and not 'Windows-1252'.
>
> You do it the same way as you would for any character set, by specifying
> the content encoding as ISO-8859-1.  Typically this is done via the
> Content-Type header:
>
>  Content-Type: text/html; charset=ISO-8859-1
>
> That header means, "This HTML document is in the ISO-8859-1 character set."
>  By inference, it also means that it isn't Windows-1252, or UTF-8, etc.

This I know, and it is true for other formats, but not for 'HTML5':
the current draft of 'HTML5' has a specific rule that this means
'Windows-1252' and not 'ISO-8859-1' - and this seems to supersede what
the server indicates, if a viewer is able to identify the document as
'HTML5'.
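
To illustrate how small the difference is, here is a sketch in Python
(my own demonstration, not from the draft): it compares how the two
encodings decode each byte. They agree everywhere except in the range
0x80-0x9F, where ISO-8859-1 has control characters and Windows-1252
has printable glyphs (or no assignment at all):

  # Compare ISO-8859-1 and Windows-1252 byte by byte.
  for b in range(256):
      raw = bytes([b])
      latin1 = raw.decode("iso-8859-1")  # every byte is defined
      try:
          cp1252 = raw.decode("windows-1252")
      except UnicodeDecodeError:
          # 0x81, 0x8d, 0x8f, 0x90, 0x9d have no assignment
          print("0x%02x: undefined in Windows-1252" % b)
          continue
      if latin1 != cp1252:
          print("0x%02x: U+%04X vs U+%04X"
                % (b, ord(latin1), ord(cp1252)))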

>
> > As far as I understand the specification, this is not possible
>
> I think what you mean is it isn't possible to force the UA to use the
> ISO-8859-1 charset when specified and you're right.  

It is never possible for an author to force an arbitrary viewer to do
something specific. A careful author can only write well-defined and
valid documents. An author can also test them with some currently
known viewers, but the results of these tests are not necessarily
representative of other viewers that are not accessible to the author
(for example because they are not published yet).



> As I mentioned in my 
> previous email, IE will display Windows-1252 when MacRoman is specified
> which is clearly wrong -- the glyphs don't even come close to matching each
> other.  As an author you have to work around that.  And as an author you
> must ensure the encoding is correct to get consistent results.
>

Maybe this browser does not know this encoding; in that case, however,
the user should be warned that something stupid might happen. Such a
warning is clearly the responsibility of the implementor, not of the
author. Of course, the author can avoid such problems by using more
common encodings like UTF-8 or ISO-8859-1.
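
How far apart MacRoman and Windows-1252 are can be seen with a small
Python sketch (just a demonstration, unrelated to any browser
internals):

  data = b"\xd2Hello\xd3"            # curly quotes in MacRoman
  print(data.decode("mac_roman"))    # "Hello" with typographic quotes
  print(data.decode("windows-1252")) # ÒHelloÓ - unrelated glyphs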

> > but
> > then the other questions become interesting, because typically
> > it is relevant what the server indicates, not what is mentioned
> > explictely or implicitely in the document.
>
> I haven't ever tested what happens when the content-type header doesn't
> match the meta, but considering it's incorrect, an author can hardly expect
> positive results.  I wonder if HTML5 specifies the behavior in this case?
>

I think 'HTML5' only specifies a specific rule for these few strings:
'Windows-1252' is used instead of them. Only if the author provides
one of these strings is that indication superseded, independently of
how the encoding was indicated. If the server sends UTF-8 in the
content-type header, the specific 'HTML5' rule does not apply and
everything is interpreted as UTF-8, no matter what is mentioned
within the document.
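
As far as I understand it, the resulting rule could be sketched like
this (a hypothetical helper of my own; the complete list of affected
strings is given in the draft itself):

  # Hypothetical sketch of the 'HTML5' override rule as I understand
  # it; only the declared label decides whether the rule applies.
  OVERRIDES = {"iso-8859-1": "windows-1252"}

  def effective_encoding(declared_charset):
      label = declared_charset.strip().lower()
      return OVERRIDES.get(label, label)

  print(effective_encoding("ISO-8859-1"))  # windows-1252, rule applies
  print(effective_encoding("UTF-8"))       # utf-8, rule does not apply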


> > As already mentioned, years ago there were browsers/versions without this
> > bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really
> > a problem, as already discussed. A practical problem currently only
> > appears, if a document has the wrong encoding information and a legacy
> > browser without the bug is used to present it. Because not all legacy
> > browsers have this bug, it is a misinformation of authors to make them
> > believe, that everything is solved for buggy documents, because all
> > browser
> > compensate this bug with yet another browser bug. However, authors
> > of documents with wrong encoding informations are guilty anyway,
> > this is not really a problem of a specification. It is just more
> > difficult to teach them to write proper documents. Indifferent or
> > ignorant
> > authors are not necessarily a problem for a specification, they are
> > a problem mainly for the general audience.
>
> This also describes the issue with browsers doing a best-guess with
> rendering HTML content that is malformed.  The solution to both malformed
> HTML and misidentified charsets is to run the page through validator.w3.org
> -- both get flagged if wrong.  If you want to see, try this:
>
>  http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1
>
> It (correctly) returns the error:
>
>  Using windows-1252 instead of the declared encoding iso-8859-1.
>  Line 22, Column 87: Unmappable byte sequence: 9d.
>
> So if your authors care about checking their markup with validator.w3.org,
> they will also have their charset checked as well.
>

Well, the validator has some bugs too. If no encoding is specified for
text/*, ISO-8859-1 should be used (according to HTTP; see the sketch
below). I already noticed some months ago that the validator reported
wrong errors for XHTML+RDFa instead of indicating that this cannot be
validated yet (XHTML+RDFa works without an optional doctype within the
document; the version indication is done with the version attribute,
not a doctype). I noted errors for SVG documents too. The validator is
a pretty good help in many cases (less useful for SVG, because it
checks almost no attribute values), but an author still has to be
careful with the results...
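
The HTTP rule (RFC 2616, section 3.7.1) can be sketched like this (a
hypothetical helper of my own, just to illustrate the default):

  # For text/* media types without a charset parameter, HTTP defaults
  # to ISO-8859-1 (RFC 2616, section 3.7.1).
  def charset_for(content_type):
      media_type, _, params = content_type.partition(";")
      for param in params.split(";"):
          name, _, value = param.strip().partition("=")
          if name.strip().lower() == "charset" and value:
              return value.strip().strip('"').lower()
      if media_type.strip().lower().startswith("text/"):
          return "iso-8859-1"
      return None

  print(charset_for("text/html"))                 # iso-8859-1
  print(charset_for("text/html; charset=UTF-8"))  # utf-8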


> > The questions are more about the problem, how to indicate,
> > that a ' HTML5' document really has the encoding ISO-8859-1.
> > This can be important for long living documents and archival storage.
> > Because in 50 or 100 or 1000 years one cannot rely on the behaviour
> > of browsers of the year 2009, but it might be still possible to decode
> > well defined documents with completely different programs.
> > To simplify this, one should have simple and intuitive indications
> > and not such a bloomer like to write 'ISO-8859-1' if you mean
> > 'Windows-1252'.
>
> A 1000 years from now, if they do what HTML5 does now and use Windows-1252
> when ISO-8859-1 is specified, they'll be guaranteed to correctly view the
> document (assuming it's in either ISO-8859-1 or Windows-1252).  The same
> can not be said for viewing Windows-1252 as ISO-8859-1.
>
> > With the current draft, one can only recommend 'HTML5'+UTF-8
> > or another format/version like XHTML+RDFa for long living
> > documents and archival storage (what is not necessarily bad too,
> > just something interesting to know for some people).
>
> UTF-8 isn't free from issues either -- I've seen Windows-1252 served as
> UTF-8 which produces illegal byte sequences.  Or here's an example where
> the page (Windows-1252) doesn't specify a charset at all; in Firefox it's
> rendered as UTF-8 with broken bytes and IE it's rendered with the correct
> charset of Windows-1252:
>
>  http://cspinet.org/new/200907301.html
>
> Which browser do you think they test their site with?  Which browser do you
> think the end user thinks is broken?
>

The Gecko I use indicates ISO-8859-1, as does Konqueror - no UTF-8 or
Windows-1252; Opera notes a problem (unsupported) and notes too that
Windows-1252 is used.
Because this tag soup is served as text/html without encoding
information, I think Gecko and Konqueror are perfectly correct with
ISO-8859-1.
There is no indication that this might be 'HTML5', therefore no
specific rule from the 'HTML5' draft needs to be applied.
The document does not indicate at all which version of HTML it uses;
looking into the source code, I think it uses a proprietary private
slang of the author (even the comments are written incorrectly ;o)
It cannot be expected from a program that any version information can
be derived from this tag soup.
The presentation cannot be wrong, because it is undefined (a viewer
may use any set of rules to interpret this or reject it as nonsense;
a viewer can use 'HTML5' to try to interpret this, but can use any
other set of rules as well, because the author does not indicate any
version information at all).
The encoding, however, seems to be defined (only) by the HTTP
protocol, not by the document itself.

Of course, ISO-8859-1 (and Windows-1252) are only compatible with
UTF-8 for a basic set of characters (US-ASCII). This can be seen more
easily with documents not in English but in other languages like
German, French, or Spanish: these are pretty well covered by
ISO-8859-1, but several important characters are encoded differently
in ISO-8859-1 and UTF-8. Because there is no indication that this
might be XML, there is no need to assume that UTF-8 is a proper
choice for decoding.
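
This is easy to demonstrate with a German word (again a Python sketch
of my own, not specific to any browser):

  text = "Grüße"                    # contains non-ASCII characters
  print(text.encode("iso-8859-1"))  # b'Gr\xfc\xdfe' - one byte each
  print(text.encode("utf-8"))       # b'Gr\xc3\xbc\xc3\x9fe' - two bytes
  # Decoding the ISO-8859-1 bytes as UTF-8 destroys the umlauts:
  print(text.encode("iso-8859-1").decode("utf-8", "replace"))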

Received on Saturday, 1 August 2009 15:36:26 UTC