Re: [HTML5] 2.8 Character encodings from Bil Corry on 2009-07-31 (public-html-comments@w3.org from July 2009)

From: Bil Corry <bil@corry.biz>
Date: Fri, 31 Jul 2009 15:08:12 -0500
To: "Dr. Olaf Hoffmann" <Dr.O.Hoffmann@gmx.de>
CC: public-html-comments@w3.org
Message-ID: <4A734F2C.2000303@corry.biz>

Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> With the still open questions I mainly try to find out whether it is 
> possible to specify, that a 'HTML5' document has an encoding like 
> 'ISO-8859-1' and not 'Windows-1252'.

You do it the same way as you would for any character set, by specifying the character encoding as ISO-8859-1.  Typically this is done via the charset parameter of the Content-Type header:

 Content-Type: text/html; charset=ISO-8859-1

That header means, "This HTML document is in the ISO-8859-1 character set."  By inference, it also means that it isn't Windows-1252, or UTF-8, etc.

 
> As far as I understand the specification, this is not possible

I think what you mean is that it isn't possible to force the UA to use the ISO-8859-1 charset when it is specified, and you're right.  As I mentioned in my previous email, IE will decode the page as Windows-1252 when MacRoman is specified, which is clearly wrong -- the glyphs don't even come close to matching each other.  As an author you have to work around that.  And as an author you must ensure the declared encoding is correct to get consistent results.
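
To make the MacRoman case concrete, here is a minimal Python sketch.  It assumes Python's built-in "mac_roman" and "cp1252" codecs are a fair stand-in for what a UA does; the helper name reinterpret() is made up for this illustration.

```python
# Sketch: the same bytes read under two different single-byte encodings.
# "mac_roman" and "cp1252" are Python's codec names for MacRoman and
# Windows-1252. reinterpret() is a hypothetical helper, not a browser API.

def reinterpret(text: str, actual: str, assumed: str) -> str:
    """Encode text in its actual encoding, then decode it the way a UA
    that assumes a different encoding would."""
    return text.encode(actual).decode(assumed, errors="replace")

# A page authored in MacRoman but decoded as Windows-1252:
shown = reinterpret("café", actual="mac_roman", assumed="cp1252")
print(shown)  # the accented letter comes out as a different character
```

In MacRoman the byte 0x8E is "é"; in Windows-1252 the same byte is "Ž", so the text is silently wrong rather than rejected.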


> but
> then the other questions become interesting, because typically
> it is relevant what the server indicates, not what is mentioned
> explictely or implicitely in the document.

I haven't ever tested what happens when the Content-Type header doesn't match the charset given in the meta element, but considering that such a document is incorrect, an author can hardly expect positive results.  I wonder if HTML5 specifies the behavior in this case?
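
An author can at least detect such a mismatch before publishing.  The following Python sketch compares the charset in a Content-Type header value with the one declared in a meta element.  The function names and the regular expression are assumptions made for illustration; this is not the detection algorithm of any specification or browser.

```python
import re
from email.message import Message
from typing import Optional

def header_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# Rough pattern covering <meta charset=...> and the http-equiv form.
# A simplification for illustration, not a real HTML parser.
META_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)

def meta_charset(html: bytes) -> Optional[str]:
    """Return the lowercased charset declared in a meta element, if any."""
    match = META_RE.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def charsets_disagree(content_type: str, html: bytes) -> bool:
    """True when both declarations exist and name different labels.
    Labels are compared literally, so aliases count as a mismatch."""
    from_header = header_charset(content_type)
    from_meta = meta_charset(html)
    return None not in (from_header, from_meta) and from_header != from_meta
```

For example, charsets_disagree("text/html; charset=ISO-8859-1", b'<meta charset="windows-1252">') returns True.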


> As already mentioned, years ago there were browsers/versions without this
> bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really a
> problem, as already discussed. A practical problem currently only appears,
> if a document has the wrong encoding information and a legacy browser
> without the bug is used to present it. Because not all legacy browsers
> have this bug, it is a misinformation of authors to make them believe, that
> everything is solved for buggy documents, because all browser
> compensate this bug with yet another browser bug. However, authors
> of documents with wrong encoding informations are guilty anyway,
> this is not really a problem of a specification. It is just more difficult to
> teach them to write proper documents. Indifferent or ignorant 
> authors are not necessarily a problem for a specification, they are
> a problem mainly for the general audience.

This also describes the issue with browsers making a best guess when rendering malformed HTML.  The solution to both malformed HTML and misidentified charsets is to run the page through validator.w3.org -- both get flagged if wrong.  If you want to see, try this:

 http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1

It (correctly) returns the error:

 Using windows-1252 instead of the declared encoding iso-8859-1.
 Line 22, Column 87: Unmappable byte sequence: 9d.

So if your authors care about checking their markup with validator.w3.org, they will have their charset checked as well.
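
A similar check can be sketched locally in Python.  This is not how the validator is implemented; it assumes Python's "cp1252" codec, which happens to leave byte 0x9d unassigned, and the function name find_unmappable() is made up for the example.

```python
from typing import List, Tuple

def find_unmappable(data: bytes, encoding: str) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for every byte that the given
    single-byte encoding cannot decode. Only meaningful for single-byte
    encodings, because each byte is decoded on its own."""
    bad = []
    for offset, value in enumerate(data):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append((offset, value))
    return bad

sample = b"ab\x9dc"
for offset, value in find_unmappable(sample, "cp1252"):
    print(f"offset {offset}: unmappable byte {value:02x}")
```

The same bytes pass when checked against "latin-1", because Python's ISO-8859-1 codec maps every byte value, including the C1 control range.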


> The questions are more about the problem, how to indicate, 
> that a ' HTML5' document really has the encoding ISO-8859-1. 
> This can be important for long living documents and archival storage. 
> Because in 50 or 100 or 1000 years one cannot rely on the behaviour 
> of browsers of the year 2009, but it might be still possible to decode 
> well defined documents with completely different programs. 
> To simplify this, one should have simple and intuitive indications
> and not such a bloomer like to write 'ISO-8859-1' if you mean 
> 'Windows-1252'.

A thousand years from now, if they do what HTML5 does now and use Windows-1252 when ISO-8859-1 is specified, they'll be guaranteed to correctly view the document (assuming it's in either ISO-8859-1 or Windows-1252).  The same cannot be said for viewing Windows-1252 as ISO-8859-1.
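
The asymmetry can be checked with a short Python sketch, assuming Python's "latin-1" and "cp1252" codecs represent the two encodings.  The helper decode_byte() is hypothetical.

```python
from typing import Optional

def decode_byte(value: int, encoding: str) -> Optional[str]:
    """Decode one byte, or return None if the encoding leaves it unassigned."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None

# Byte values on which the two encodings disagree.
differing = [
    b for b in range(256)
    if decode_byte(b, "latin-1") != decode_byte(b, "cp1252")
]
print(f"{differing[0]:#04x}..{differing[-1]:#04x}")

# ISO-8859-1 text that avoids the C1 control range survives a
# Windows-1252 decode unchanged...
assert "naïve".encode("latin-1").decode("cp1252") == "naïve"
# ...but Windows-1252 curly quotes turn into control characters
# when decoded as ISO-8859-1.
assert "“quoted”".encode("cp1252").decode("latin-1") == "\x93quoted\x94"
```

The two codecs disagree only on bytes 0x80 through 0x9F, where ISO-8859-1 has control characters and Windows-1252 has printable ones.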


> With the current draft, one can only recommend 'HTML5'+UTF-8
> or another format/version like XHTML+RDFa for long living 
> documents and archival storage (what is not necessarily bad too, 
> just something interesting to know for some people).

UTF-8 isn't free from issues either -- I've seen Windows-1252 served as UTF-8, which produces illegal byte sequences.  Or here's an example where the page (Windows-1252) doesn't specify a charset at all; in Firefox it's rendered as UTF-8 with broken bytes and in IE it's rendered with the correct charset of Windows-1252:

 http://cspinet.org/new/200907301.html

Which browser do you think they test their site with?  Which browser do you think the end user thinks is broken?
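
The illegal byte sequences are easy to reproduce.  Here is a minimal Python sketch using an invented two-letter sample rather than the page above; is_valid_utf8() is a hypothetical helper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Curly quotes are single bytes 0x93 and 0x94 in Windows-1252.
page = "“hi”".encode("cp1252")

print(is_valid_utf8(page))
# A UA that decodes leniently shows replacement characters instead.
print(page.decode("utf-8", errors="replace"))
```

Pure ASCII content is byte-identical in both encodings, so a page like this only breaks once a non-ASCII character shows up.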


- Bil
Received on Friday, 31 July 2009 20:08:55 UTC