Re: [HTML5] 2.8 Character encodings from Bil Corry on 2009-07-31 (public-html-comments@w3.org from July 2009)

From: Bil Corry <bil@corry.biz>
Date: Fri, 31 Jul 2009 11:47:51 -0500
To: "Dr. Olaf Hoffmann" <Dr.O.Hoffmann@gmx.de>
CC: public-html-comments@w3.org
Message-ID: <4A732037.1050200@corry.biz>
Just so I'm clear: the problem you are describing is that HTML5 requires browsers to interpret ISO-8859-1 as Windows-1252 [1]. Traditionally, authors who encoded their pages as Windows-1252 but declared ISO-8859-1 as the character set got feedback about the mismatch, because the browser would fail to render the correct glyph. Because HTML5 removes this feedback (the correct glyph is shown despite the charset mismatch), the error goes unnoticed by the author and the page remains misidentified as ISO-8859-1 when it's really Windows-1252.  Is that correct?
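
To make the mismatch concrete, here is a minimal Python sketch (not browser code; the sample string and variable names are my own, purely for illustration). It encodes a Euro sign as Windows-1252 and then decodes the same bytes under each of the two labels:

```python
# Illustrative only: the sample text and names are assumptions, not from any spec.
# An author saves a page as Windows-1252; the Euro sign becomes the single byte 0x80.
data = "Preis: 5 €".encode("cp1252")

# A strict ISO-8859-1 decoder maps byte 0x80 to U+0080, an invisible C1 control
# character, so the author would not see a Euro sign.
strict = data.decode("iso-8859-1")

# A decoder that treats the same bytes as Windows-1252 recovers the Euro sign.
lenient = data.decode("cp1252")

print(repr(strict))
print(repr(lenient))
```

Under the strict reading the last character is a control code with no glyph, which is the kind of visible breakage that used to tell the author something was wrong.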

It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer 7 & 8 all render content that is encoded as Windows-1252 but identified as ISO-8859-1 as if it were Windows-1252, which is exactly the behavior HTML5 specifies.  It's possible that older versions of FF/IE and other browsers behave the same way.  So the lack of feedback already exists, which may explain the preponderance of pages misidentified as ISO-8859-1.

Given the above, I'd argue that HTML5 is more compatible with existing behavior than it would be if it required strict rendering of ISO-8859-1.
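
As a rough sketch of what that rule amounts to, assuming a simple lookup on the declared label: the table below holds only the ISO-8859-1 entry discussed in this thread (the table at [1] lists others), and the function names are hypothetical, not taken from the spec or from any browser.

```python
# Hypothetical sketch of a "misinterpreted for compatibility" lookup.
# Only the entry discussed in this thread is included; see [1] for the full table.
COMPAT_OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(declared_label: str) -> str:
    """Return the encoding actually used for a declared charset label."""
    normalized = declared_label.strip().lower()
    return COMPAT_OVERRIDES.get(normalized, normalized)


def decode_like_html5(data: bytes, declared_label: str) -> str:
    """Decode bytes with the overridden encoding, replacing undecodable bytes."""
    return data.decode(effective_encoding(declared_label), errors="replace")


# A page declared as ISO-8859-1 that contains the Windows-1252 Euro byte.
print(decode_like_html5(b"5 \x80", "ISO-8859-1"))
```

The override applies to the declared label, wherever that label came from; other labels pass through unchanged.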

If you are curious, you can test your browser to see how it renders UTF-8 and Windows-1252 byte streams given a particular encoding:

 http://www.corry.biz/charset_mismatch.lasso

For fun, try MacRoman encoding in Internet Explorer; you'll see it is rendered as Windows-1252.
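
If you want to see what that looks like without a browser, this small Python sketch reproduces the effect using Python's own `mac_roman` and `cp1252` codecs (the sample word is my choice, and this only imitates the decoding step, not Internet Explorer itself):

```python
# Illustrative only: imitates decoding MacRoman bytes as Windows-1252.
original = "Grüße"

# Bytes as a MacRoman-encoded page would contain them.
mac_bytes = original.encode("mac_roman")

# The same bytes read as Windows-1252; the non-ASCII letters come out wrong.
misread = mac_bytes.decode("cp1252")

print(mac_bytes)
print(misread)
```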


- Bil

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#misinterpreted-for-compatibility



Dr. Olaf Hoffmann wrote on 7/31/2009 5:53 AM: 
> Hello,
> 
> within the last ten or more years I have already a lot
> of experience especially with German authors (they 
> typically want to use Umlaute and the ß-ligature and
> sometimes the Euro-sign) and their problems.
> Every month I have to explain again, how to 
> distinguish between UTF-8 and ISO-8859-1(5) and
> that ISO-8859-1 has no BOM - respectively that
> this is displayed as visible characters in some
> browser versions etc.
> One has to explain, that the indication of the
> server (stupid or correct) is more relevant than
> indications within the document and how to
> get it correct with PHP or with the .htaccess file
> of the Apache web-server and that one cannot
> change the encoding within one document.
> 
> And of course, I still have in mind the behaviour
> of the browsers I used years ago, when the
> Euro was introduced in Europe and some authors
> tried to use the Euro-sign without masking
> within ISO-8859-1. There was a clear indication,
> that this does not work in these legacy browsers,
> therefore it is not true, that the interpretation
> of 'ISO-8859-1' as 'Windows-1252' is compatible
> with older browsers - they/some had no bug and 
> indicated the not representable character for
> example with a question mark, a box etc.
> And indeed, it was simple to explain, that the
> author either has to use another encoding or
> has to mask the character to fix his/her bug.
> This simple approach is maybe corrupted
> now with bugs in current versions of browsers.
> 
> 
> 'HTML5' seems to introduce a new rule how to
> identify the encoding. The encoding problem 
> is obviously already hardly understandable
> for many authors. Suddenly this new rule with
> some opaque method to identify 'HTML5' 
> documents complicates the situation even more
> and makes it much harder to explain, what to
> do to get a well defined document or script
> output and how to fix bugs.
> 
> Therefore the main questions remain open up to
> here:
> 
> 1. How to indicate the 'ISO-8859-1' encoding
> within an 'HTML5' document and not
> 'Windows-1252', if an author wants to specify
> 'ISO-8859-1' and nothing else?
> 
> 2. How does a proper viewer/browser identify, 
> that a document is 'HTML5' and that this
> specific rule has to be applied, if 'ISO-8859-1'
> is indicated?
> 
> 3. At which point the encoding information
> switches from the information given by the
> server or the XML processing instruction
> to the specific rule of 'HTML5' to interpret
> the string  'ISO-8859-1' as indication for
> 'Windows-1252'?
> 
> Indeed, up to here, this is all about encoding
> information, not how a document is decoded
> by the viewer (buggy or not).
> 
> 
> Olaf
>
Received on Friday, 31 July 2009 16:48:40 UTC