Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-07-31 (public-html-comments@w3.org from July 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Fri, 31 Jul 2009 12:53:27 +0200
To: public-html-comments@w3.org
Message-Id: <200907311253.27745.Dr.O.Hoffmann@gmx.de>

Hello,

over the last ten or more years I have gained a lot
of experience, especially with German authors (they
typically want to use umlauts, the ß ligature and
sometimes the Euro sign) and their problems.
Every month I have to explain again how to
distinguish between UTF-8 and ISO-8859-1 (or
ISO-8859-15), and that ISO-8859-1 has no BOM, so
that a BOM is displayed as visible characters in
some browser versions, etc.
One has to explain that the encoding indicated by
the server (stupid or correct) takes precedence
over indications within the document, how to
get it right with PHP or with the .htaccess file
of the Apache web server, and that one cannot
change the encoding within a single document.
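
A minimal sketch of what I usually show such authors
(Python is used here only for illustration; the sample
word and the variable names are my own assumptions,
not taken from any real document): the same text gives
different bytes in UTF-8 and ISO-8859-1, and a UTF-8
BOM read as ISO-8859-1 turns into three visible
characters.

```python
import codecs

# Sample word with an umlaut and the ß ligature (illustrative only).
text = "Größe"

# The same characters produce different byte sequences.
utf8_bytes = text.encode("utf-8")         # two bytes each for ö and ß
latin1_bytes = text.encode("iso-8859-1")  # one byte each for ö and ß

# UTF-8 bytes wrongly decoded as ISO-8859-1 give the typical garbage.
mojibake = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 has no BOM: the three bytes of a UTF-8 BOM are simply
# three ordinary characters when decoded as ISO-8859-1.
bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

print(utf8_bytes, latin1_bytes)
print(repr(mojibake), repr(bom_as_latin1))
```

The point of the comparison is only that the bytes alone
do not say which encoding was meant; the declaration has
to match what was actually written to the file.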

And of course, I still remember the behaviour
of the browsers I used years ago, when the
Euro was introduced in Europe and some authors
tried to use the Euro sign without escaping it
within ISO-8859-1. It was clearly visible
that this does not work in these legacy browsers;
therefore it is not true that interpreting
'ISO-8859-1' as 'Windows-1252' is compatible
with older browsers. They (or at least some of
them) had no bug and indicated the unrepresentable
character, for example with a question mark, a box, etc.
And indeed, it was simple to explain that the
author either has to use another encoding or
has to escape the character to fix his/her bug.
This simple approach is perhaps undermined
now by bugs in current versions of browsers.
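
To make the Euro case concrete, here is a small sketch
(again Python, only as an illustration; the variable
names are hypothetical). It assumes nothing beyond the
standard codecs: the Euro sign has no code point in
ISO-8859-1, byte 0x80 means different things in
ISO-8859-1 and Windows-1252, and a numeric character
reference is one way to escape the character.

```python
euro = "\u20ac"  # the Euro sign

# ISO-8859-1 cannot represent the Euro sign at all.
try:
    euro.encode("iso-8859-1")
    representable_in_latin1 = True
except UnicodeEncodeError:
    representable_in_latin1 = False

# Windows-1252 and ISO-8859-15 each assign it a (different) byte.
cp1252_bytes = euro.encode("windows-1252")
latin9_bytes = euro.encode("iso-8859-15")

# Byte 0x80 is a C1 control character in ISO-8859-1,
# but the Euro sign in Windows-1252.
as_latin1 = b"\x80".decode("iso-8859-1")
as_cp1252 = b"\x80".decode("windows-1252")

# Escaping: stay in ISO-8859-1 and write a numeric character reference.
escaped = euro.encode("iso-8859-1", errors="xmlcharrefreplace")

print(representable_in_latin1, cp1252_bytes, latin9_bytes)
print(repr(as_latin1), repr(as_cp1252), escaped)
```

A strict ISO-8859-1 decoder therefore has nothing
printable to show for byte 0x80, which matches the
question mark or box I saw in the older browsers.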


'HTML5' seems to introduce a new rule for
identifying the encoding. The encoding problem
is obviously already hard to understand
for many authors. This new rule, together with
some opaque method to identify 'HTML5'
documents, complicates the situation even more
and makes it much harder to explain what to
do to get a well-defined document or script
output and how to fix bugs.

Therefore the main questions remain open so
far:

1. How can an author indicate the 'ISO-8859-1'
encoding within an 'HTML5' document, and not
'Windows-1252', if he/she wants to specify
'ISO-8859-1' and nothing else?

2. How does a proper viewer/browser identify
that a document is 'HTML5' and that this
specific rule has to be applied if 'ISO-8859-1'
is indicated?

3. At which point does the encoding information
switch from the information given by the
server or the XML processing instruction
to the specific rule of 'HTML5' to interpret
the string 'ISO-8859-1' as an indication of
'Windows-1252'?

Note that, so far, this is all about how the
encoding is indicated, not about how a document
is decoded by the viewer (buggy or not).


Olaf

Received on Friday, 31 July 2009 11:16:52 UTC