Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-07-28 (public-html-comments@w3.org from July 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Tue, 28 Jul 2009 19:51:18 +0200
To: public-html-comments@w3.org
Message-Id: <200907281951.18859.Dr.O.Hoffmann@gmx.de>
Ian Hickson:
> On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
> > in the current draft are mentioned in 2.8
> > http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character
> >-encodings-0 some 'willful' misinterpretations of encoding information,
> > for example to interprete a string like 'ISO-8859-1' as 'Windows-1252'.
> >
> > 1. Which string has an author to note, if he really wants to indicate,
> > that the encoding is for example 'ISO-8859-1' and not 'Windows-1252'?
>
> "ISO-8859-1". If the author has really used that encoding, then there is
> no difference between them (1252 is a superset).
>

I know that the difference affects only a few characters, which are not
relevant for ISO-8859-1 usage (see the tables at Wikipedia, for example).
However, because I prefer to provide only well-defined documents
in publications, I want to be sure that if ISO-8859-1 is mentioned
within a document, this really means ISO-8859-1
and not something else. The current draft defines something else,
something I never used (and my preferred editor, Kate from KDE,
is able to distinguish between 'ISO-8859-1' and 'cp 1252'; Windows-1252
seems to be an alias for the latter encoding).
The only way I can see with the current draft is not to use the string
'ISO-8859-1' at all for 'HTML5', because this format defines that
the string is interpreted as 'Windows-1252' and not, as intended, as
'ISO-8859-1'.
This is a problem, because 'ISO-8859-1' is the default encoding for
HTML4, for example. Therefore, switching a project that still uses
HTML4 (and not yet XHTML) to 'HTML5' seems to require a
switch to UTF-8 to avoid the ambiguities in the encoding caused by
the current draft.
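
To make the difference concrete, here is a small Python sketch (not
part of the draft). It assumes that Python's built-in 'iso-8859-1' and
'cp1252' codecs are faithful stand-ins for the two encodings; note that
Python's cp1252 codec leaves five byte values undefined, which the
sketch counts as differences. The function name is only illustrative.

```python
def differing_bytes():
    """Return the byte values that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        as_latin1 = raw.decode("iso-8859-1")
        try:
            as_cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90, 0x9D undefined.
            as_cp1252 = None
        if as_latin1 != as_cp1252:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print(f"{len(diffs)} byte values differ, from 0x{diffs[0]:02X} to 0x{diffs[-1]:02X}")
# Byte 0x80 is a C1 control character in ISO-8859-1 but the euro sign in cp1252.
print(repr(b"\x80".decode("iso-8859-1")), repr(b"\x80".decode("cp1252")))
```

Under these assumptions the two decodings disagree only in the range
0x80 to 0x9F, which ISO-8859-1 reserves for control characters.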


> > 2. As far as I have seen, HTML5 has no version indication like previous
> > versions of HTML had and other popular formats like SVG have.
> > How can a browser identify, that a document is really intended as
> > 'HTML5' with the implicated  'willful' misinterpretations of encoding
> > information and no other HTMLversion?
>
> It doesn't matter, all versions of HTML are in practice processed with
> these mappings. It is indeed why HTML5 has these mappings -- because
> browsers already did this. We wouldn't add these mappings if we didn't
> have to to handle legacy content (content in previous versions of HTML).
>

Well, for HTML4, XHTML 1.x and all other XML formats this is simply
a bug of the browser, nothing for authors to worry about, because the
string 'ISO-8859-1' has a well-defined meaning in all these formats,
completely independent of the behaviour of current buggy browsers.
And if this bug leads some authors to label 'cp 1252' as
'ISO-8859-1', I think this is even more an indication that the bug
has to be fixed in browsers, so that those authors notice and fix their
documents to get a well-defined encoding, instead of hiding
the bug and thereby keeping authors from fixing it.

In particular, 'ISO-8859-1' authors currently do not have to worry about
the bug, because they (typically) do not use the characters that have
different meanings in the two encodings.
With the current 'HTML5' draft, however, the label 'ISO-8859-1' itself
is ambiguous and has to be avoided to create a well-defined document
in the format 'HTML5' (or 'HTML5' has to be avoided to create a
well-defined document).
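
An author who wants to check whether a given file is affected at all
could scan it for the bytes in question. The following is only a sketch
with a hypothetical helper name; it assumes the file is stored in a
single-byte encoding, so that every byte stands for one character.

```python
def ambiguous_offsets(raw):
    """Offsets of bytes in 0x80-0x9F, where the two decodings disagree."""
    return [i for i, b in enumerate(raw) if 0x80 <= b <= 0x9F]


# Typical ISO-8859-1 text uses only bytes outside that range.
sample = "Gr\u00fc\u00dfe".encode("iso-8859-1")
print(ambiguous_offsets(sample))

# A byte such as 0x85 is a control character in ISO-8859-1
# but an ellipsis in Windows-1252.
print(ambiguous_offsets(b"a\x85b"))
```

An empty result means the document reads the same under either
interpretation of the label.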

> > Assuming that a viewer is able to identify a document somehow being a
> > HTML5 document after looking into the content and for example a server
> > sended 'ISO-8859-1' before, does this mean, that the viewer switches to
> > or reparses the document with 'Windows-1252' again?
>
> I don't understand the question.

If a server, an XML declaration or maybe a meta element
indicates the encoding as 'ISO-8859-1', a proper browser has to decode
the document as 'ISO-8859-1' (with the implication that some characters
defined differently in 'Windows-1252' do not have a useful graphical or
acoustical representation in 'ISO-8859-1'). If, during processing, this
hypothetical proper browser is somehow able to detect that the current
document is 'HTML5', the browser has to switch the encoding, and some
characters may be interpreted differently (maybe including a useful
graphical or acoustical representation).
Obviously there are several options for such a proper browser as to when
and how to switch. Of course, this problem does not occur if a buggy
browser interprets the encoding wrongly for all formats, not only for
'HTML5' ;o) But 'HTML5' cannot redefine how to interpret encoding
information for other formats or versions (HTML4, XHTML1, SVG, MathML,
RDF, DAISY, FictionBook etc.).
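
The scenario can be sketched in Python as follows. This is an
illustration only, not how any real browser works: the function names
and the is_html5 flag are hypothetical, and the override table contains
just the one mapping discussed in this thread.

```python
# Hypothetical override table; only the ISO-8859-1 entry comes from this thread.
HTML5_DRAFT_OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(declared_label, is_html5):
    """Return the encoding a 'proper' browser would use for a declared label."""
    label = declared_label.strip().lower()
    if is_html5:
        # The draft's remapping applies only once the format is known to be HTML5.
        return HTML5_DRAFT_OVERRIDES.get(label, label)
    return label


def decode_document(raw, declared_label, is_html5):
    """Decode raw bytes according to the label and the detected format."""
    return raw.decode(effective_encoding(declared_label, is_html5))


raw = b"caf\xe9 \x93quoted\x94"
# Same bytes, same declared label, different text depending on the format.
print(repr(decode_document(raw, "ISO-8859-1", is_html5=False)))
print(repr(decode_document(raw, "ISO-8859-1", is_html5=True)))
```

In the first case the bytes 0x93 and 0x94 become control characters, in
the second they become quotation marks, so the result depends on when
the browser learns that the document is 'HTML5'.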

>
> > Obviously it would be better to avoid such misinterpretation by using an
> > encoding like UTF-8 not confused by the current HTML5 draft, however due
> > to the history of older projects or server configurations it might be
> > still convenient for many authors to continue to use 'ISO-8859-1'
> > instead of other encodings, even if they switch for example from HTML4
> > to HTML5 for some documents.
>
> Hopefully my answers above will reassure you that this is not in fact a
> problem that authors will face.

Yes and no - the behaviour of current buggy browsers for other formats
is in practice not really a problem for authors of 'ISO-8859-1' documents,
because these documents typically do not contain the ambiguous
characters.
However, if it is specified that 'ISO-8859-1' is not 'ISO-8859-1' in 'HTML5',
this is a general problem for authors wanting to create well-defined
documents with this encoding, because 'HTML5' currently has no string
that provides the proper encoding information: the
commonly used string 'ISO-8859-1' is tainted or corrupted by the
current 'HTML5' draft.

If it is known that many buggy browsers use the wrong encoding,
this can of course be mentioned in the 'HTML5' draft as an
(important) informational note telling authors to be careful, but the
draft should not redefine the meaning of the string in a way that is
incompatible with every other format.
The 'HTML5' draft should not mix up reporting browser bugs
with proper definitions. If this happens more often, thoughtful
authors may get the impression that 'HTML5' is only about
the behaviour of current buggy browsers and not something that
defines a new version of (X)HTML, which is something completely
different. Even if there is no browser without bugs, an author
can still write well-defined documents in such a format, but
this is not possible if the format itself has major ambiguities
and contradictions.
Indeed, though there was never a browser interpreting
HTML4 completely and correctly, it is still pretty useful for writing
documents (maybe less useful than XHTML, but 'HTML5'
avoids this problem by already defining an XML variant too).
Containing too many ambiguities, blunders and historical
stupidities due to browser bugs, 'HTML5' will never be useful
for creating well-defined documents.
Received on Tuesday, 28 July 2009 18:00:00 UTC