Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-08-04 (public-html-comments@w3.org from August 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Tue, 4 Aug 2009 17:12:29 +0200
To: public-html-comments@w3.org
Message-Id: <200908041712.30108.Dr.O.Hoffmann@gmx.de>
Anne van Kesteren:
> On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann
> <Dr.O.Hoffmann@gmx.de> wrote:
> > Well, this is a problem for CSS too, because some properties are
> > defined differently in CSS2.1 than in CSS2.
>
> Things change. This is not necessarily a problem in my experience.

That is why having different versions and a version indication is
useful. 'HTML5' also changes or redefines the meaning, and partly the
functionality, of some elements compared to HTML4 (or XHTML1.x).
There is a difference, for example, for elements like small or object.
Again, a version indication is useful, but of course it does not solve
the problem completely (for example, if the elements are used in
mixed XML documents that indicate only the namespace, not the version).

And it is of course a problem for the interpretation of older documents,
at least if the creation time is not preserved. Obviously a stylesheet
from 2000 does not mean CSS2.1; for one from 2007, no one knows whether
2.0 or 2.1 was meant.
Not every HTML document without a version indication is 'HTML5';
it could be old tag soup as well. Without a version indication there
should be no problem in interpreting it with an arbitrary rule set,
including 'HTML5', but this does not indicate that such a tag soup
document has a well-defined meaning at all.

>
> > I discovered this some time ago for example for clipping for
> > some SVG test documents, which appeared wrong in Opera.
> > SVG depends on CSS2, therefore these tests are still well
> > defined, applied to (X)HTML they are not testable anymore,
> > because CSS has no version indication.
>
> This is one way of looking at the problem. Another way of looking at the
> problem is that specifications cannot incrementally evolve like
> implementations do and are therefore not always accurate in what they say.

For this example, both CSS2.0 and CSS2.1 have understandable
and testable definitions of the same property; they are just different
(by the way, the behaviour I found in Opera differs from both ;o).
Without a version indication this is not evolution; it simply means
that the property has no reliable meaning (and a different
interpretation in different versions of browsers as well). Both meaning
and interpretation become unpredictable for authors, the latter at
least for several years.


> Just like normal languages Web languages change now and then.
>

But not the versions, and not the intended meaning of already
published documents. If that were the case, it would be
impossible to write well-defined documents/content in an electronic
format at all. Maybe the interpretation changes, because we
have other experiences now than when such a document was
written, but this does not change what the author had in mind
when the document was created.


> > For 'HTML5' - as long as I cannot simply write version="HTML5"
> > I cannot start to write HTML5 documents.
>
> In effect your documents will be treated as HTML5 regardless.
>

Here you again mix up interpretation and content.
If you like, you can interpret a Microsoft Word document, or
a PostScript or PDF file, as 'HTML5' as well. This does not necessarily
mean that you get the intended meaning of the document with
such an interpretation, and it is not obvious how to interpret it
in a useful way if the author/server sends it as text/html.
This can already happen with translations of documents from one
language to another (Shakespeare in German, Goethe in English?);
it is obviously even worse if you try to interpret such documents
in the other language without a translation - but of course, you can
do it ;o)

A specification like that for 'HTML5' covers more than the aspect
of interpretation.
For authors, it is also the means to indicate, specify, and mark up
the intended meaning. Once done, this does not change as long as the
author does not change the document. But the interpretation
can change at any time, or can depend on the reader (or the tool the
reader uses to interpret the document).


> > Already this is a 'show stopper' for 'HTML5' currently. One can still
> > discuss the
> > current draft, but for formal reasons one cannot write a
> > 'HTML5' document ;o)
>
> As far as I know there are no formal reasons why one cannot write HTML5
> documents and publish them and in fact many people are authoring HTML5
> documents and publishing them.
>

Well, do you have samples?
How do they identify them as 'HTML5' and distinguish them from undefined
tag soup without a version indication?
The question is not how these documents may be interpreted today, but
what the author indicates that they are.
Maybe in ten years they will be interpreted as 'HTML6' - does that mean
that an author has written them as 'HTML6' today?
If I wrote some HTML tag soup in 1997, does that mean that this is
'HTML5', just because this tag soup has no version indication? Does it
mean, in 10 years, that I already wrote 'HTML6' in 1997?
No. If the creation date is known, it simply means that this document is
one of my first stupid attempts to write an HTML document, nothing more.
And a current browser will typically not interpret this old tag soup
the way the Netscape 3.2 did that I used to 'check' its appearance,
which is no problem, because it is simply tag soup without
a requirement for a specific interpretation or a well-defined meaning.
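
To make the distinction concrete, here is a minimal sketch (not part of
any specification) of reading what the author declared, independent of
how a browser would treat the document. The function name
declared_version and the returned labels are hypothetical, and the
regular expressions only cover the simple doctype forms shown; a real
parser would have to handle much more.

```python
import re

# A DOCTYPE that carries a PUBLIC identifier, e.g. HTML 4.01 Strict.
_PUBLIC_ID = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE
)
# A bare doctype with no identifier at all.
_BARE_DOCTYPE = re.compile(r'<!DOCTYPE\s+html\s*>', re.IGNORECASE)


def declared_version(document: str) -> str:
    """Return what the author declared, not how a browser would treat it."""
    match = _PUBLIC_ID.search(document)
    if match:
        # The public identifier names a specific language version.
        return match.group(1)
    if _BARE_DOCTYPE.search(document):
        return "doctype without version"
    # Nothing declared: could be anything, including old tag soup.
    return "no declaration"
```

Only the first case tells a later reader which rule set the author had
in mind; the other two leave that to whatever tool reads the document.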


> > There is no problem to write HTML3.2, HTML4, XHTML1.0,
> > XHTML1.1 or XHTML+RDFa, even if for some of them the
> > version indication is not very elegant and not very relevant
> > for typical user agents.
>
> This assumes versioning is necessary. Experience with Web browsers shows
> that this is not needed and avoids a lot of complexity.
>

Web browsers only interpret documents, and common experience shows
that they do so incorrectly and incompletely. And of course, if the
browser fails, it helps a lot to look into the source code and find a
well-defined, structured document with a version indication to guide
the interpretation. One can try
to find the problem - more often bugs in the document than a real
problem with the browser. And one can try to find out what was intended
by the author. For flawed documents, the intention of the author is
often different from the interpretation of the browser; fixing the bugs
is basically a task for the author - one cannot expect that a simple
program like a browser can guess what was intended by the author of a
flawed document. And even if some error handling is defined in 'HTML5',
one cannot assume that the result is often close to the intention of the
author; it is just one interpretation of a flawed document.


> >> No, you can just specify it. Just like you can in HTML4.
> >
> > I can write the string, but indeed, if I do it, it means 'Windows-1252'.
>
> Not for your authoring tool or a conformance checker.
>
> > Therefore effectively, I cannot indicate, that something is
> > 'ISO-8859-1' and not 'Windows-1252'.
>
> You cannot indicate that something needs to be decoded as ISO-8859-1 by
> Web browsers for text/html content. This has been the case for a long time
> and is nothing new.

It is at least new in 'HTML5' that one cannot indicate 'ISO-8859-1'
as the encoding, which is a difference. Again, this is the problem of
mixing up the content of a document with its interpretation. There is
no way for an author to prevent any particular interpretation. But the
author can try to indicate what was intended.
And as I already mentioned, years ago I saw browsers
without this behaviour/bug, which was somewhat interesting for
documents with wrong encoding information and unmasked euro signs ;o)
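
The euro-sign case can be sketched in a few lines of Python. This is
only an illustration of the two decodings; the sample bytes are made
up, and it assumes the document stores the euro sign as the single byte
0x80, as Windows-1252 does.

```python
# Bytes of a document that contains an unescaped euro sign
# saved as Windows-1252 (euro sign = byte 0x80).
raw = b"price: \x80 5"

# Decoded strictly as ISO-8859-1, 0x80 is the C1 control character U+0080.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as Windows-1252, the same byte is the euro sign U+20AC.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1[7]), repr(as_cp1252[7]))
```

A browser that honours the 'ISO-8859-1' label literally shows no euro
sign for such a document, while one that treats the label as
'Windows-1252' does.
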
Received on Tuesday, 4 August 2009 15:37:52 UTC