Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-08-05 (public-html-comments@w3.org from August 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Wed, 5 Aug 2009 10:58:43 +0200
To: public-html-comments@w3.org
Message-Id: <200908051058.43744.Dr.O.Hoffmann@gmx.de>
Anne van Kesteren:

....

> I am just trying to explain the state of affairs with respect to CSS here.
> While you can claim it is supposed to be different, that does not make it
> so.

No, I think both CSS and 'HTML5' share the same problem: neither has
a version indication. However, CSS is only about styling; if in doubt,
one can switch it off. (X)HTML is about content, so there it is more
important to have a well-defined, unambiguous meaning of elements
(or encoding, or other things related to content).


....



> >> > For 'HTML5' - as long as I cannot simply write version="HTML5"
> >> > I cannot start to write HTML5 documents.
> >>
> >> In effect your documents will be treated as HTML5 regardless.
> >
> > Here you mix up again interpretation and content.
> > If you like, you can interpret a 'microsoft word' document or
> > a postscript or PDF as well as 'HTML5', this does not necessarily
> > mean, that you get the intended meaning of the document with
> > such an interpretation and it is not obvious, how to interpret this
> > in a useful way, if author/server send them as text/html.
> > This can already happen with translations of documents from one
> > language to another (Shakespeare in German, Goethe in English?),
> > it is obviously even worse, if you try to interpret such documents
> > in the other language without a translation - but of course, you can
> > do it ;o)
>
> This is the same point as above. Authors do not write against
> specifications.

Well, not all authors ...
As an author I started to write a specification because (X)HTML did
not cover what I wanted to mark up ;o)
And the people working with microformats and RDF(a) show that not
all authors are content to write things that already appear somehow in
common browsers but have no semantically well-defined meaning ;o)

Those authors who care mainly about the appearance in current
browsers typically do not have long-lived documents in mind, even
if, with some luck, some of these documents survive for several years.
Or they do not have much experience and rely too much on the
behaviour of the current version of their preferred browser.
Within the last ten years such authors typically had a lot of work
updating even a few documents to match the behaviour of the newest
browser version. I think this is not really a promising perspective
for electronic formats (as with computer hardware, this permanent
handicraft work is no standard).


>
> >> > Already this is a 'show stopper' for 'HTML5' currently. One can still
> >> > discuss the
> >> > current draft, but for formal reasons one cannot write a
> >> > 'HTML5' document ;o)
> >>
> >> As far as I know there are no formal reasons why one cannot write HTML5
> >> documents and publish them and in fact many people are authoring HTML5
> >> documents and publishing them.
> >
> > Well do you have samples?
>
> There is
>
>    http://html5gallery.com/
>
> for instance, which collects sites made in HTML5.

Indeed, the content, the meta description or the keywords often show
that those projects are intended to be HTML5. For several others it
can be guessed, because elements like header, footer and section are
used.
But the fact that this information is spread across the content
attributes of these meta elements or the content of the pages indicates
even more that a version indication like version="HTML5" is missing.
Is it expected that such information always appears within the
content, the description or the keywords of an 'HTML5' document?
Or is it intended that the elements used are analysed to identify
'HTML5'? (I think the gallery itself does not use elements specific to
HTML5.)
Is it defined in the current draft how to indicate 'HTML5' with meta
elements?
Currently this looks like poor design of 'HTML5' - and that many of
them indicate their use of 'HTML5' with such workarounds looks
pretty much like a gap in the current draft ;o) Why else would they
note the version used several times within the document, in different
ways? Surely because they try to ensure that their documents are
really identified somehow as 'HTML5' ;o)
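
To illustrate what 'analysing the used elements' would amount to, here
is a minimal sketch in Python. The set HTML5_ONLY and the function name
html5_elements_used are assumptions made up for this illustration; the
set is only a small sample of element names new in the draft, and
nothing like this is defined in the draft itself.

```python
from html.parser import HTMLParser

# Assumed sample of element names introduced by the 'HTML5' draft;
# deliberately incomplete and only meant for illustration.
HTML5_ONLY = {"header", "footer", "section", "article", "nav", "aside"}


class ElementCollector(HTMLParser):
    """Collects the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names.
        self.seen.add(tag)


def html5_elements_used(source):
    """Return the subset of HTML5_ONLY elements occurring in source."""
    collector = ElementCollector()
    collector.feed(source)
    collector.close()
    return collector.seen & HTML5_ONLY
```

An empty result says nothing about the version the author intended,
which is exactly the weakness of such guessing.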


>
> > How do they identify them as 'HTML5'? and distinguish from undefined
> > tag soup without a version indication?
>
> That is not needed.

Well, this discussion and the samples you provided indicate that there
is at least a desire for a version indication - maybe to get well-defined
documents, maybe to show that the author is a cool and funky designer
already using languages which are still drafts, even if it is not known
how to indicate that they really use this cool new language version ;o)


>
> > Not, how these document are maybe interpreted today, what does the
> > author indicate, what they are?
>
> I'm not sure what you mean.
>

Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG,
MathML, SMIL, RDF etc. define what the content of an element
means, and different versions define it (slightly) differently,
the meaning can only be derived by knowing the version.
Sometimes, if the functionality changes too, the intended behaviour
can only be derived by knowing the version.
To create more than currently cool and funky designer pages, it is
therefore important for some authors to indicate what they really
mean, not just how things appear. And if elements are defined
to have different meanings in different language versions, a version
indication is required.
Without it, it is still possible to write those single-serving designer
pages that are pretty nice and cool and funky for a month or a year,
because the majority of the authors of those pages do not care about
details. For more, you have to care about details.

As already mentioned, 'HTML5' defines the meaning of elements
like small slightly differently from other (X)HTML versions
(for small it differs from my typical use cases, which are
currently excluded by the new definition). Therefore
a version indication for the complete document is important, or
an author has to use microdata on each element to indicate
to which definition it belongs. Maybe one can set microdata
information about the version on the root html element,
referencing the current draft; this may currently work as an
unambiguous version indication. This would imply that
the child elements belong to the same version, if not indicated
otherwise with even more microdata.



> > Maybe in ten years they are interpreted as 'HTML6' - does it mean, that
> > an author has written them as 'HTML6' today?
>
> I assume that once we'll get to HTML 6 we make sure to do the same we did
> for HTML 5. That is studying existing content, implementations, etc. and
> go from there. If we succeed in what we want with HTML 5, HTML 6 will not
> have to make incompatible changes.
>

Who knows?
The next generation may think that semantic details are important
and reject all this, creating a completely new language version
that avoids all these historical meanders, and it may turn out that
every document without an unambiguous indication should be rejected as
stupid tag soup ;o)


> > If I have written some HTML tag soup in 1997, does it mean, that this is
> > 'HTML5', just because this tag soup has no version indication? Does it
> > mean in 10 years, that I have written 'HTML6' already in 1997?
>
> Currently <!doctype html> is required for something to be considered HTML
> 5. However, all HTML is consumed using the algorithm defined in the
> specification. (Implementations have always done this, though have
> differences between them because not everything was defined back in the
> days.)
>

I think this is not completely wrong for several HTML versions, and not
for XHTML or XHTML+RDFa, and it is maybe the best choice too for
documents that have an XHTML:html element as root element but
contain elements from several other namespaces as well, maybe
including entity definitions within the doctype.
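
As a rough sketch of what a check for the doctype mentioned above could
look like (Python; the function name has_html5_doctype and the regular
expression are assumptions for illustration, not the parsing algorithm
of the draft, and comments before the doctype are not handled):

```python
import re

# Simplified pattern: optional byte order mark and whitespace, then
# <!doctype html> in any letter case, with nothing else inside it.
DOCTYPE_HTML = re.compile(r"^\ufeff?\s*<!doctype\s+html\s*>", re.IGNORECASE)


def has_html5_doctype(source):
    """Return True if source starts with the bare <!doctype html>."""
    return DOCTYPE_HTML.match(source) is not None
```

Such a check only tells which doctype string was written; it does not
tell which definitions of the elements the author had in mind.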


> >> This assumes versioning is necessary. Experience with Web browsers shows
> >> that this is not needed and avoids a lot of complexity.
> >
> > Web browsers only interpret documents, and the common experience shows,
> > that they do it wrong and incomplete. And of course, it helps a lot to
> > have a look into the source code, find a well defined, structured
> > document with
> > version indication for the interpretation, if the browser fails. One can
> > try to find the problem - more often bugs in the document than a real
> > problem with the browser. And one can try to find out, what was intended
> > by the author. For flawed documents, the intention of the author is often
> > different from the interpretation of the browser, what is basically a
> > task for the author to fix the bugs - one cannot expect, that a simple
> > program
> > like a browser can guess, what was intended by the author of a flawed
> > document. And even if some error management is defined in 'HTML5',
> > one cannot assume, that the result is often close to the intention of the
> > author, it is just one interpretation of a flawed document.
>
> I do not see how this refutes my point. Requiring browsers to do more
> complex things will certainly not result in them doing less wrong or do
> things less incomplete.


No, but if the audience is able to study the source code and to derive
what the author did and intended, this often works much better, because
human intellectual capabilities are typically superior to those of a
simple browser (even if the history of HTML authors seems to imply
that one should not overestimate human capabilities ;o)
Well, unfortunately, due to such limitations of capabilities I have to
look into the source code quite often - and when discussing with the
author how to mark up texts in a technically and semantically
meaningful way, it simply helps to know, instead of guessing, what the
author tried to create, and it shortens the discussion.
And maybe in the far future, if there is still some interest in my
documents, it will hopefully help the audience to understand
some of my intentions if I use proper semantic markup and
languages with well-defined versions, which can be indicated
within a document.

>
> >>>> No, you can just specify it. Just like you can in HTML4.
> >>>
> >>> I can write the string, but indeed, if I do it, it means
> >>
> >> 'Windows-1252'.
> >>
> >> Not for your authoring tool or a conformance checker.
> >>
> >>> Therefore effectively, I cannot indicate, that something is
> >>> 'ISO-8859-1' and not 'Windows-1252'.
> >>
> >> You cannot indicate that something needs to be decoded as ISO-8859-1 by
> >> Web browsers for text/html content. This has been the case for a long
> >> time and is nothing new.
> >
> > It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1'
> > as encoding, what is a difference, again the problem to mix up the
> > content of a document with the interpretation of it.
>
> I think you misunderstand the specification. It is an error if a document
> is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that
> above as well.
>


It is noted:
"
When a user agent would otherwise use an encoding given in the first column of 
the following table to either convert content to Unicode characters or 
convert Unicode characters to bytes, it must instead use the encoding given 
in the cell in the second column of the same row.
"

This clearly indicates that the encoding information (not only the
decoding) is changed by this rule.
Therefore, within the current 'HTML5' draft, specifying 'ISO-8859-1'
is equivalent to specifying 'Windows-1252'.
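
To make the practical difference visible, here is a small sketch in
Python. The OVERRIDES mapping and the function effective_encoding are
hypothetical names for this illustration; the mapping contains only the
one row of the table discussed here, not the complete table from the
draft.

```python
# Byte 0x80 is the euro sign in Windows-1252, but a C1 control
# character (U+0080) in ISO-8859-1.
raw = b"Price: 5 \x80"

as_latin1 = raw.decode("iso-8859-1")    # last character is U+0080
as_cp1252 = raw.decode("windows-1252")  # last character is U+20AC

# Hypothetical helper mirroring the quoted rule: a label from the first
# column is replaced by the encoding in the second column.
OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(label):
    """Return the encoding actually used for a declared label."""
    key = label.strip().lower()
    return OVERRIDES.get(key, key)
```

With such a rule in place, a document declared as 'ISO-8859-1' and
containing the byte 0x80 is read as containing a euro sign, whatever
the author meant to indicate.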

And there is no indication that this applies only to the text/html
variant of 'HTML5' and not to the application/xhtml+xml variant too
(you claimed in a previous mail that it only applies to text/html -
where did you find that?)


> > There is no way for an author to prevent any interpretation. But the
> > author can try to indicate, what was intended.
>
> Right.
>
> > And as I already mentioned, I have seen years ago already browsers
> > without this behaviour/bug, what was somehow interesting for
> > documents with wrong encoding information and unmasked euro signs ;o)
>
> There's a reason those browsers are no longer around.

I think it was a Konqueror and a Mozilla Suite on Debian; I don't
remember exactly which versions, but they were not as old as several
tutorials about HTML and CSS that I found within the last
years, including sophisticated hints about related issues.
Received on Wednesday, 5 August 2009 09:27:44 UTC