Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-08-16 (public-html-comments@w3.org from August 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Sun, 16 Aug 2009 15:53:17 +0200
To: public-html-comments@w3.org
Message-Id: <200908161553.17369.Dr.O.Hoffmann@gmx.de>
Ian Hickson:
....
>
> On Wed, 12 Aug 2009, Dr. Olaf Hoffmann wrote:
> > The meaning of some elements is different in 'HTML5' as well or is
> > defined in a more restrictive way, what excludes some use cases possible
> > in HTML4.
>
> Yes, but in practice that's not an issue since HTML5 describes how HTML4
> UAs actually did things.
>

User agents present the elements somehow; often this does not directly
imply a meaning.
If we take (again, I already discussed this with Anne) the example
of the element small, the presentation implies no specific meaning,
which is ok for HTML4, because the definition does not imply a specific
meaning either. The audience has to derive it from the relation to the
content around it. 'HTML5' defines a meaning for small.
Therefore the 'HTML5' definition does not apply to all use cases
in HTML4 documents, just to a subset.
However, because user agents do not need to care about the meaning,
the presentation need not differ. Authors have to care about the meaning
and cannot use the element in 'HTML5' for some use cases.
This is not necessarily a problem for authors, if there are other elements
intended in 'HTML5' for their use case, and because HTML4 and 'HTML5'
are different versions: looking at the version indication (doctype), one
can at least identify when authors use the HTML4 definitions. As long as
'HTML5' has no version indication, there is no simple way to indicate
that the definition in 'HTML5' applies.

Or the meaning of cite is defined more precisely, which is ok for
a new version, but not applicable to its usage in HTML4 documents.
Ok - what proper content really is depends on several things:
if simply a private communication or a dictum is quoted, it has no
title and one has to note the name of the person.
If a work has specific authors, both title and authors and maybe
a source or a unique identifier may belong to the citation information.

acronym - a non-conforming feature in the current draft, well
defined in HTML4.
I think there is a problem with the element abbr, which is recommended
instead, in other (legacy?) versions of MSIE.
Obviously the 'HTML5' draft does not include an explanation of
the meaning of HTML4 documents here and does not necessarily
do a better job of describing the interpretation by legacy
viewers.

The content model of dl is more restrictive in 'HTML5' - surely
it cannot describe uses of the less restrictive model of HTML4.
And viewers have no problem presenting such uses; therefore
'HTML5' may have a better definition of definition lists, excluding
some not very nice use cases, but it does not describe several
really existing HTML4 documents or how they are presented
by current viewers.

I think some attributes are missing for object.
Well, some authors used some of them wrongly and
something like declare was not widely implemented.
Neither directly indicates a problem with the HTML4
definition. I think there is still no declarative method in
the draft to start some time-dependent content of object;
therefore declare is really missing in 'HTML5', not only
for object. However, if an author uses it in an HTML4
document, one cannot expect that the behaviour of
a browser ignoring this attribute is what was
intended by the author ;o)
The implementation gap simply excludes some use
cases of object in practice - maybe one of the reasons
why there is currently a lot of strange content around,
trying to simulate such functionality somehow to work
around the gap.


I think there are several more examples; all of them show that
'HTML5' does not describe all 'valid' HTML4 documents properly.
I do not think 'HTML5' has to do this, because it is a new version
of the language. It is just pretty useless to deny such simple
facts and incompatibilities.


> > And has far as I have seen, those changes are not mentioned
> > in the current draft (as well as maybe some missing attributes).
> > If we take the sample of the version attribute itself, it does not
> > define what it means, HTML4 for example does.
>
> HTML4's statements on the matter are inconsistent with actual
> implementations and legacy content.

I cannot see what is inconsistent here:

"version = cdata [CN]
Deprecated. The value of this attribute specifies which HTML DTD version 
governs the current document. This attribute has been deprecated because it 
is redundant with version information provided by the document type 
declaration.
"

This does not even suggest a specific use of the attribute or that
the interpretation or presentation by a simple browser must depend
on such information.

Of course, it was not clever to use the DTD for two purposes -
to provide a file for validators to check the document structure and
to indicate the version used.
For example, SVG (1.1, 1.2 tiny) or XHTML+RDFa avoid this problem
with this attribute.
SVG additionally has the baseProfile attribute to indicate the variant.


SVG has some differences too, for example between 1.1 and 1.2 tiny,
and some proposed errata for known problems.
Of course, if there is no logical inconsistency, just a change
of behaviour, the erratum/change mainly indicates to authors
that the feature is not really usable for the next years due to
the inconsistencies, and it is not obvious how to derive which
meaning is intended by the author if there is an erratum.
If the document is older than the erratum or the author is not
aware of the erratum, of course the change due to the erratum
is not the intended meaning. The behaviour of old browsers
is not related to the behaviour defined by the erratum. Therefore
it is mainly an indication for authors not to use the complete
feature, because it is corrupted.
If there are known logical inconsistencies for a feature, the
best an author can do is to avoid the feature anyway ;o)

The situation is even worse if an unknown future version
of the language used could change the meaning of the
currently used elements. This simply corrupts the complete
language, and the best an author can do is to use a less
stupid language to create well-defined documents ;o)

How current browsers really interpret or present old
documents (due to some practical limitations) is a different
question from the ability of an author to express which
language version is used, because in doubt one can
look into the source code to find out what was really
intended.

In the description of an 'HTML5' user agent, one could
indicate that this program only has the capability
to interpret 'HTML5' completely and tries to generate
some useful presentation for previous and future language
versions of HTML. With such information the audience
knows that there can be some problems with older
or newer documents that the program cannot solve. This
is already pretty good and helpful and much more honest
than to claim the user agent is almighty just due to the
'HTML5' draft, which tends to brush such problems under
the carpet. Of course, such problems are not solvable
for such simple programs, or not at finite expense.
And I think no author or user expects infinite effort
from implementors or a new HTML version to solve all
historical problems.


>
> > This simply shows, that the current 'HTML5' draft does not indicate, how
> > to interprete previous versions of HTML documents in general, it
> > indicates only, how to interprete 'HTML5' documents and maybe how often
> > used current browsers interprete current HTML documents (what can be
> > wrong or incomplete).
>
> If you want to be told how to interpret legacy content that is
> contemporary with HTML4, then HTML5 does a better job than HTML4.
>

The proper way to interpret old documents is first to look into the
source code to see whether it is worth caring about or not, and then to
look into the HTML4 specification manually to derive the proper
interpretation. Surely, one will never look into the 'HTML5'
draft to derive the meaning of an HTML4 document.
I will not look into a Latin-Dutch dictionary to translate a Latin text
into German either ;o)
And if I have a German book from 1970, I do not expect that the
newest German orthography rules are applicable - that would be
ridiculous ;o)
 


> > A current draft cannot change the meaning of a previous
> > specification/recommendation and it does not change the meaning of
> > documents written in this previous language version.
>
> Actually, it can, when the older specification was incorrect.

How can it be incorrect, if the semantic meaning of the content
of an element is defined?
It can contain logical inconsistencies, of course. But this is somehow
typical for languages, and one can still work with most of the less
affected features of the language without any problem.
If something was implemented differently in viewers, this is not
necessarily an indication that the specification is incorrect. Often
this is an indication that the viewers are incorrect, maybe for some
good reasons, but this is another question.

If we look again at the primary problem of this
discussion, the problem of how to indicate a specific encoding,
there might be a good reason why some viewers ignore/modify
the information provided by authors. But a curious section in the
'HTML5' draft does not change the fact that this is wrong for
HTML4 - and if there is no way for an author to indicate that
'HTML5' is used in the document, it is not applicable and
therefore still wrong behaviour for a document without a
version indication of 'HTML5'.

One cannot change the simple fact that several versions of
HTML and XHTML really exist just by denying it.
Of course, if there is no version indication for 'HTML5',
authors simply cannot indicate directly and in a simple
way that the definitions and rules of 'HTML5' are intended
for the current document, which would be a great pity given
all the efforts of so many people working on and discussing
this new language version.


Olaf
Received on Sunday, 16 August 2009 14:51:20 UTC