Re: [HTML5] 2.8 Character encodings

Anne van Kesteren:
> On Wed, 05 Aug 2009 10:58:43 +0200, Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de> wrote:
> > No, I think, both CSS and 'HTML5' share the same problem without
> > a version indication. However, CSS is just about styling, in doubt
> > one can switch it off. (X)HTML is about content and it is more important
> > to have a well defined and not plurivalent meaning of elements
> > (or encoding or other things related to content).
>
> If there is no versioning the latest version of the standard defines the
> meaning of elements. So that problem does not exist.
>

As already mentioned, this is the source of the problem.
If a document does not indicate which version its markup belongs
to, its meaning is not stable, or the creation date has to be used
to guess what the markup meant when the author wrote it. The
meaning of a document does not change just because a new version
of the specification for its markup language appears.
Because different interest groups can publish different versions
of HTML at any time, an author cannot predict the meaning of his
own documents unless each document is related to the specification
that applies to it.
Authors who want meaningful, time-independent markup for long-lived
documents have to use a version with a version indication, so they
cannot profit from the current efforts of the HTML5 WG.
Previous (X)HTML versions with a version indication can still be
used for this purpose, but any version without one is excluded,
because there is no way to state which language version (if there
is more than one) the document uses.


> >> This is the same point as above. Authors do not write against
> >> specifications.
> >
> > Well not all authors ...
> > As an author I started to write a specification, because (X)HTML did
> > not cover, what I wanted to markup ;o)
>
> Writing _against_ a specification and writing _a_ specification are very
> different things.

Of course. First someone has to write a specification; then others
can refer to it and write documents that use it. The main reason
for such a specification is to enable people to write documents in
the specified format. Without a relation between the individual
document and the specification, the markup of the document remains
meaningless, or its meaning is ambiguous or arbitrary.
If things are undefined, one can write a specification in order to
write well-defined documents (which is the main goal, but if no one
else has done it, one first has to create the tools before one can
start the intended work).

You wrote 'Authors do not write against specifications.'
I can simply disprove this, because I am an author and I
do write documents as specified (by others or by myself).

Maybe we can agree that some authors write documents as
specified, but many more authors do not care about
specifications and instead trust the interpretation/presentation
of their preferred browser.
In the end a specification covers documents from both types
of authors anyway, especially 'HTML5' with its advanced
error handling.


>
> > And those people with microformats and RDF(a) indicate, that not
> > all authors want to write things, already appearing somehow in
> > common browsers, but have no semantical well defined meaning ;o)
>
> I do not understand this sentence.

They create methods and structures to indicate the intended
meaning by pointing directly to a schema or a definition.
If done for all content, this can replace a version indication,
and the language itself no longer needs any meaningful elements,
because the meaning is defined by the directly referenced
definitions. Obviously it is much more convenient to refer to a
complete language version for the whole document and to refer to
additional definitions only for those fragments for which the
language itself has no meaningful elements.

>
> > Those authors, who care mainly about the appearance in current
> > browsers typically do not have long living documents in mind, even
> > if with some luck some of these documents remain several years.
>
> You make it sound like this is a fact, but this is not at all my
> experience.

Well, I have long experience discussing the real problems of
authors in forums. Many of those authors have very little
knowledge, but often they want everything immediately and
without any effort ;o)
And the majority of authors write for now or for this year
(no matter how long those documents later really remain on
the server). Their planning horizon is close to zero.
If they manage to learn some more details, some of them become
more careful to write reusable, 'valid' content, to save time when
they publish it again in a completely new structure or with
another PHP script one or two years later. Often the authors of
these PHP scripts seem to know little about (X)HTML too, though
not necessarily about PHP itself.
Only a few of them write documents for 'eternity'; the majority
believe that they have to adjust them from time to time anyway,
to the current behaviour of browsers or at least the style to
current taste. Because some of them cannot really separate style
from content (partly a consequence of some restrictions of CSS),
they have to touch the content each time they change the styling.

Additionally and unfortunately, electronic documents published
on the internet still do not have the reputation of being as
reliable and as referenceable as paper books or journals;
such authors may be one of the reasons for this low
reputation.
Another issue, of course, is that it is typically no problem
to read a paper book 50 years after publication, whereas for an
electronic format it is. One of the tasks of formats like
(X)HTML can be to show that documents still have
a well-defined meaning after 50 or more years.
The chances are not so bad now, given proper
encoding information and documents with a version
indication for the markup used.
But because no electronic format has survived
that long so far, the majority of authors do not have
such a long planning horizon.


>
> > Or they do not have much experience and rely too much on the
> > behaviour of the current version of their preferred browser.
>
> Right, which is why better interoperability is important.

This is scorched earth, given the experience with differing
behaviour over the last ten or more years. Browsers behaved
quite differently, and authors wasted a lot of time, especially
on differences in CSS interpretation.
Browsers still behave differently for HTML; for example, only
some of them provide complete access to meta information,
links, alternate stylesheets, cite information for quotes, and
date information for del and ins.
Even if those problems are assumed to be largely solved for one
or two formats, it will take many years to convince authors
that this is really the case.


>
> > Within the last ten years such authors typically had a lot of work
> > updating a few documents to the current behaviour of the newest
> > browser version. I think, this is not really a promising perspective
> > for electronic formats (as for computer hardware this is no
> > standard, this permanent handicraft work).
>
> Do you have any evidence for this? Documents from 10 years ago typically
> still render fine.
>

OK, for HTML authors in those days typically used only the
subset that was already implemented. But the change from the
Mozilla suite to Firefox (many do not use SeaMonkey), for example,
brought some restrictions, for instance for navigation/menus using
link elements.
Of course, there are many more issues due to changes and
bug fixes for CSS or redefinitions in CSS2.1. For some styles I had
to adjust my own projects from time to time; now either I write more
stable styles or the behaviour of common browsers is stabilising.


> >> There is
> >>
> >>    http://html5gallery.com/
> >>
> >> for instance, which collects sites made in HTML5.
> >
> > Indeed, within the content, the meta description or keywords it often
> > appears, that those projects are intended to be HTML5. For several
> > others it can be guessed, because elements like header, footer and
> > section are used.
>
> You mean "not used"? Those elements are part of HTML5.

No, it can be guessed that they use 'HTML5', because other
versions do not have these elements. Currently 'HTML5' is the
only version that has them (maybe with the exception of a few
which already appeared in the early XHTML2 drafts). However,
apart from the doctype, one can write an 'HTML5' document using
only elements already defined in 'HTML4' or earlier versions,
so the set of elements used is no indication.
(You did not claim that it is; I only explained why the version
cannot be derived from the set of elements, to emphasise that
this is not the way to indicate or identify the use of 'HTML5'
in a document.)
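
To illustrate the point about identification, here is a minimal Python
sketch, not a definitive tool. It assumes that the only reliable version
indication in older (X)HTML is the public identifier in the doctype. The
function name declared_version and the small lookup table are hypothetical
and chosen only for this example; a real table would need many more entries.

```python
import re

# Hypothetical lookup table: a few well-known public identifiers and the
# language version each one names. This list is deliberately incomplete.
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}

# Matches a doctype for the root element html and captures the public
# identifier if one is present.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+html(?:\s+PUBLIC\s+"([^"]*)")?[^>]*>', re.IGNORECASE
)


def declared_version(document):
    """Return the version named by the doctype, or None if none is named."""
    match = DOCTYPE_RE.search(document)
    if match is None or match.group(1) is None:
        return None
    return KNOWN_PUBLIC_IDS.get(match.group(1))


# The same body, using only elements that exist in both languages.
BODY = "<html><head><title>t</title></head><body><p>text</p></body></html>"

HTML4_DOC = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n' + BODY
)
HTML5_DOC = "<!DOCTYPE html>\n" + BODY

print(declared_version(HTML4_DOC))
print(declared_version(HTML5_DOC))
```

Both sample documents contain exactly the same elements; only the first
one states which specification it relates to.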


>
> > But that this information is spread along these meta element attributes
> > content or the content of the pages indicates even more, that a
> > version indication like version="HTML5" is missing.
>
> I don't see why.


This is only an interpretation of an observation: several of these
pages claim to be 'HTML5' within the content or the meta elements,
or indicate some relation to the keyword 'HTML5'.
One reason could be that there is no other way, such as a version
indication, to provide an unambiguous relation between the
individual document and the version 'HTML5', yet those authors
want to indicate the relation. They are convinced
that 'HTML5' is useful and want to express somehow that
they already use it. Maybe mostly for psychological reasons,
it seems to matter to them to indicate that they (already) use
this new version, even if parts of the 'HTML5 WG' seem to assume
that no one cares. Obviously these authors care.
If it were not important to those authors to indicate
the relation, they would not have indicated it.
Because they added the indication, it can be assumed that it
serves a purpose for them.


>
> > Is it expected, that such information always appears within
> > the content or the description or keywords of a 'HTML5' document?
> > Or is it intended, that the used elements are analysed to identify
> > 'HTML5'?
>
> The version does not need to be identified. As I said before, HTML is
> versionless.

Not HTML and not XHTML, only 'HTML5' currently; the other versions
have a version indication.
And whether there is a need or desire to identify a version depends
on the audience of each individual document; this cannot be
generalised with an arbitrary claim. If someone like me has use
cases for identification, this creates a need or desire.

>
> > (I think, the gallery itself does not use elements specific for HTML5).
> > Is it defined in the current draft how to indicate 'HTML5' with meta
> > elements?
>
> That would introduce versioning, so no.
>
> > Looks currently like poor design of 'HTML5' - and that many of them
> > indicate the fact, that they use 'HTML5' with such workarounds looks
> > pretty much like a gap in the current draft ;o) Why else to note several
> > times within the document the used version in different ways,
> > surely because they try to assure, that their documents are really
> > identified somehow as 'HTML5' ;o)
>
> Why would they try to assure that their documents are identified as HTML5?

I have already explained why I want to indicate a version; why they
indicate a relation explicitly, you have to ask them ;o)

> Browsers process HTML in the same way regardless, so it does not matter.
> I've said all this before though and I'm feeling this discussion is just
> going in circles.

In parts, but mainly because you still mix up the interpretation of
content with the content itself (or its meaningful representation).
If you do not understand the difference, there will indeed be nothing
for you to learn and no progress in the discussion.

>
> >> > How do they identify them as 'HTML5'? and distinguish from undefined
> >> > tag soup without a version indication?
> >>
> >> That is not needed.
> >
> > Well, this discussion and the samples you provided indicate, that there
> > is at least a desire for a version indication - maybe to get well defined
> > documents, maybe to show, that the author is a cool and funky designer
> > using already languages, which are still drafts, even if it is not known,
> > how to indicate, that they really use this cool new language version ;o)
>
> Could you maybe indicate what you mean? I do not get the same impression as
> you looking at the source code of those sites.
>

See above: several of the projects referenced in the gallery explicitly
identify themselves as 'HTML5' or indicate some relation to
this keyword.

> >>> Not, how these document are maybe interpreted today, what does the
> >>> author indicate, what they are?
> >>
> >> I'm not sure what you mean.

See above: the problem is the difference between the interpretation
of content and the content itself, or the markup of content to indicate
a semantic meaning or intention. I think this is a major step forward
in the conservation of information, after people learned to memorise
information using lyrics or poetry, to write text, to structure written
text with paragraphs, headings etc., to use machines to print books and
to store information electronically. But information does not exist
independently of the methods used to conserve it; there always has to
be a cultural agreement, or alternatively a specification, on how the
information is conserved, so that it can be extracted or interpreted
again, if necessary independently of the specific programs currently
available. And because today we have different methods of conserving
information in electronic formats, one has to indicate somehow which
specification the markup of a given document relates to.


> >
> > Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG,
> > MathML, SMIL, RDF etc define somehow, what the meaning of the
> > content of an element is, and different versions define it (slightly)
> > different, the meaning can be only derived by knowing the version.
>
> No, the meaning can only be derived by asking the author. The version does
> not have much to do with it. E.g. lots of authors abuse <blockquote> for
> indenting and others abuse longdesc="" for search engine spam. That is not
> the meaning of those HTML features.
>

Here you mix up errors, indifference or ignorance of authors with
the content of a document. Once an author has decided to use a
markup language with a specified meaning and publishes such a
document, the meaning of the document can be derived from the
document. Indeed, this can differ from the intentions of the author.
And of course, sometimes you just derive the information that
the author is indifferent and ignorant, or a cretin, or that the author
is pretending to be one. But if an author marks something up as a
blockquote, it is a blockquote.
And if something is marked up as a longdesc, it is the equivalent
of the related image. If a browser made this information
available to everyone together with the image, it would be simple for
many people to identify the author as a spammer by comparing
image and text. And it would surely help authors to improve the
content of such alternative descriptions, because they would become
accessible to everyone. Surely this alone would reduce the abuse,
because spammers typically do not want to be discovered as
spammers by the 'ordinary' human audience.
Another example would be an object with Flash content whose only
alternative text is an advertisement link to the Flash player from
Adobe. The derived meaning is simply that the Flash document is only
an advertisement for the Flash player and does not contain further
information. If the audience is not interested in such a player,
there is no need to install or activate it. This is a method I
have used for a long time simply to save time and traffic.
Once you start to believe in markup, this is an effective filter
for sorting out a lot of the nonsense around.


> > Sometimes, if the functionality changes too, the intended behaviour
> > can only be derived by knowing the version.
> > To create more than currently cool and funky designer pages it is
> > therefore important for some authors to indicate, what they really
> > mean, not just, how things appear. And if elements are defined
> > in different language versions to have different meanings, a version
> > indication is required.
>
> Maybe in an ideal world this would be the case, but given that nobody wants
> to implement versioning, versioning makes things vastly more complex, and
> older specifications (e.g. HTML 4 and CSS 2.0) are very poorly written and
> ambiguous, this is unlikely to happen.
>

Implementation in current simple browsers is another question, as
is interpretation; as we already mentioned, there is no way for authors
to force a specific interpretation.
Nevertheless it is still relevant to indicate what was intended.
For example, years later a successor may discover that current
versions of browsers do something strange (this has happened quite
often in the last 10 years, especially for styling).
With a version indication and the old specification at hand, it is
still possible to work around the problem and to republish the
document in a new version that current browsers interpret better.
Without a version indication one can only guess, and reconstructing
the intended behaviour is less likely to succeed and takes more time.
There are several use cases for an unambiguous relation between
an individual document and a specification. And there is no need to
know them all or to predict all possible future use cases;
the information is available for different purposes and not just
for one specific tool.


> To do a step back, do you have an example of an HTML 4 page you once
> created that would get "weird meaning" in HTML 5?

I already mentioned the small element: in HTML4 (and already in
HTML3.2) it was often used in our section, often together with sup and
sub, to indicate properties of atomic and molecular states, symmetries,
indices, chemical formulas etc.
'HTML5' restricts the meaning and excludes such use cases.
Of course, it now introduces parts of MathML that one can use
instead, with a more closely related semantic meaning
(this would currently be one of my favourite test cases for 'HTML5').
But this is only available if those old documents are converted
to 'HTML5'. In HTML4 the meaning could only be derived
from the surrounding content and cultural agreement,
not from the specification, but it was a possible usage.
And it was identifiable both from the presentation and from the
markup itself.
Similar things may happen with other elements that now have
a more restricted meaning or content model than in HTML4.
Fortunately HTML4 has a version indication, so the more
restrictive (not necessarily bad) definitions of 'HTML5' do not apply,
and the documents do not have to be updated to preserve the
intended meaning.
Without a version indication, and with your idea that the
latest version defines the meaning, one would have to update
those old documents once the 'HTML5' draft becomes
a specification for HTML.


>
> >> Currently <!doctype html> is required for something to be considered
> >> HTML 5. However, all HTML is consumed using the algorithm defined in the
> >> specification. (Implementations have always done this, though have
> >> differences between them because not everything was defined back in the
> >> days.)
> >
> > I think, this is not completely wrong for several HTML versions, and not
> > for XHTML or XHTML+RDFa and maybe the best choice too for
> > documents having an XHTML:html element as root element, but
> > elements from several other namespaces as well, maybe including
> > entity definitions within the doctype.
>
> I do not understand this sentence.

For example, XHTML+RDFa does not need a doctype; nevertheless
you can use <!doctype html ...>, for example to define entities if you
need them. And a compound document often has no DTD, so you can use
this doctype declaration there too.
Because it mainly says that html is the root element, you can use
it for any HTML version, not just for 'HTML5' (of course, for most
of them no version indication remains once the information
within the doctype is lost; indeed, from our current point of view it
was not very clever to combine the version indication with the DTD
information in some previous (X)HTML versions: the classical problem
of using one screw for two tasks).
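
As a rough illustration of the doctype point, the following Python sketch
parses an XML-serialised document whose doctype names only the root
element and defines one entity in the internal subset. The entity name
author, its value and the sample document are made up for this example;
the parser is the one from Python's standard library, which expands
entities declared this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no DTD is referenced, the doctype only names the
# root element and declares one internal entity.
DOC = """<!DOCTYPE html [
  <!ENTITY author "Example Author">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body><p>Written by &author;.</p></body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"

root = ET.fromstring(DOC)
paragraph = root.find(f"{XHTML}body/{XHTML}p")

print(root.tag)        # namespace-qualified name of the root element
print(paragraph.text)  # paragraph text with the entity expanded
```

Nothing in this doctype says which (X)HTML version the document relates
to; it only carries the root element name and the entity definition.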

Received on Wednesday, 5 August 2009 16:18:59 UTC