Re: [HTML5] 2.8 Character encodings from Anne van Kesteren on 2009-08-05 (public-html-comments@w3.org from August 2009)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 05 Aug 2009 12:08:45 +0200
To: "Dr. Olaf Hoffmann" <Dr.O.Hoffmann@gmx.de>, public-html-comments@w3.org
Message-ID: <op.ux6o4v1864w2qv@annevk-t60>
On Wed, 05 Aug 2009 10:58:43 +0200, Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de> wrote:
> No, I think, both CSS and 'HTML5' share the same problem without
> a version indication. However, CSS is just about styling, in doubt
> one can switch it off. (X)HTML is about content and it is more important
> to have a well defined and not plurivalent meaning of elements
> (or encoding or other things related to content).

If there is no versioning, the latest version of the standard defines the meaning of elements. So that problem does not exist.


>> This is the same point as above. Authors do not write against
>> specifications.
>
> Well not all authors ...
> As an author I started to write a specification, because (X)HTML did
> not cover, what I wanted to markup ;o)

Writing _against_ a specification and writing _a_ specification are very different things.


> And those people with microformats and RDF(a) indicate, that not
> all authors want to write things, already appearing somehow in
> common browsers, but have no semantical well defined meaning ;o)

I do not understand this sentence.


> Those authors, who care mainly about the appearance in current
> browsers typically do not have long living documents in mind, even
> if with some luck some of these documents remain several years.

You make it sound like this is a fact, but this is not at all my experience.


> Or they do not have much experience and rely too much on the
> behaviour of the current version of their preferred browser.

Right, which is why better interoperability is important.


> Within the last ten years such authors typically had a lot of work
> updating a few documents to the current behaviour of the newest
> browser version. I think, this is not really a promising perspective
> for electronic formats (as for computer hardware this is no
> standard, this permanent handicraft work).

Do you have any evidence for this? Documents from 10 years ago typically still render fine.


>> There is
>>
>>    http://html5gallery.com/
>>
>> for instance, which collects sites made in HTML5.
>
> Indeed, within the content, the meta description or keywords it often
> appears, that those projects are intended to be HTML5. For several
> others it can be guessed, because elements like header, footer and
> section are used.

You mean "not used"? Those elements are part of HTML5.


> But that this information is spread along these meta element attributes
> content or the content of the pages indicates even more, that a
> version indication like version="HTML5" is missing.

I don't see why.


> Is it expected, that such information always appears within
> the content or the description or keywords of a 'HTML5' document?
> Or is it intended, that the used elements are analysed to identify  
> 'HTML5'?

The version does not need to be identified. As I said before, HTML is versionless.


> (I think, the gallery itself does not use elements specific for HTML5).
> Is it defined in the current draft how to indicate 'HTML5' with meta
> elements?

That would introduce versioning, so no.


> Looks currently like poor design of 'HTML5' - and that many of them
> indicate the fact, that they use 'HTML5' with such workarounds looks
> pretty much like a gap in the current draft ;o) Why else to note several
> times within the document the used version in different ways,
> surely because they try to assure, that their documents are really
> identified somehow as 'HTML5' ;o)

Why would they try to assure that their documents are identified as HTML5? Browsers process HTML in the same way regardless, so it does not matter. I've said all this before, though, and I feel this discussion is just going in circles.


>> > How do they identify them as 'HTML5'? and distinguish from undefined
>> > tag soup without a version indication?
>>
>> That is not needed.
>
> Well, this discussion and the samples you provided indicate, that there
> is at least a desire for a version indication - maybe to get well defined
> documents, maybe to show, that the author is a cool and funky designer
> using already languages, which are still drafts, even if it is not known,
> how to indicate, that they really use this cool new language version ;o)

Could you maybe indicate what you mean? I do not get the same impression as you looking at the source code of those sites.


>>> Not, how these document are maybe interpreted today, what does the
>>> author indicate, what they are?
>>
>> I'm not sure what you mean.
>
> Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG,
> MathML, SMIL, RDF etc define somehow, what the meaning of the
> content of an element is, and different versions define it (slightly)
> different, the meaning can be only derived by knowing the version.

No, the meaning can only be derived by asking the author. The version does not have much to do with it. E.g. lots of authors abuse <blockquote> for indenting and others abuse longdesc="" for search engine spam. That is not the meaning of those HTML features.


> Sometimes, if the functionality changes too, the intended behaviour
> can only be derived by knowing the version.
> To create more than currently cool and funky designer pages it is
> therefore important for some authors to indicate, what they really
> mean, not just, how things appear. And if elements are defined
> in different language versions to have different meanings, a version
> indication is required.

Maybe in an ideal world this would be the case, but given that nobody wants to implement versioning, versioning makes things vastly more complex, and older specifications (e.g. HTML 4 and CSS 2.0) are very poorly written and ambiguous, this is unlikely to happen.

To take a step back, do you have an example of an HTML 4 page you once created that would get "weird meaning" in HTML 5?


> Without it is still possible to write those single-serving pretty nice
> and cool and funky designer pages being pretty nice and cool
> and funky for a month or a year, because the majority of authors
> of those pages do not care about details. For more, you have to
> care about details.

I care about details.


> As already mentioned, 'HTML5' defines the meaning of elements
> like small slightly different from other (X)HTML versions
> (for small it is different from my typical use cases, which are
> currently excluded by the new definition), therefore
> a version indication for the complete document is important or
> an author has to use the microdata on each element to indicate,
> to which definition it belongs. Maybe one can set a microdata
> information about the version on the root html element,
> referencing the current draft, this may work currently as an
> unambiguous version indication. This would implicate, that
> the child elements belong to the same version, if not indicated
> otherwise with even more microdata.

I suppose you could come up with a convention for yourself, yes.


>> I assume that once we'll get to HTML 6 we make sure to do the same we  
>> did for HTML 5. That is studying existing content, implementations, etc. and
>> go from there. If we succeed in what we want with HTML 5, HTML 6 will  
>> not have to make incompatible changes.
>
> Who knows?

Nobody, but that's the idea.


> The next generation may think, that semantical details are important
> and will reject all this creating a completely new language version
> avoiding all these historical meanders and it turns out, that every
> document without an unambiguous indication should be rejected as
> stupid tag soup ;o)

Sure. Hopefully they study the history (e.g. with DOCTYPE switching) and realize it works poorly.


>> Currently <!doctype html> is required for something to be considered  
>> HTML 5. However, all HTML is consumed using the algorithm defined in the
>> specification. (Implementations have always done this, though have
>> differences between them because not everything was defined back in the
>> days.)
>
> I think, this is not completely wrong for several HTML versions, and not
> for XHTML or XHTML+RDFa and maybe the best choice too for
> documents having an XHTML:html element as root element, but
> elements from several other namespaces as well, maybe including
> entity definitions within the doctype.

I do not understand this sentence.


>> I do not see how this refutes my point. Requiring browsers to do more
>> complex things will certainly not result in them doing less wrong or do
>> things less incomplete.
>
> No, but if the audience is able to study the source code and to derive,
> what the author did and intended, this works often much better, because
> the human intellectual capabilities are typically superior to that of a
> simple browser (even if the history of HTML authors seem to implicate,
> that one should not overestimate human capabilities ;o)
> Well, unfortunately due to the limitations of capabilities I have to look
> quite often in the source code - and if discussed with the author how
> to markup texts in a technical and semantical meaningful way,
> it helps simply to know instead of guessing what the author
> tried to create and it shortens the discussion.

In case of HTML they simply tried to write HTML and the latest specification of HTML gives the guidelines for that. Just like with CSS.


> And maybe in the far future, if there is still some interest in my
> documents it will hopefully help the audience to understand
> some of my intentions, if I use proper semantical markup and
> languages with well defined versions, which can be indicated
> within a document.

Since we want to remain backwards compatible with old sites I think that should be no problem.


>> I think you misunderstand the specification. It is an error if a  
>> document is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that
>> above as well.
>
> It is noted:
> "When a user agent would otherwise use an encoding given in the first  
> column of the following table to either convert content to Unicode characters or
> convert Unicode characters to bytes, it must instead use the encoding  
> given in the cell in the second column of the same row."
>
> This indicates clearly, that the encoding information (not only the  
> decoding) is changed by this rule.

This only applies to user agents as is clearly stated. Since you seem to care primarily about document meaning and conformance checkers (which have to report mismatches) I do not see how you consider that to be a problem.


> Therefore within the current 'HTML5' draft it is equivalent to
> specify 'ISO-8859-1' or 'Windows-1252'.

No. The former gives a character encoding error in conformance checkers if you use characters from the latter. (The former is a subset of the latter. I realize you said to Julian that this is not the case, but the preferred IANA name for the standard you pointed out _is_ ISO-8859-1.)
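
To make the distinction concrete, here is a minimal Python sketch of the two roles. It is an illustration under stated assumptions, not the specification's algorithm: the names LABEL_OVERRIDES, user_agent_decode and checker_mismatches are hypothetical, the table holds only the one row discussed here, and the checker rule (flag any byte in 0x80-0x9F under an ISO-8859-1 label) is a simplification of "report mismatches".

```python
# Sketch only: hypothetical helper names, and a one-row excerpt of the
# override table discussed in this thread.
LABEL_OVERRIDES = {"iso-8859-1": "windows-1252"}


def user_agent_decode(data, label):
    """Decode the way a user agent would: swap the label, then decode."""
    effective = LABEL_OVERRIDES.get(label.lower(), label.lower())
    return data.decode(effective)


def checker_mismatches(data, label):
    """Simplified conformance check (assumption): under an ISO-8859-1
    label, report the offsets of bytes in 0x80-0x9F, which Windows-1252
    uses for printable characters such as the euro sign."""
    if label.lower() != "iso-8859-1":
        return []
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


sample = b"Price: 5 \x80"  # 0x80 is the euro sign in Windows-1252

shown_to_user = user_agent_decode(sample, "ISO-8859-1")
flagged = checker_mismatches(sample, "ISO-8859-1")
strict = sample.decode("iso-8859-1")  # no override: U+0080, a control character
```

With this sample the user agent path yields a euro sign, the strict ISO-8859-1 decoding yields the control character U+0080, and the checker flags the byte at offset 9. So the two labels are not interchangeable for conformance purposes, even though user agents decode them identically.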


> And there is no indication, that this only applies to the text/html  
> variant of 'HTML5' and not to the application/xhtml+xml variant too
> (you claimed in a previous mail, that it only applies to text/html -
> where did you find that?)

Good point. That used to be the case. I think it got lost due to restructuring of sections. I filed a bug:

  http://www.w3.org/Bugs/Public/show_bug.cgi?id=7215


>>> And as I already mentioned, I have seen years ago already browsers
>>> without this behaviour/bug, what was somehow interesting for
>>> documents with wrong encoding information and unmasked euro signs ;o)
>>
>> There's a reason those browsers are no longer around.
>
> I think, it was a Konqueror and a Mozilla-Suite on Debian, I don't
> remember exactly which versions, but not so old as several
> tutorials about HTML and CSS I found within the last
> years including sophisticated hints about related issues.

If you can find pointers that'd be cool.


-- 
Anne van Kesteren
http://annevankesteren.nl/
Received on Wednesday, 5 August 2009 10:09:28 UTC