Re: [HTML5] 2.8 Character encodings

On Tue, 04 Aug 2009 17:12:29 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@gmx.de> wrote:
> And it is of course a problem for the interpretation of older documents,
> at least, if the creation time is not conserved. Obviously a stylesheet
> from 2000 does not mean CSS2.1, for 2007 no one knows if 2.0 or 2.1.

Actually, it probably does, because the style sheets actually deployed and
the Web browser implementations of CSS 2.0 were what guided the
development of CSS 2.1. The same is true of HTML 4 and HTML 5.


>> This is one way of looking at the problem. Another way of looking at the
>> problem is that specifications cannot incrementally evolve like
>> implementations do and are therefore not always accurate in what they  
>> say.
>
> For this example, both CSS2.0 and CSS2.1 have understandable
> and testable definitions of the same property, they are just different
> (by the way, the behaviour I found for Opera is different from both ;o)

Please file a bug if you have not already. Thanks! The CSS WG considers
CSS 2.0 to be mostly obsolete, by the way. You can see that /TR/CSS2/
was recently changed to point to CSS 2.1 as well.

I am just trying to explain the state of affairs with respect to CSS here.  
While you can claim it is supposed to be different, that does not make it  
so.


> Without version indication this is no evolution, this simply means
> that it has no reliable meaning (and a different interpretation in
> different versions of browsers as well). Both meaning and
> interpretation get unpredictable for authors, the second at least
> for several years.

I think this is mostly because both HTML 4 and CSS 2.0 lacked rigorous
testing of implementations and did not take implementation and author
feedback into account. With HTML 5 and CSS 2.1 we are trying to fix this
mistake.


>> Just like normal languages Web languages change now and then.
>
> But not the versions and not the intended meaning of already
> published documents. If this would be the case, it would be
> impossible to write well defined documents/content in an electronic
> format at all. Maybe the interpretation changes, because we
> have other experiences now than when such a document was
> written, but this does not change, what the author had in mind,
> when the document was created.

Actually, what we find again and again is that what the majority of
authors have in mind is the browser they are rendering their document or
style sheet in, not what the specification said. The latter only affects
test suites, and it is not worth preserving compatibility with those.


>> > For 'HTML5' - as long as I cannot simply write version="HTML5"
>> > I cannot start to write HTML5 documents.
>>
>> In effect your documents will be treated as HTML5 regardless.
>
> Here you mix up again interpretation and content.
> If you like, you can interpret a 'Microsoft Word' document or
> a PostScript or PDF as well as 'HTML5', this does not necessarily
> mean that you get the intended meaning of the document with
> such an interpretation and it is not obvious how to interpret this
> in a useful way, if author/server send them as text/html.
> This can already happen with translations of documents from one
> language to another (Shakespeare in German, Goethe in English?),
> it is obviously even worse if you try to interpret such documents
> in the other language without a translation - but of course, you can
> do it ;o)

This is the same point as above. Authors do not write against  
specifications.


>> > Already this is a 'show stopper' for 'HTML5' currently. One can still
>> > discuss the current draft, but for formal reasons one cannot write a
>> > 'HTML5' document ;o)
>>
>> As far as I know there are no formal reasons why one cannot write HTML5
>> documents and publish them and in fact many people are authoring HTML5
>> documents and publishing them.
>
> Well do you have samples?

There is

   http://html5gallery.com/

for instance, which collects sites made in HTML5.


> How do they identify them as 'HTML5'? and distinguish from undefined
> tag soup without a version indication?

That is not needed.


> Not, how these document are maybe interpreted today, what does the
> author indicate, what they are?

I'm not sure what you mean.


> Maybe in ten years they are interpreted as 'HTML6' - does it mean, that
> an author has written them as 'HTML6' today?

I assume that once we get to HTML 6 we will make sure to do the same as we
did for HTML 5, that is, study existing content, implementations, etc. and
go from there. If we succeed in what we want with HTML 5, HTML 6 will not
have to make incompatible changes.


> If I have written some HTML tag soup in 1997, does it mean, that this is
> 'HTML5', just because this tag soup has no version indication? Does it
> mean in 10 years, that I have written 'HTML6' already in 1997?

Currently <!doctype html> is required for something to be considered HTML
5. However, all HTML is consumed using the algorithm defined in the
specification. (Implementations have always done this, though with
differences between them, because not everything was defined back in the
day.)
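
To illustrate the distinction, here is a minimal sketch in Python. It
assumes the standard library's html.parser as a stand-in for a browser
parser (it is not the algorithm from the specification), and the names
DoctypeSniffer and summarize are made up for this example. The doctype
only decides whether we would call the document HTML 5; the same parser
consumes both inputs either way.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Hypothetical helper: records start tags and whether <!doctype html> was seen."""

    def __init__(self):
        super().__init__()
        self.has_html5_doctype = False
        self.tags = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "doctype html"
        if decl.strip().lower() == "doctype html":
            self.has_html5_doctype = True

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        self.tags.append(tag)


def summarize(markup):
    """Parse markup and return (has_html5_doctype, list_of_start_tags)."""
    sniffer = DoctypeSniffer()
    sniffer.feed(markup)
    sniffer.close()
    return sniffer.has_html5_doctype, sniffer.tags


# Both inputs go through the same parser; only the doctype flag differs.
print(summarize("<!doctype html><p>Hi<b>there"))
print(summarize("<P>Hi<B>there"))
```

The 1997-style tag soup yields the same list of elements as the document
with the doctype; it just is not labeled as HTML 5.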


>> This assumes versioning is necessary. Experience with Web browsers shows
>> that this is not needed and avoids a lot of complexity.
>
> Web browsers only interpret documents, and the common experience shows
> that they do it wrong and incomplete. And of course, it helps a lot to
> have a look into the source code, find a well defined, structured
> document with version indication for the interpretation, if the browser
> fails. One can try to find the problem - more often bugs in the document
> than a real problem with the browser. And one can try to find out what
> was intended by the author. For flawed documents, the intention of the
> author is often different from the interpretation of the browser, so it
> is basically a task for the author to fix the bugs - one cannot expect
> that a simple program like a browser can guess what was intended by the
> author of a flawed document. And even if some error management is
> defined in 'HTML5', one cannot assume that the result is often close to
> the intention of the author, it is just one interpretation of a flawed
> document.

I do not see how this refutes my point. Requiring browsers to do more
complex things will certainly not result in them getting fewer things
wrong or being less incomplete.


>>>> No, you can just specify it. Just like you can in HTML4.
>>>
>>> I can write the string, but indeed, if I do it, it means
>>> 'Windows-1252'.
>>
>> Not for your authoring tool or a conformance checker.
>>
>>> Therefore effectively, I cannot indicate, that something is
>>> 'ISO-8859-1' and not 'Windows-1252'.
>>
>> You cannot indicate that something needs to be decoded as ISO-8859-1 by
>> Web browsers for text/html content. This has been the case for a long  
>> time and is nothing new.
>
> It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1'
> as encoding, what is a difference, again the problem to mix up the
> content of a document with the interpretation of it.

I think you misunderstand the specification. It is an error if a document  
is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that  
above as well.
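
A rough sketch in Python of the two roles involved may help. The function
names below are invented for this example, and the check is a simplifying
assumption: it treats any byte in the 0x80-0x9F range as a sign that the
content is really Windows-1252, which is only a heuristic and not the
conformance rule as written in the draft. The browser-like decoder treats
the ISO-8859-1 label as Windows-1252; the checker-like function reports
the bytes that do not fit the label.

```python
from typing import List

# Bytes that ISO-8859-1 maps to C1 control characters, but where
# Windows-1252 places printable characters such as the euro sign.
C1_RANGE = range(0x80, 0xA0)


def decode_like_browser(data: bytes, label: str) -> str:
    """Decode bytes the way a browser would for text/html (sketch):
    an ISO-8859-1 label is treated as Windows-1252."""
    normalized = label.strip().lower()
    if normalized == "iso-8859-1":
        normalized = "cp1252"  # Python's codec name for Windows-1252
    return data.decode(normalized, errors="replace")


def label_mismatches(data: bytes, label: str) -> List[int]:
    """Return offsets of bytes a checker could flag for an ISO-8859-1 label
    (heuristic assumption: any byte in 0x80-0x9F)."""
    if label.strip().lower() != "iso-8859-1":
        return []
    return [i for i, b in enumerate(data) if b in C1_RANGE]


sample = b"Price: 5 \x80"  # 0x80 is the euro sign in Windows-1252
print(ascii(decode_like_browser(sample, "ISO-8859-1")))
print(label_mismatches(sample, "ISO-8859-1"))
```

So the label can still be written and checked; what it cannot do is force
a Web browser to decode 0x80 as a control character.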


> There is no way for an author to prevent any interpretation. But the
> author can try to indicate what was intended.

Right.


> And as I already mentioned, I have seen years ago already browsers
> without this behaviour/bug, what was somehow interesting for
> documents with wrong encoding information and unmasked euro signs ;o)

There's a reason those browsers are no longer around.


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Tuesday, 4 August 2009 16:40:50 UTC