Re: [HTML5] 2.8 Character encodings

On Tue, 28 Jul 2009, Dr. Olaf Hoffmann wrote:
> > >
> > > 1. Which string has an author to note, if he really wants to 
> > > indicate, that the encoding is for example 'ISO-8859-1' and not 
> > > 'Windows-1252'?
> >
> > "ISO-8859-1". If the author has really used that encoding, then there 
> > is no difference between them (1252 is a superset).
> 
> I know, that there is only a difference for a few characters, not 
> relevant for ISO-8859-1 usage (see tables at wikipedia for example). 
> However, because I prefer to provide only well defined documents in 
> publications, I want to be sure, that if ISO-8859-1 is mentioned within 
> a document, that this really means ISO-8859-1 and not something else.

If you use ISO-8859-1 correctly (i.e. no control characters) then there is 
no way to distinguish the behaviour you want from the behaviour the spec 
describes if you label your document as ISO-8859-1.
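
To illustrate the point, here is a minimal sketch. It assumes Python's built-in "iso-8859-1" and "cp1252" codecs as stand-ins for the two encodings; the variable names are illustrative only. It decodes every byte a control-free ISO-8859-1 document can contain, under both labels:

```python
# Bytes a control-free ISO-8859-1 document can contain:
# printable ASCII (0x20-0x7E) plus the upper range (0xA0-0xFF).
shared = bytes(range(0x20, 0x7F)) + bytes(range(0xA0, 0x100))

as_latin1 = shared.decode("iso-8859-1")
as_cp1252 = shared.decode("cp1252")

# True when both labels produce exactly the same text.
same_result = as_latin1 == as_cp1252
```

Under those assumptions the two decodings compare equal, so for such a document the label makes no observable difference.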


> This is a problem, because 'ISO-8859-1' is the default encoding for
> HTML4 for example.

HTML4 has no default encoding.


> Therefore to switch from project still having HTML4 (and no XHTML 
> already) to 'HTML5' seems to require to switch to UTF-8 to avoid 
> plurivalences with the encoding due to the current draft.

If you're using ISO-8859-1 correctly, you can continue using it without 
any change whatsoever.


> > > 2. As far as I have seen, HTML5 has no version indication like 
> > > previous versions of HTML had and other popular formats like SVG 
> > > have. How can a browser identify, that a document is really intended 
> > > as 'HTML5' with the implicated 'willful' misinterpretations of 
> > > encoding information and no other HTML version?
> >
> > It doesn't matter, all versions of HTML are in practice processed with 
> > these mappings. It is indeed why HTML5 has these mappings -- because 
> > browsers already did this. We wouldn't add these mappings if we didn't 
> > have to to handle legacy content (content in previous versions of 
> > HTML).
> 
> Well for HTML4 and XHTML1.x and all other XML formats this is simply
> a bug of the browser, nothing to worry about for authors, because the
> string 'ISO-8859-1' has a well defined meaning in all these formats 
> completely independent from the behaviour of current buggy browsers.

A bug that is done the same way by all browsers is a de facto standard, 
and unless you don't really care about how your document is seen by your 
readers, it is something you have to worry about.



> And if some authors are forced by this bug to indicate 'cp 1252' as 
> 'ISO-8859-1' I think, this is even more an indication, that this bug 
> has to be fixed in browsers to inform those authors to fix their 
> documents to get a well defined encoding for their document instead of 
> hiding such a bug to prevent authors to fix such a nasty bug.

That would be wonderful, but we can't get there from here since it would 
cause pages to break and thus browsers refuse to do it.


> Especially 'ISO-8859-1' authors currently do not have to worry about 
> the bug, because they (typically) do not use the characters with 
> different meaning in both encodings.

Indeed, as they are control characters there is no reason that they 
should ever use those characters at all.
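
As a rough check of which characters are at stake, the sketch below again assumes Python's standard codecs and the unicodedata module; the variable names are made up for illustration. It looks at the only range where the two encodings disagree, bytes 0x80 through 0x9F:

```python
import unicodedata

# The only byte range where ISO-8859-1 and Windows-1252 disagree.
c1_bytes = bytes(range(0x80, 0xA0))

# Under ISO-8859-1 each of these bytes is a C1 control character
# (Unicode general category "Cc").
decoded = c1_bytes.decode("iso-8859-1")
all_controls = all(unicodedata.category(ch) == "Cc" for ch in decoded)

# The same byte under the two labels: a control character versus a
# left double quotation mark.
quote_byte = b"\x93"
as_latin1 = quote_byte.decode("iso-8859-1")  # U+0093
as_cp1252 = quote_byte.decode("cp1252")      # U+201C
```

A page that contains byte 0x93 and is labelled ISO-8859-1 almost certainly means the quotation mark, which is the legacy content in question.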


> With the current 'HTML5' draft already the indication as 'ISO-8859-1' is 
> plurivalent and has to be avoided to create a well defined document in 
> the format 'HTML5' (or 'HTML5' has to be avoided to create a well 
> defined document).

It's well-defined, just not defined the way you would like.


> If a server, an XML-processing instruction or maybe a meta-element 
> indicates the encoding as 'ISO-8859-1' a proper browser has to encode
> the document with 'ISO-8859-1' (with the implication that some characters
> defined differently in 'Windows-1252' do not have a useful graphical or
> acoustical representation in 'ISO-8859-1'). If within the processing, this
> hypothetical proper browser is able to detect somehow, that the current
> document is 'HTML5', the browser has to switch the encoding and some
> characters may be interpreted differently (maybe including a useful
> graphical or acoustical representation). 

I assume you mean "decoding", not "encoding".

A proper HTML5 browser will treat ISO-8859-1 as Windows-1252 for any 
text/html processing. There is no need for a magical detection ability.
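
A minimal sketch of that rule follows. The override table and the function name decode_text_html are hypothetical, not taken from any browser or from the spec. Python's "cp1252" codec is only an approximation here: it raises an error on the five byte values that Windows-1252 leaves unassigned, which a real browser would not.

```python
# Hypothetical override table: the label the document declares, mapped to
# the codec actually used for text/html. Only the one case discussed in
# this thread is listed.
HTML_LABEL_OVERRIDES = {"iso-8859-1": "cp1252"}


def decode_text_html(data: bytes, label: str) -> str:
    """Decode text/html bytes, applying the label override first."""
    normalized = label.strip().lower()
    codec = HTML_LABEL_OVERRIDES.get(normalized, normalized)
    return data.decode(codec)


# The override depends only on the declared label, never on the HTML version.
quoted = decode_text_html(b"\x93hi\x94", "ISO-8859-1")
plain = decode_text_html(b"caf\xe9", "ISO-8859-1")
```

Nothing in the lookup inspects the document itself, which is why no detection step is needed.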


> But 'HTML5' cannot redefine how to interpret encoding information for 
> other formats or versions (HTML4, XHTML1, SVG, MathML, RDF, DAISY, 
> FictionBook etc)

HTML5 redefines how to interpret encoding information for HTML4, as it 
replaces it. It does not affect processing of XML or other formats.

(In practice, what HTML5 requires is what was implemented anyway for 
HTML4, so there is no actual difference.)


> If it is known, that many buggy browser use the wrong encoding, of 
> course, this can be mentioned in the 'HTML5' draft as an (important) 
> informational note for authors to be careful, but the draft should not 
> redefine the meaning of the string incompatible to any other format.

Yes, it should. I'm documenting what is actually implemented, and making 
sure we have interoperable behaviour, and making sure that new tools will 
work with legacy content interoperably with legacy user agents and future 
user agents. This requires a candid approach and cannot be achieved by 
beating around the bush in a politically correct way about what should 
ideally have happened vs what actually happens in the real world.

Yes, this means ISO-8859-1's control characters can't be expressed in 
HTML documents. Tough. Deal with it. That's how implementations are, 
that's what legacy content relies on, and pretending otherwise is a big 
waste of everyone's time.


> Indeed, though there was never a browser interpreting 
> HTML4 complete and correct

Exactly. There wasn't. There _will_ be one that implements HTML5 
completely and correctly.


> Containing too many plurivalences, bloomers and historical stupidities 
> due to browser bugs, 'HTML5' will never be useful to create well defined 
> documents.

It's already useful for this purpose. Being honest about what happens 
makes it _more_ useful.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 28 July 2009 19:33:07 UTC