RE: [HTML5] 2.8 Character encodings from Larry Masinter on 2009-07-20 (public-html@w3.org from July 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Mon, 20 Jul 2009 07:01:55 -0700
To: Ian Hickson <ian@hixie.ch>, "Dr. Olaf Hoffmann" <Dr.O.Hoffmann@gmx.de>
CC: HTML WG <public-html@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118D7F3F233@nambx04.corp.adobe.com>

What the document should say, rather than having a 'willful'
misinterpretation, is that ISO-8859-1 means ISO-8859-1, but that
for backward compatibility with existing (broken) web content,
HTML interpreting agents SHOULD treat characters outside of the
ISO-8859-1 repertoire as if they were in Windows-1252.
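
As a rough illustration of the rule suggested above, here is a minimal
Python sketch. The function name decode_lenient_latin1 is hypothetical,
and reading "outside of the ISO-8859-1 repertoire" as the byte range
0x80-0x9F is my assumption, as is the fallback for the few bytes that
Python's cp1252 codec leaves unassigned.

```python
def decode_lenient_latin1(data: bytes) -> str:
    """Decode bytes labelled ISO-8859-1, reading 0x80-0x9F as Windows-1252.

    Assumption: 0x80-0x9F is the range treated as "outside the repertoire".
    """
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            try:
                # Compatibility path: interpret the byte as Windows-1252.
                out.append(bytes([b]).decode("cp1252"))
            except UnicodeDecodeError:
                # Python's cp1252 codec has no mapping for a few bytes
                # (e.g. 0x81); keep the ISO-8859-1 code point for those.
                out.append(chr(b))
        else:
            # ISO-8859-1 maps each byte to the same Unicode code point.
            out.append(chr(b))
    return "".join(out)
```

A validator or generator would not need this path at all; it could
decode strictly as ISO-8859-1 and flag any byte in that range.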

This would allow and encourage HTML validators and HTML generation
software to use the correct interpretation without a 'willful'
disregard for compatibility with other standards and processing
agents outside of the scope of the specifications of this
committee. IMHO, the willful disregard for compatibility with other
specifications in the current specification reflects a consistent 
error in judgment.

I reject as an unsound design principle the notion that merely
because there exists some broken web content today that we are
forced to encode that broken behavior in HTML forever. Yes,
HTML interpreting agents that wish to be compatible with existing
content will need to apply some additional constraints and
extensions, but it is unnecessary, and poor design, to fail
to distinguish between advice to interpreting agents as to
backward-compatibility behavior vs. advice to generating and
authoring agents as to proper forward-looking behavior.

Larry
--
http://larry.masinter.net


-----Original Message-----
From: public-html-comments-request@w3.org [mailto:public-html-comments-request@w3.org] On Behalf Of Ian Hickson
Sent: Monday, July 20, 2009 1:57 AM
To: Dr. Olaf Hoffmann
Cc: public-html-comments@w3.org
Subject: Re: [HTML5] 2.8 Character encodings

On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
> 
> Section 2.8 of the current draft,
> http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
> mentions some 'willful' misinterpretations of encoding information, for
> example interpreting a string like 'ISO-8859-1' as 'Windows-1252'.
>
> 1. Which string should an author use if he really wants to indicate that
> the encoding is, for example, 'ISO-8859-1' and not 'Windows-1252'?

"ISO-8859-1". If the author has really used that encoding, then there is 
no difference between them (1252 is a superset).
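
One way to check where the two encodings can differ at all is to decode
every byte value both ways. The Python sketch below is only an
illustration: the function name differing_bytes is hypothetical, and it
uses Python's "latin-1" and "cp1252" codecs as stand-ins for ISO-8859-1
and Windows-1252. Bytes that the cp1252 codec leaves unassigned are
decoded with errors="replace", so they count as differences.

```python
def differing_bytes() -> list[int]:
    """Return the byte values that decode differently in the two codecs."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        as_latin1 = raw.decode("latin-1")
        # Unassigned cp1252 bytes become U+FFFD instead of raising.
        as_cp1252 = raw.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            diffs.append(b)
    return diffs


if __name__ == "__main__":
    diffs = differing_bytes()
    print(f"{len(diffs)} differing bytes: 0x{diffs[0]:02X}-0x{diffs[-1]:02X}")
```

Under these assumptions the list covers only 0x80 through 0x9F, which
hold no printable characters in ISO-8859-1, so text that uses only
printable ISO-8859-1 characters decodes identically either way.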


> 2. As far as I have seen, HTML5 has no version indication like previous
> versions of HTML had and other popular formats like SVG have.
> How can a browser identify that a document is really intended as
> 'HTML5', with the implied 'willful' misinterpretations of encoding
> information, and not as another HTML version?

It doesn't matter, all versions of HTML are in practice processed with 
these mappings. It is indeed why HTML5 has these mappings -- because 
browsers already did this. We wouldn't add these mappings if we didn't 
have to in order to handle legacy content (content in previous versions
of HTML).


> Assuming that a viewer is somehow able to identify a document as an
> HTML5 document after looking into the content and, for example, a server
> sent 'ISO-8859-1' before, does this mean that the viewer switches to
> 'Windows-1252' or reparses the document with it?

I don't understand the question.


> Obviously it would be better to avoid such misinterpretation by using an
> encoding like UTF-8, which the current HTML5 draft does not reinterpret.
> However, due to the history of older projects or server configurations,
> it might still be convenient for many authors to continue to use
> 'ISO-8859-1' instead of other encodings, even if they switch, for
> example, from HTML4 to HTML5 for some documents.

Hopefully my answers above will reassure you that this is not in fact a 
problem that authors will face.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 20 July 2009 14:02:40 UTC