RE: [HTML5] 2.8 Character encodings

On Thu, 30 Jul 2009, Larry Masinter wrote:
> The character equivalence tables in section 2.7 should be scrapped, or 
> put in a separate "legacy content compatibility guide".
> Broken legacy content *does* disappear, [...]

Upon further consideration, I have concluded that in practice, there is 
far too much such content for it to disappear, and therefore have left 
these tables in the draft as normative requirements. All implementations 
I'm aware of either implement these requirements or have known bugs 
handling content on the Web.

> [...] and building the Hypertext Markup Language in which the normative 
> conformance requirements are restricted to those that will be useful 
> even in controlled environments.

I don't understand what this means.

> What percentage of web sites mislabel EUC-KR as windows-949, for this to 
> be a MUST requirement in HTML5?

I do not have this information immediately available, but such data would 
certainly help determine whether we should continue to support this.

Note that even 0.1% is a huge amount when you consider that there are on 
the order of a trillion pages on the Web: that is roughly a billion pages.

> The "copy/paste" use case where broken content makes its way into new 
> web pages and web applications does not apply.

It seems to apply; why do you think it does not?

> The charset equivalence tables do not apply anyway to browsers which do 
> not support the charsets for which equivalents are supplied.


> If HTML5 only requires two charsets, then requiring support for 
> equivalence tables is nonsensical.

How so?

On Thu, 30 Jul 2009, Larry Masinter wrote:
> My request:
> The definition of "charset" in the HTML 4.01 specification is much more 
> legible and understandable, and the current draft's language is opaque. 
> Readopt most of HTML 4.01 section 5.2 text; it would be a great 
> improvement in legibility.

As far as I can tell, while HTML4's text may be more legible, it is also 
woefully inaccurate and vague. Could you elaborate on what parts of HTML5 
are opaque? Maybe I can just improve them directly.

On Fri, 31 Jul 2009, Dr. Olaf Hoffmann wrote:
> 1. How to indicate the 'ISO-8859-1' encoding within an 'HTML5' document 
> and not 'Windows-1252', if an author wants to specify 'ISO-8859-1' and 
> nothing else?

Unless you use control characters, which are not valid, then specifying 
"ISO-8859-1" will result in the browser's behaviour being 
indistinguishable from what you are asking for. Therefore, the answer is 
just "specify ISO-8859-1", as in:

   <meta charset="ISO-8859-1">

...or:

   <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

> 2. How does a proper viewer/browser identify, that a document is 'HTML5' 
> and that this specific rule has to be applied, if 'ISO-8859-1' is 
> indicated.

There is no way to tell a browser to treat a file as ISO-8859-1 and not 
Windows-1252. However, unless the file is invalid and contains control 
characters, there is also no difference in the processing.
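
To illustrate the claim that the two encodings differ only where 
ISO-8859-1 has control characters, here is a minimal sketch using Python's 
built-in codecs. The function name is hypothetical, and Python's "cp1252" 
codec is assumed to be a reasonable stand-in for what browsers do; it 
leaves five bytes undefined, which are decoded here as U+FFFD.

```python
def differing_bytes():
    """Return the byte values that decode differently under the two encodings."""
    diff = []
    for b in range(256):
        raw = bytes([b])
        as_latin1 = raw.decode("iso-8859-1")
        # errors="replace" turns the bytes Python's cp1252 leaves undefined
        # into U+FFFD instead of raising UnicodeDecodeError.
        as_cp1252 = raw.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            diff.append(b)
    return diff


if __name__ == "__main__":
    d = differing_bytes()
    print(f"{len(d)} differing bytes, from 0x{d[0]:02X} to 0x{d[-1]:02X}")
```

Under these assumptions the only differences fall in the range 0x80 to 
0x9F, which ISO-8859-1 assigns to C1 control characters.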

> 3. At which point the encoding information switches from the information 
> given by the server or the XML processing instruction to the specific 
> rule of 'HTML5' to interprete the string 'ISO-8859-1' as indication for 
> 'Windows-1252'?

When the user agent invokes a character encoding convertor.
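
As a rough sketch of what that step could look like, the declared label is 
looked up in an override table just before decoding. The table below 
contains only the two pairs mentioned in this thread; the names OVERRIDES, 
PYTHON_CODECS, effective_encoding and decode_document are hypothetical and 
are not taken from the spec or from any browser.

```python
# Hypothetical override table: declared label -> encoding actually used.
OVERRIDES = {
    "iso-8859-1": "windows-1252",
    "euc-kr": "windows-949",
}

# Python's codec names for the override targets.
PYTHON_CODECS = {
    "windows-1252": "cp1252",
    "windows-949": "cp949",
}


def effective_encoding(declared):
    """Normalize the declared label and apply the override table."""
    label = declared.strip().lower()
    return OVERRIDES.get(label, label)


def decode_document(raw, declared):
    """Decode raw bytes with the encoding selected at conversion time."""
    encoding = effective_encoding(declared)
    codec = PYTHON_CODECS.get(encoding, encoding)
    return raw.decode(codec, errors="replace")


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotes in windows-1252.
    print(decode_document(b"\x93quoted\x94", "ISO-8859-1"))
```

The point of the sketch is that the label given by the server or the 
document is left untouched until the moment a decoder is chosen.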

On Sat, 1 Aug 2009, Dr. Olaf Hoffmann wrote:
> There is no indication, that this might be 'HTML5'. Therefore no
> specific rule from the 'HTML5' draft needs to be applied.

"text/html" is the indication that the HTML5 spec applies. (Or at least, 
that will be the case once we update the text/html registration.)

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
                          U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 12 August 2009 00:47:33 UTC