Re: Auto-detect and encodings in HTML5 from Maciej Stachowiak on 2009-06-01 (public-html@w3.org from June 2009)

From: Maciej Stachowiak <mjs@apple.com>
Date: Mon, 01 Jun 2009 15:38:47 -0700
To: Geoffrey Sneddon <foolistbar@googlemail.com>
Cc: Larry Masinter <masinter@adobe.com>, Anne van Kesteren <annevk@opera.com>, Chris Wilson <Chris.Wilson@microsoft.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-id: <298E47A1-5404-463E-A2D3-216083969108@apple.com>

On Jun 1, 2009, at 2:09 PM, Geoffrey Sneddon wrote:

>
> On 1 Jun 2009, at 19:37, Larry Masinter wrote:
>
>> New behavior: IF you see, say, <doctype html5> THEN assume default
>> charset is UTF8, rather than applying heuristics to guess charset.

The result of not applying heuristics would be Windows-1252 - that is  
the default except in the rare cases where the heuristics find a  
match. I still don't understand why disabling the heuristics should be  
tied to changing the default from Windows-1252 to UTF-8.
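
To make the point concrete, here is a minimal Python sketch in which
"use heuristics or not" and "which default" are two separate knobs.
The function name, its parameters, and the toy detector are all
hypothetical illustrations, not taken from the HTML 5 draft or from
any browser:

```python
from typing import Callable, Optional

Sniffer = Callable[[bytes], Optional[str]]

def choose_encoding(
    data: bytes,
    declared: Optional[str] = None,
    sniff: Optional[Sniffer] = None,
    default: str = "windows-1252",
) -> str:
    """Explicit declaration first, then heuristics (if enabled), then the default."""
    if declared:
        return declared
    if sniff is not None:
        guess = sniff(data)
        if guess:
            return guess
    return default

def toy_sniff(data: bytes) -> Optional[str]:
    # Hypothetical stand-in for a real detector: it only recognizes the
    # ISO-2022-JP escape sequence that switches to JIS X 0208.
    return "iso-2022-jp" if b"\x1b$B" in data else None

# Heuristics disabled, default untouched.
print(choose_encoding(b"<p>hello</p>"))
# Heuristics enabled but no match: the default still decides.
print(choose_encoding(b"<p>hello</p>", sniff=toy_sniff))
# Changing the default is an independent choice.
print(choose_encoding(b"<p>hello</p>", default="utf-8"))
```

Passing sniff=None turns the heuristics off without touching the
default, and changing default does not require turning them off.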

>
> If you see it how? You need to have read the encoded string to see  
> such a string.
>
>> Yes, supplying explicit charset is preferable, but what would break
>> if such a new rule were supplied?
>
> The problem is that any HTML 5 content served as text/html will be  
> treated as Windows-1252 by all existing user agents and UTF-8 by new  
> ones, which is problematic and will lead to problems (as people tend  
> to only test in one browser, and if it works in one browser assume  
> it works everywhere) as it is hence inconsistent.

Good point. The Degrade Gracefully design principle says:

"On the World Wide Web, authors are often reluctant to use new  
language features that cause problems in older user agents, or that do  
not provide some sort of graceful fallback. HTML 5 document  
conformance requirements should be designed so that Web content can  
degrade gracefully in older or less capable user agents, even when  
making use of new elements, attributes, APIs and content models."

Making the doctype switch the default from Windows-1252 to UTF-8 would
mean that only pure-ASCII documents work correctly in both older and
newer user agents, unless the author explicitly declares an encoding.
Any byte above 0x7F would be decoded one way by old user agents and
another way by new ones. And if you have to declare UTF-8 explicitly
to get UTF-8 everywhere, then nothing has been gained for careful
authors, while unaware authors face an unexpected hazard.
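
A small Python sketch of the hazard, assuming an undeclared page whose
bytes are UTF-8. The sample markup and variable names are made up for
illustration; "cp1252" is Python's name for the Windows-1252 codec:

```python
# Two undeclared pages, both saved by the author as UTF-8.
ascii_page = "<p>cafe</p>".encode("utf-8")
utf8_page = "<p>café</p>".encode("utf-8")

# An "old" user agent falls back to Windows-1252; a hypothetical "new"
# one would fall back to UTF-8 under the proposed rule.
old_ascii = ascii_page.decode("cp1252")
new_ascii = ascii_page.decode("utf-8")
old_utf8 = utf8_page.decode("cp1252")
new_utf8 = utf8_page.decode("utf-8")

print(old_ascii == new_ascii)  # True: pure ASCII reads the same either way
print(old_utf8)                # <p>cafÃ©</p>
print(new_utf8)                # <p>café</p>
```

An author who only tests in the new user agent sees the last line and
never learns that older ones show the one above it.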

Regards,
Maciej

Received on Monday, 1 June 2009 22:39:29 UTC