Re: Auto-detect and encodings in HTML5 from Leif Halvard Silli on 2009-06-02 (public-html@w3.org from June 2009)

From: Leif Halvard Silli <lhs@malform.no>
Date: Tue, 02 Jun 2009 02:48:37 +0200
To: Maciej Stachowiak <mjs@apple.com>
CC: Geoffrey Sneddon <foolistbar@googlemail.com>, Larry Masinter <masinter@adobe.com>, Anne van Kesteren <annevk@opera.com>, Chris Wilson <Chris.Wilson@microsoft.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-ID: <4A2476E5.5090601@malform.no>

Maciej Stachowiak On 09-06-02 00.38:
> On Jun 1, 2009, at 2:09 PM, Geoffrey Sneddon wrote:
>> On 1 Jun 2009, at 19:37, Larry Masinter wrote:
>>
>>> New behavior: IF you see, say, <doctype html5> THEN assume default 
>>> charset
>>> is UTF8, rather than applying heuristics to guess charset.
> 
> The result of not applying heuristics would be Windows-1252 - that is 
> the default except in the rare cases where the heuristics find a match. 
> I still don't understand why disabling the heuristics should be tied to 
> changing the default from Windows-1252 to UTF-8.

Is it the choice of UTF-8 as default you don't understand? If so, 
then I'd like to quote the "Support World Languages" principle.

>> If you see it how? You need to have read the encoded string to see 
>> such a string.
>>
>>> Yes, supplying explicit charset is preferable, but what would break
>>> if such a new rule were supplied?
>>
>> The problem is that any HTML 5 content served as text/html will be 
>> treated as Windows-1252 by all existing user agents and UTF-8 by new 
>> ones, which is problematic and will lead to problems (as people tend 
>> to only test in one browser, and if it works in one browser assume it 
>> works everywhere) as it is hence inconsistent.
> 
> Good point. The Degrade Gracefully design principle says:
> 
> "On the World Wide Web, authors are often reluctant to use new language 
> features that cause problems in older user agents, or that do not 
> provide some sort of graceful fallback. HTML 5 document conformance 
> requirements should be designed so that Web content can degrade 
> gracefully in older or less capable user agents, even when making use of 
> new elements, attributes, APIs and content models."
> 
> Making the doctype switch the default from Windows-1252 to UTF-8 will 
> mean only ASCII documents work correctly in both older and newer user 
> agents, unless the author explicitly declares an encoding. If you have 
> to explicitly declare a UTF-8 charset to get UTF-8, then nothing has 
> been gained for careful authors. But unaware authors face an unexpected 
> hazard.
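
The hazard described above can be shown in a few lines. This is only a
minimal sketch in Python, with a made-up sample string; "cp1252" is
Python's codec name for Windows-1252. It decodes the same UTF-8 bytes
under each of the two defaults.

```python
# Hypothetical sample text; any non-ASCII character shows the same effect.
text = "caf\u00e9"  # "café"
utf8_bytes = text.encode("utf-8")

# A newer user agent that defaults to UTF-8 recovers the original text.
as_utf8 = utf8_bytes.decode("utf-8")

# An older user agent that defaults to Windows-1252 treats each byte as one
# character, so the two-byte sequence for "é" becomes two wrong characters.
as_cp1252 = utf8_bytes.decode("cp1252")

# Pure ASCII is byte-identical under both encodings, so it survives either default.
ascii_bytes = "cafe".encode("ascii")
ascii_same = ascii_bytes.decode("utf-8") == ascii_bytes.decode("cp1252")
```

Here as_utf8 equals the original string, as_cp1252 does not, and
ascii_same is true, which is why only ASCII documents work in both
older and newer user agents without an explicit declaration.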

There is one aspect that you are - again - forgetting: authoring 
tools and web servers.

If conforming authoring tools had to default to UTF-8 whenever 
someone chooses to create an HTML 5 document (much the same way 
that XML defaults to UTF-8/-16), then that would be a bonus, a 
simplification and a _motivation_ for using HTML 5.

The next level should be that web servers default to sending a 
charset header that says "UTF-8" whenever they see the HTML 5 
doctype.

Thus we could leave the Web browser behaviour as drafted, but 
require UTF-8 as the default from servers and authoring tools.
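
As a rough sketch of the server side of this idea (Python, not any 
real server's API): the function name content_type_for is 
hypothetical, and I assume the HTML 5 doctype is spelled 
<!DOCTYPE html>, matched case-insensitively after an optional UTF-8 
BOM and leading whitespace.

```python
import re

# Assumption: the HTML 5 doctype is "<!DOCTYPE html>", case-insensitive.
# Legacy doctypes with a PUBLIC or SYSTEM identifier do not match.
HTML5_DOCTYPE = re.compile(rb"\s*<!doctype\s+html\s*>", re.IGNORECASE)
UTF8_BOM = b"\xef\xbb\xbf"


def content_type_for(body: bytes) -> str:
    """Return a Content-Type value for an HTML response body.

    Hypothetical helper: adds charset=utf-8 only when the body starts
    with the HTML 5 doctype, and otherwise sends no charset parameter.
    """
    if body.startswith(UTF8_BOM):
        body = body[len(UTF8_BOM):]
    if HTML5_DOCTYPE.match(body):
        return "text/html; charset=utf-8"
    return "text/html"


html5 = content_type_for(b"<!DOCTYPE html><title>x</title>")
html4 = content_type_for(b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">')
```

With this, html5 carries the charset parameter and html4 does not, so 
existing browsers get an explicit declaration for HTML 5 documents 
while everything else is served as before.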
-- 
leif halvard silli

Received on Tuesday, 2 June 2009 00:49:23 UTC