Re: Auto-detect and encodings in HTML5

Maciej Stachowiak On 09-06-02 00.38:
> On Jun 1, 2009, at 2:09 PM, Geoffrey Sneddon wrote:
>> On 1 Jun 2009, at 19:37, Larry Masinter wrote:
>>> New behavior: IF you see, say, <doctype html5> THEN assume default 
>>> charset
>>> is UTF8, rather than applying heuristics to guess charset.
> The result of not applying heuristics would be Windows-1252 - that is 
> the default except in the rare cases where the heuristics find a match. 
> I still don't understand why disabling the heuristics should be tied to 
> changing the default from Windows-1252 to UTF-8.

Is it the choice of UTF-8 as default you don't understand? If so, 
then I'd like to quote the "Support World Languages" principle.

>> If you see it how? You need to have read the encoded string to see 
>> such a string.
>>> Yes, supplying explicit charset is preferable, but what would break
>>> if such a new rule were supplied?
>> The problem is that any HTML 5 content served as text/html will be 
>> treated as Windows-1252 by all existing user agents and UTF-8 by new 
>> ones, which is problematic and will lead to problems (as people tend 
>> to only test in one browser, and if it works in one browser assume it 
>> works everywhere) as it is hence inconsistent.
> Good point. The Degrade Gracefully design principle says:
> "On the World Wide Web, authors are often reluctant to use new language 
> features that cause problems in older user agents, or that do not 
> provide some sort of graceful fallback. HTML 5 document conformance 
> requirements should be designed so that Web content can degrade 
> gracefully in older or less capable user agents, even when making use of 
> new elements, attributes, APIs and content models."
> Making the doctype switch the default from Windows-1252 to UTF-8 will 
> mean only ASCII documents work correctly in both older and newer user 
> agents, unless the author explicitly declares an encoding. If you have 
> to explicitly declare a UTF-8 charset to get UTF-8, then nothing has 
> been gained for careful authors. But unaware authors face an unexpected 
> hazard.
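
To make that hazard concrete, here is a minimal Python sketch. The function name and the sample pages are my own illustration, not anything from the draft. It decodes the same bytes once with the old default (Windows-1252) and once with the proposed new default (UTF-8), and reports whether both kinds of user agent would show the same text.

```python
def decodes_identically(raw: bytes) -> bool:
    """Return True if Windows-1252 and UTF-8 decoding give the same text.

    Hypothetical helper: it stands in for an "old" user agent defaulting
    to Windows-1252 and a "new" one defaulting to UTF-8.
    """
    legacy = raw.decode("cp1252", errors="replace")  # old default
    modern = raw.decode("utf-8", errors="replace")   # proposed default
    return legacy == modern


# A pure-ASCII page: the two decodings agree.
ascii_page = "<!DOCTYPE html><p>Hello</p>".encode("utf-8")

# "Blåbær" saved as UTF-8: an old user agent shows mojibake instead.
utf8_page = "<!DOCTYPE html><p>Bl\u00e5b\u00e6r</p>".encode("utf-8")

# "café" saved as Windows-1252: a new user agent hits an invalid byte.
cp1252_page = "<!DOCTYPE html><p>caf\u00e9</p>".encode("cp1252")

for page in (ascii_page, utf8_page, cp1252_page):
    print(decodes_identically(page))
```

Only the ASCII page comes out the same under both defaults, which is the point made in the quoted paragraph.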

There is one aspect that you are - again - forgetting, and that is 
authoring tools and web servers.

If conforming authoring tools had to default to UTF-8 whenever 
someone chooses to create an HTML 5 document (much the same way that 
XML defaults to UTF-8/-16), then that would be a bonus, a 
simplification and a _motivation_ for using HTML 5.
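
As a rough sketch of what such an authoring tool default could look like (the function name and template are hypothetical, and I am assuming the tool also writes an explicit `<meta charset>` declaration so that older user agents are not left guessing):

```python
from html import escape


def new_html5_document(title: str, body_html: str) -> bytes:
    """Build a new HTML 5 document and return it encoded as UTF-8.

    Hypothetical authoring-tool helper: UTF-8 is the only encoding it
    ever produces, and the document declares that encoding itself.
    """
    text = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>\n"
    )
    return text.encode("utf-8")


doc = new_html5_document("Bl\u00e5b\u00e6r", "<p>Syltet\u00f8y</p>")
print(doc.decode("utf-8"))
```

With a default like this, the author never has to pick an encoding, and the saved bytes always match what the document declares.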

The next level should be that web servers default to sending a 
charset header which says "UTF-8" whenever they see the HTML 5 

Thus we could leave the Web browser behaviour as drafted, but 
require UTF-8 as default from servers and authoring tools.
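
On the server side, the idea might be sketched roughly as follows. This is only an illustration under my own assumptions: the function name is made up, and I assume that an HTML 5 document is recognised by a leading `<!DOCTYPE html>` and that an explicitly configured charset always wins.

```python
import re
from typing import Optional

# Assumption: an "HTML 5 document" is one that starts with <!DOCTYPE html>,
# optionally preceded by a UTF-8 BOM and ASCII whitespace.
HTML5_DOCTYPE = re.compile(
    rb"^(?:\xef\xbb\xbf)?\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE
)


def content_type_for(raw: bytes, configured_charset: Optional[str] = None) -> str:
    """Pick a Content-Type header value for a text/html response.

    Hypothetical server hook: an explicit configuration is respected,
    otherwise HTML 5 documents get UTF-8 and everything else is sent
    without a charset parameter, as today.
    """
    if configured_charset:
        return f"text/html; charset={configured_charset}"
    if HTML5_DOCTYPE.match(raw[:1024]):
        return "text/html; charset=UTF-8"
    return "text/html"


html5 = b"<!DOCTYPE html><p>x</p>"
html4 = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><p>x</p>'

print(content_type_for(html5))
print(content_type_for(html4))
print(content_type_for(html5, "ISO-8859-1"))
```

Note that this only labels the file. It does not check that the bytes really are UTF-8, which is why the authoring tool default matters as well.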
leif halvard silli

Received on Tuesday, 2 June 2009 00:49:21 UTC