On Jun 1, 2009, at 2:09 PM, Geoffrey Sneddon wrote:
>
> On 1 Jun 2009, at 19:37, Larry Masinter wrote:
>
>> New behavior: IF you see, say, <doctype html5> THEN assume default
>> charset
>> is UTF8, rather than applying heuristics to guess charset.
The result of not applying heuristics would be Windows-1252 - that is
the default except in the rare cases where the heuristics find a
match. I still don't understand why disabling the heuristics should be
tied to changing the default from Windows-1252 to UTF-8.
>
> If you see it how? You need to have read the encoded string to see
> such a string.
>
>> Yes, supplying explicit charset is preferable, but what would break
>> if such a new rule were supplied?
>
> The problem is that any HTML 5 content served as text/html will be
> treated as Windows-1252 by all existing user agents and UTF-8 by new
> ones, which is problematic and will lead to problems (as people tend
> to only test in one browser, and if it works in one browser assume
> it works everywhere) as it is hence inconsistent.
Good point. The Degrade Gracefully design principle says:
"On the World Wide Web, authors are often reluctant to use new
language features that cause problems in older user agents, or that do
not provide some sort of graceful fallback. HTML 5 document
conformance requirements should be designed so that Web content can
degrade gracefully in older or less capable user agents, even when
making use of new elements, attributes, APIs and content models."
Making the doctype switch the default from Windows-1252 to UTF-8 will
mean only ASCII documents work correctly in both older and newer user
agents, unless the author explicitly declares an encoding. If you have
to explicitly declare a UTF-8 charset to get UTF-8, then nothing has
been gained for careful authors, while unaware authors face an
unexpected hazard: a page that looks right in the one browser they
test may be garbled in another.
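
To make the hazard concrete, here is a minimal Python sketch of what
happens when the same undeclared bytes are decoded under each default.
The helper name decode_both and the sample pages are hypothetical, made
up for illustration; real user agents apply more steps (BOM checks,
meta prescan, heuristics) than this shows.

```python
# Illustrative only: compare the current default (Windows-1252) with the
# proposed doctype-triggered default (UTF-8) on the same raw bytes.
# decode_both is a hypothetical helper, not part of any browser or spec.

def decode_both(raw):
    """Return (legacy, proposed) decodings of raw bytes with no declared charset."""
    legacy = raw.decode("windows-1252", errors="replace")  # older user agents
    proposed = raw.decode("utf-8", errors="replace")       # user agents under the new rule
    return legacy, proposed

# Hypothetical sample pages, none of which declares an encoding.
ascii_page = "<!DOCTYPE html><p>plain text</p>".encode("ascii")
utf8_page = "<!DOCTYPE html><p>café</p>".encode("utf-8")
cp1252_page = "<!DOCTYPE html><p>café</p>".encode("windows-1252")

for name, page in [("ascii", ascii_page),
                   ("utf-8", utf8_page),
                   ("windows-1252", cp1252_page)]:
    legacy, proposed = decode_both(page)
    status = "same" if legacy == proposed else "DIFFERENT"
    print(f"{name:>12}: {status}  legacy={legacy!r}  proposed={proposed!r}")
```

Only the ASCII page decodes identically under both defaults. The UTF-8
page is mangled by the older default, and the Windows-1252 page is
mangled by the new one.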
Regards,
Maciej