Re: Auto-detect and encodings in HTML5 from Geoffrey Sneddon on 2009-06-01 (public-html@w3.org from June 2009)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Mon, 1 Jun 2009 22:09:17 +0100
To: Larry Masinter <masinter@adobe.com>
Cc: Anne van Kesteren <annevk@opera.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-Id: <02AFC71A-9E5D-435B-833C-ABCE9FC2D666@googlemail.com>

On 1 Jun 2009, at 19:37, Larry Masinter wrote:

> New behavior: IF you see, say, <doctype html5> THEN assume default
> charset is UTF8, rather than applying heuristics to guess charset.

See it how? To see such a string you must already have decoded the
bytes, which means you have already had to pick an encoding.
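
A minimal sketch of that circularity, assuming the marker is the
hypothetical <doctype html5> string from the quoted message (the
function name is made up for illustration). A byte-level ASCII scan
finds the marker only when the document is in an ASCII-compatible
encoding, so the scan itself presupposes something about the encoding:

```python
# Hypothetical marker taken from the quoted proposal; not a real doctype.
MARKER = "<doctype html5>"


def marker_visible_to_ascii_scan(encoding: str) -> bool:
    """Return True if scanning the raw bytes for the ASCII form of
    MARKER would find it when the document uses `encoding`."""
    raw = MARKER.encode(encoding)
    return MARKER.encode("ascii") in raw


for enc in ("utf-8", "windows-1252", "utf-16-le"):
    # The ASCII-compatible encodings expose the marker byte for byte;
    # UTF-16 interleaves NUL bytes, so the scan misses it.
    print(enc, marker_visible_to_ascii_scan(enc))
```

In UTF-16 the same characters are there, but a scanner that has not yet
chosen an encoding cannot see them.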

> Yes, supplying explicit charset is preferable, but what would break
> if such a new rule were supplied?

The problem is that any HTML 5 content served as text/html without an
explicit charset will be treated as Windows-1252 by all existing user
agents and as UTF-8 by new ones. That inconsistency will cause
problems, as people tend to test in only one browser and assume that
if it works there it works everywhere.
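
As a rough illustration (the sample text is made up, and the two
variables simply stand in for the two kinds of user agent), here are
the same bytes read under each default:

```python
# Made-up sample content; any non-ASCII character shows the effect.
text = "café"
raw = text.encode("utf-8")  # the bytes as the author saved them

# Stand-in for a user agent that defaults to Windows-1252.
as_windows_1252 = raw.decode("windows-1252")

# Stand-in for a user agent that defaults to UTF-8.
as_utf8 = raw.decode("utf-8")

print(as_windows_1252)  # cafÃ©
print(as_utf8)          # café
```

An author who tests only in the second kind of browser would never see
the mojibake the first kind displays.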

--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Monday, 1 June 2009 21:10:09 UTC