[whatwg] Default encoding to UTF-8?

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 6 Dec 2011 05:54:03 +0100
Message-ID: <20111206055403247098.b406cadc@xn--mlform-iua.no>
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011:
> On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

>> I said I agreed with him that Faruk's solution was not good. However, I
>> would not be against treating <DOCTYPE html> as a 'default to UTF-8'
>> declaration
> This might work, if there hasn't been too much cargo-culting yet.  Data 
> urgently needed!

Yeah, it would be a pity if it had already become a widespread 
cargo-cult to, all at once, use the HTML5 doctype without using UTF-8 
*and* without any encoding declaration, *and* thus effectively 
rely on the default locale encoding ... Who has a data corpus? 
Henri, as Validator.nu developer?

This change would involve adding one more step to the HTML5 parser's 
encoding sniffing algorithm. [1] The question then is at which point, 
upon seeing the HTML5 doctype, the default to UTF-8 ought to kick in 
in order to be useful. It seems it would have to happen after the 
processing of the explicit metadata (steps 1 to 5) but before the 
last three steps, i.e. steps 6, 7 and 8:

Step 6: 'if the user agent has information on the likely encoding'
Step 7: UA 'may attempt to autodetect the character encoding'
Step 8: 'implementation-defined or user-specified default'

The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure 
that steps 6 to 8 do not happen.
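
To make the ordering concrete, here is a rough Python sketch of where 
the extra step would sit. It is only an illustration, not spec text: 
the function name, its parameters, the doctype regex and the 
windows-1252 fallback are all my own assumptions, and steps 1 to 5 are 
collapsed into a single 'explicit' argument.

```python
import re
from typing import Callable, Optional

# Assumed pattern for the HTML5 doctype: case-insensitive, optional leading
# whitespace. Legacy doctypes with a PUBLIC identifier do not match.
HTML5_DOCTYPE = re.compile(rb"^\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def sniff_encoding(
    head: bytes,
    explicit: Optional[str] = None,
    likely: Optional[str] = None,
    autodetect: Optional[Callable[[bytes], Optional[str]]] = None,
    locale_default: str = "windows-1252",  # assumed locale fallback
) -> str:
    """Return the encoding to use for a document whose first bytes are `head`."""
    # Steps 1 to 5, collapsed: any explicit information (user override,
    # HTTP charset, BOM, <meta> prescan) has already been resolved by the
    # caller and is passed in as `explicit`.
    if explicit is not None:
        return explicit

    # Proposed extra step: HTML5 doctype and nothing explicit -> UTF-8.
    if HTML5_DOCTYPE.match(head):
        return "utf-8"

    # Step 6: the user agent has information on the likely encoding.
    if likely is not None:
        return likely

    # Step 7: the user agent may attempt to autodetect the encoding.
    if autodetect is not None:
        guessed = autodetect(head)
        if guessed is not None:
            return guessed

    # Step 8: implementation-defined or user-specified default.
    return locale_default
```

With this ordering, an explicit declaration still wins over the 
doctype, while a page with the HTML5 doctype and no declaration never 
reaches the locale-dependent steps.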

[1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
Leif H Silli
