Re: Auto-detect and encodings in HTML5 from Henri Sivonen on 2009-05-28 (www-international@w3.org from April to June 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 28 May 2009 10:42:35 +0300
To: Jungshik SHIN (신정식) <jshin1987+w3@gmail.com>
Cc: Erik van der Poel <erikv@google.com>, Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>, Simon Montagu <smontagu@smontagu.org>, ap@webkit.org
Message-Id: <BC222B84-D39F-411B-AF80-685CF2D6F07F@iki.fi>

On May 27, 2009, at 21:37, Jungshik SHIN (신정식) wrote:

> 2009/5/27 Erik van der Poel <erikv@google.com>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.

When the document contains non-ASCII bytes, an explicit encoding
declaration (or a BOM) is required for document conformance. But
implementations still need to handle documents that violate the
requirement.
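
To make the conformance rule concrete, here is a minimal Python sketch
of it. The function names and the `declared_charset` parameter are
hypothetical, chosen for illustration; this is not how any validator or
browser implements the check.

```python
import codecs
from typing import Optional

# Byte order marks treated as an implicit declaration (assumption: UTF-8/UTF-16 only).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte lies outside the ASCII range."""
    return any(b >= 0x80 for b in data)


def violates_declaration_rule(data: bytes, declared_charset: Optional[str]) -> bool:
    """Return True if the document has non-ASCII bytes but neither an
    explicit declaration (HTTP or meta, passed in by the caller) nor a BOM."""
    if declared_charset or data.startswith(BOMS):
        return False
    return has_non_ascii(data)
```

A consumer would still have to pick some encoding when this returns
True, which is the case the rest of this thread is about.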

> As far as I know (Simon will correct me if I'm not up-to-date),  
> Firefox's charset autodetector kicks in only when both of the  
> following two conditions are satisfied:
>
> 1) Auto-detection is turned on explicitly by a user. It's OFF by  
> default
> 2) No charset is specified anywhere.
>
> Even if it's turned ON, Firefox does honor the explicitly specified  
> charset (http or meta).

This currently also holds true in Gecko builds with the HTML5 parser
enabled. The difference is how far into the document the heuristic
detector looks when it does kick in.
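
The precedence described above can be sketched roughly as follows, in
Python. This is an illustration of the two conditions, not Gecko's
code: `choose_encoding`, the `detect` callback, and the `windows-1252`
fallback are all assumptions made for the example.

```python
from typing import Callable, Optional


def choose_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
    detector_enabled: bool = False,
    detect: Optional[Callable[[bytes], Optional[str]]] = None,
    fallback: str = "windows-1252",  # hypothetical default, not from the thread
) -> str:
    # An explicitly specified charset is always honored; HTTP wins over meta.
    if http_charset:
        return http_charset
    if meta_charset:
        return meta_charset
    # The heuristic detector runs only if the user turned it on
    # and no charset was specified anywhere.
    if detector_enabled and detect is not None:
        guess = detect(data)
        if guess:
            return guess
    return fallback
```

For example, `choose_encoding(b"...", meta_charset="euc-kr",
detector_enabled=True, detect=some_detector)` never calls the detector.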

> I'm tempted to go a step further to forbid ISO-2022-XX and GB-HZ as  
> well, but there might be a compatibility concern here. However, if  
> that prohibition is triggered by HTML5 doctype, it should be ok.

The decoder needs to be instantiated before the doctype is parsed.  
Changing this would be a pain. Let's not make the encoding stuff  
dependent on doctype.

> There are some web sites with meta tags deeply buried ( > 512 bytes  
> from the beginning). Webkit even has a layout test for this  
> (currently, it scans the first 1024 bytes).

The HTML5 parsing algorithm deals with the late <meta> case by causing
a renavigation to the document. The question is how far the prescan
should look. Philip's data shows that diminishing returns have kicked
in even before 512 bytes.
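
To show what the byte limit means in practice, here is a deliberately
simplified Python sketch. It is a regex over the first `limit` bytes,
not the prescan algorithm from the HTML5 draft (for instance, it does
not skip comments), and the `prescan` name and regex are assumptions
for illustration.

```python
import re
from typing import Optional

# Matches both <meta charset=...> and the http-equiv/content form (simplified).
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def prescan(data: bytes, limit: int = 512) -> Optional[str]:
    """Look for a meta charset declaration in the first `limit` bytes only."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# A declaration buried behind more than 512 bytes of other content.
late = b"<!DOCTYPE html><!--" + b"x" * 600 + b'--><meta charset="euc-kr">'
print(prescan(late, limit=512))   # None: the declaration is past the limit
print(prescan(late, limit=1024))  # euc-kr
```

With a 512-byte limit the late declaration is missed by the prescan and
has to be picked up later by the parser; with 1024 it is found up front.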

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 28 May 2009 07:43:17 UTC