- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 28 May 2009 10:42:35 +0300
- To: Jungshik SHIN (신정식) <jshin1987+w3@gmail.com>
- Cc: Erik van der Poel <erikv@google.com>, Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>, Simon Montagu <smontagu@smontagu.org>, ap@webkit.org
On May 27, 2009, at 21:37, Jungshik SHIN (신정식) wrote:

> 2009/5/27 Erik van der Poel <erikv@google.com>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.

When the document has non-ASCII bytes, an explicit encoding declaration (or BOM) is required for document conformance. But implementations still need to deal with violations of that requirement.

> As far as I know (Simon will correct me if I'm not up-to-date),
> Firefox's charset autodetector kicks in only when both of the
> following two conditions are satisfied:
>
> 1) Auto-detection is turned on explicitly by a user. It's OFF by
> default
> 2) No charset is specified anywhere.
>
> Even if it's turned ON, Firefox does honor the explicitly specified
> charset (http or meta).

This also holds true in the HTML5 parser-enabled Gecko builds currently. The difference is how far the heuristic detector looks when it does kick in.

> I'm tempted to go a step further to forbid ISO-2022-XX and GB-HZ as
> well, but there might be a compatibility concern here. However, if
> that prohibition is triggered by HTML5 doctype, it should be ok.

The decoder needs to be instantiated before the doctype is parsed. Changing this would be a pain. Let's not make the encoding stuff dependent on doctype.

> There are some web sites with meta tags deeply buried (> 512 bytes
> from the beginning). Webkit even has a layout test for this
> (currently, it scans the first 1024 bytes).

The HTML5 parsing algorithm deals with the late <meta> case by causing a renavigation to the document. The question is how far the prescan should look. Philip's data shows the diminishing returns have kicked in even before 512.
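To make the prescan-depth question concrete, here is a rough Python sketch of a bounded prescan. It is only an illustration under stated assumptions, not the HTML5 prescan algorithm (which is a byte-level state machine) and not what Gecko or WebKit actually run. The function name `sniff_encoding`, the `prescan_limit` parameter, the regular expression, and the HTTP-then-BOM-then-meta ordering are all hypothetical choices made for this example.

```python
import re

# Hypothetical, heavily simplified pattern: finds charset=... inside a <meta> tag.
# The real prescan tokenizes attributes byte by byte and handles many more cases.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

# Byte order marks recognized by this sketch.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def sniff_encoding(data, http_charset=None, prescan_limit=512):
    """Return (encoding or None, source) for the start of a document.

    Assumed precedence for this sketch: HTTP charset, then BOM, then a
    <meta> declaration found within the first `prescan_limit` bytes.
    """
    if http_charset:
        return http_charset.lower(), "http"
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "bom"
    # Only the first prescan_limit bytes are examined; a later <meta>
    # is invisible here and would have to be handled by the parser.
    match = _META_CHARSET.search(data[:prescan_limit])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return None, "none"


# A <meta> buried past 512 bytes is missed at 512 but found at 1024.
late = b"<!DOCTYPE html><html><head>" + b" " * 600 + b'<meta charset="Shift_JIS">'
print(sniff_encoding(late))
print(sniff_encoding(late, prescan_limit=1024))
```

When the sketch returns `None`, the decoder still has to be instantiated with some fallback before any markup (including the doctype) is parsed, and a `<meta>` found later by the real parser is what triggers the renavigation mentioned above.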
-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 28 May 2009 07:43:17 UTC