Re: Running heuristic encoding detection from Henri Sivonen on 2011-02-16 (public-html@w3.org from February 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 16 Feb 2011 03:10:45 -0800 (PST)
To: HTML WG <public-html@w3.org>
Message-ID: <1325442227.359209.1297854644971.JavaMail.root@cm-mail03.mozilla.org>
> We only enable it for certain locales based on some combination of
> domain name, Content-Language, and encoding of the referring document.

Wow. Does any other browser look at the domain name or Content-Language? Is Content-Language used enough (when not using the charset parameter in Content-Type!) and correctly enough to be useful?

FWIW, in Firefox the encoding of a same-origin referring document (incl. frame parent) takes precedence over heuristic detection.

> When enabled it looks at the entire resource. I am not sure what our
> timeout strategy is if any. This is an area we would like to improve
> at some point. It would be nice if more were standardized or at least
> documented.

In the spirit of documentation, for Firefox 4 the following was done:

 * The "Universal" heuristic detector is no longer enabled by default in any locale (AFAICT). It was previously enabled by default for the Swedish UI locale!

 * Locale-specific heuristic detectors are enabled by default for CJK and Cyrillic UI locales. That is, the Japanese UI locale has a Japanese-specific detector, the Russian UI locale has a Russian-specific detector, the Ukrainian UI locale has a Ukrainian-specific detector, etc.

 * The user can change the settings for which heuristic detector is enabled (if any).

 * Regardless of whether a heuristic detector is enabled, if the document is at least 30 bytes long, up to the first 1024 bytes are sniffed for being UTF-16LE- or UTF-16BE-encoded Basic Latin-only text. This makes Firefox 4 "work" with unlabeled BOMless Basic Latin-only documents that "work" in IE because IE drops zero bytes before tokenizing. At the same time, it carefully avoids making unlabeled BOMless UTF-16BE/LE pages that don't already "work" in IE "work" (to avoid proliferating unlabeled BOMless UTF-16 badness). The unlabeled BOMless Basic Latin-only documents also "work" in Opera and Firefox 3.6.x because those browsers heuristically detect BOMless UTF-16 (even non-Basic Latin).

 * If a heuristic detector is enabled, there's no higher-priority encoding source (incl. referring doc), and the HTML document comes from a non-GET HTTP request (POST in particular), the first 1024 bytes are shown to the detector and the detector is told that that's all it's going to get. If the detector makes a decision, that decision is committed to.

 * If a heuristic detector is enabled, there's no higher-priority encoding source (incl. referring doc), and the HTML document comes from a GET HTTP request, the first 1024 bytes, plus whatever other bytes are in the buffer that completes the first 1024 bytes, are shown to the heuristic detector. If that's enough for the detector to make a decision, the encoding is used tentatively without reloading the document in order to use it. A late <meta> can reverse the decision and cause a reload. If that's not enough for the detector to make a decision, incremental parsing and rendering starts, but the bytes arriving from the network are shown to the heuristic detector before pushing them to the tokenizer. If the heuristic detector makes a decision that differs from the tentative encoding already in use (the user's default encoding), the page is reloaded with the detected encoding. (I.e. there's a flash of misdecoded content, potentially scripts running twice, etc.) There's still the opportunity for a late <meta> to cause yet another reload.
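
To make the BOMless UTF-16 sniffing item above more concrete, here is a rough Python sketch. It is not the Firefox code. The 30-byte minimum and the 1024-byte limit come from the list above; the function name and the exact byte test are my assumptions for illustration (each 16-bit unit must be one non-zero byte below 0x80 paired with one zero byte, and a trailing odd byte is ignored).

```python
from typing import Optional

MIN_LENGTH = 30      # from the list above: document must be at least 30 bytes
SNIFF_LIMIT = 1024   # from the list above: at most the first 1024 bytes are examined


def sniff_bomless_utf16_basic_latin(data: bytes) -> Optional[str]:
    """Guess UTF-16LE/BE for BOMless, Basic Latin-only text (hypothetical helper).

    Assumption: "Basic Latin-only" is modeled as every 16-bit code unit being
    a non-zero byte below 0x80 paired with a zero byte.
    """
    if len(data) < MIN_LENGTH:
        return None

    window = data[:SNIFF_LIMIT]
    # Assumption: ignore a trailing odd byte so that only whole units are checked.
    window = window[: len(window) - (len(window) % 2)]

    even_bytes = window[0::2]  # first byte of each 16-bit unit
    odd_bytes = window[1::2]   # second byte of each 16-bit unit

    def all_zero(bs: bytes) -> bool:
        return all(b == 0 for b in bs)

    def all_basic_latin(bs: bytes) -> bool:
        return all(0 < b < 0x80 for b in bs)

    if all_basic_latin(even_bytes) and all_zero(odd_bytes):
        return "UTF-16LE"  # low byte first, high byte zero
    if all_zero(even_bytes) and all_basic_latin(odd_bytes):
        return "UTF-16BE"  # high byte zero, low byte second
    return None
```

With this test, a single non-Basic Latin character in the window makes the function return None, which matches the intent of not making other unlabeled BOMless UTF-16 pages "work".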

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 16 February 2011 11:11:18 UTC