Running heuristic encoding detection from Henri Sivonen on 2011-01-03 (public-html@w3.org from January 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 3 Jan 2011 12:36:14 +0200
To: HTML WG <public-html@w3.org>
Message-Id: <BE280A04-8451-44ED-940B-1DD0FA4124CA@iki.fi>

When a heuristic encoding detector is enabled, Firefox < 4 runs the detector alongside the parser even after script execution and incremental display have started and potentially reloads the page. This is an unsatisfactory solution, because it means that scripts with side effects can run twice depending on user-exposed configuration parameters (whether detection has been enabled). (Well, strictly speaking what HTML5 prescribes about <meta> also depends on user-exposed configuration because a late <meta>-based reload is unnecessary when the user's default encoding happens to match the encoding declared in <meta>.)

The current development versions of Firefox 4 run the heuristic detector on the first 1024 bytes of input before starting the parse. Since the <meta> prescan also runs on the same 1024 bytes, enabling or disabling the heuristic detector doesn't cause worse effects on script execution than the user choosing a different default encoding.
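
To make the "decide before the parse starts" idea concrete, here is a rough Python sketch. It is not Firefox code. Only the 1024-byte limit comes from the description above; the function names, the regex-based <meta> scan, the toy UTF-8-only detector and the order in which the two are consulted are all assumptions made for illustration.

```python
import codecs
import re

PRESCAN_LIMIT = 1024  # bytes examined before parsing starts (figure from the message)

# Assumption: a crude regex stands in for the real HTML5 <meta> prescan.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def meta_prescan(head: bytes):
    """Return the encoding label declared in a <meta> within `head`, or None."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


def toy_detector(head: bytes):
    """Hypothetical detector: say "utf-8" only for valid UTF-8 with non-ASCII bytes."""
    if all(byte < 0x80 for byte in head):
        return None  # pure ASCII gives no evidence either way
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off at the 1024-byte edge
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return None
    return "utf-8"


def choose_encoding(data: bytes, default: str, detector_enabled: bool) -> str:
    """Commit to an encoding using only the first PRESCAN_LIMIT bytes."""
    head = data[:PRESCAN_LIMIT]
    declared = meta_prescan(head)
    if declared:
        return declared
    if detector_enabled:
        guess = toy_detector(head)
        if guess:
            return guess
    return default
```

Because both inputs to the decision are confined to the same prefix, toggling detector_enabled can only change which encoding the parser starts with; it cannot trigger a reload after scripts have already run.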

In my cursory testing a couple of years ago, I concluded that running the detector on the first 1024 bytes worked well enough in the real world. Now it has been suggested that it doesn't work well enough:
https://bugzilla.mozilla.org/show_bug.cgi?id=620106

At present, I don't have enough data to quantify the merit of the allegation of 1024 bytes not being enough, but for now I have my doubts given that my earlier cursory testing suggested it was enough.

Interestingly, information provided by the bug reporter (https://bugzilla.mozilla.org/show_bug.cgi?id=620106#c12) suggests that Opera, Chrome and IE8 modify their incrementalism behavior when the heuristic detector is enabled. Before I write a set of test cases and start poking the other browsers, I was wondering if someone on this list can already confirm the nature of the behavior.

Do the other browsers indeed not start parsing until the heuristic detector has committed to an encoding?
Is there a timeout or a maximum number of bytes that acts as a cutoff, after which the browser commits to an encoding even though the heuristic detector hasn't yet made its decision?
Do other browsers default to heuristic detection enabled in any locale? (Since having different buffering behavior would likely lead to different perf characteristics, it would seem interesting to ship with different perf on a per-locale basis.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 3 January 2011 10:36:51 UTC