Putting limits on formatting element growth from Henri Sivonen on 2010-09-13 (public-html@w3.org from September 2010)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 13 Sep 2010 15:54:14 +0300
To: HTML WG <public-html@w3.org>
Message-Id: <DB00A67E-42A5-4FCE-80FF-007E3FEC01B8@iki.fi>

Currently, the way formatting elements is handled allows a site to cause the parser output to contain a dramatically larger number of elements than there are start tags. This isn't a problem most of the time. However, there are bad cases out there on the Web and it would be nice to make them not hang browsers and maybe do so in a consistent way across implementations.

So far, people have expressed the most concern about the adoption agency algorithm, because it has two nested loops, which suggests the time complexity can skyrocket. I have not come across Web pages where the AAA is a problem. However, WebKit's (new) implementation limits both loops to 10 iterations (that is, the inner loop is limited to 10 iterations per each entry to the loop). I think it would be useful to run an instrumented parser across a large dataset of Web pages to find out how many iterations of the outer loop and how many of the inner loop are typical and then maybe even standardize iteration limits for the loops.

A problem that I'm actually more concerned about is the list of active formatting elements getting longer and longer for each <font> tag on pages that open <font> but don't close them (or close them so late that problems occur). This leads to an excessive number of DOM nodes when reconstructing active formatting elements reconstructs and ever-increasing number of nodes. I'm inclined to think that we should make the parsing algorithm assume that if there have been more than n (a carefully chosen n, of course) entries in the list of active formatting elements after the last marker (or the start of the list if there's no marker), the parser is unlikely to ever see end tags for the formatting elements and should instead take defensive measures. I think an appropriate defensive measure would be modifying the handling of start tags that end up on the list to first search from the last marker (or the start of the list if there's no marker) forward for an entry that has the same tag name as the current token and attributes that are identical to the current token. If there's a match, remove the node that matches from the list of active formatting elements and the stack. (Then put the node created for the current token on the stack and on the list.)

This way, the next time the active formatting elements are reconstructed, only the most recent <font> tags would be reconstructed, since the earlier redundant <font>s would no longer be on the list of active formatting elements.

Then there's the issue of whitespace-only text nodes uselessly reconstructing the list of active formatting elements. I think changing this would be potentially more disruptive in terms of the code and test cases involved. Also, making the most common tree builder mode ("in body") sensitive to space characters vs. non-space characters would be a performance problem if implemented naively not optimized properly (to go on the fast track once one non-space character has been seen until a non-character token is seen). Still, having seen a page like http://saintjohnchurchmiddletown.com/default.aspx (warning! may slow down your browsing considerably while loading) in the wild, I have to wonder if we still should change the spec not to reconstruct the list of active formatting elements on space characters.

Opinions?

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 13 September 2010 12:54:50 UTC