Re: Putting limits on formatting element growth from Henri Sivonen on 2010-09-14 (public-html@w3.org from September 2010)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 14 Sep 2010 07:39:26 -0700 (PDT)
To: HTML WG <public-html@w3.org>
Message-ID: <717158777.567800.1284475166537.JavaMail.root@cm-mail03.mozilla.org>
> A problem that I'm actually more concerned about is the list of active
> formatting elements getting longer and longer for each <font> tag on
> pages that open <font> but don't close them (or close them so late
> that problems occur). This leads to an excessive number of DOM nodes
> when reconstructing active formatting elements reconstructs and
> ever-increasing number of nodes. I'm inclined to think that we should
> make the parsing algorithm assume that if there have been more than n
> (a carefully chosen n, of course) entries in the list of active

> formatting elements after the last marker (or the start of the list if
> there's no marker), the parser is unlikely to ever see end tags for
> the formatting elements and should instead take defensive measures. I
> think an appropriate defensive measure would be modifying the handling
> of start tags that end up on the list to first search from the last
> marker (or the start of the list if there's no marker) forward for an
> entry that has the same tag name as the current token and attributes
> that are identical to the current token. If there's a match, remove
> the node that matches from the list of active formatting elements and
> the stack. (Then put the node created for the current token on the
> stack and on the list.)
> 
> This way, the next time the active formatting elements are
> reconstructed, only the most recent <font> tags would be
> reconstructed, since the earlier redundant <font>s would no longer be
> on the list of active formatting elements.

I implemented the above suggestion for experimentation. Having experimented with it, I think we shouldn't adopt the above suggestion to the spec as is. It needs some refinement.

The first problem is that making the duplicate pruning kick in only once there are n items after the last marker is that it allows n duplicates on the list of active formatting elements, and it's pointless to allow many duplicates. For avoiding duplicates, it makes more sense to count duplicates. The obvious thing to try is to prevent duplicates (after the last marker) altogether as with <a>. The problem with doing that while keeping the suggestion otherwise intact is that there are legitimate uses for nesting <em> in an <em> (or <i> in an <i>): nested emphasis where the inner emphasis unitalicizes text. I modified things so that two identical elements (same name and same attributes) are allowed after the last marker on the list of active formatting elements.

This modifications is already pretty nice, but it's not quite satisfactory, because it doesn't adhere to the principle of least surprise. E.g. <p><b><b><b>t</b>t</b>t</b>t already starts behaving in a weird way, because the first <b> has gotten removed from the stack, so its would-be child text becomes a child of the p. Ideally, the limits would only apply to the list of formatting elements so that fewer formatting elements get reconstructed but without removing elements from the stack when we are dealing with real start tags and not reconstructions. The obvious step to take is not to remove the old duplicate formatting element from the stack when removing it from the list.

This poses a new problem: The AAA is unable to close elements that are on the stack but not on the list. Fortunately, fixing this appears to be relatively simple. In step #1 of the AAA, if the first "If there is no such node" check is true, abort the AAA and process the token according to the rules for "any other end tag token".

With this modification, real tags behave in an unsurprising way, but the size of the list of active formatting elements doesn't grow when the page has redundant formatting tags, so substantially fewer formatting elements get reconstructed. This makes even pathological cases like http://saintjohnchurchmiddletown.com/default.aspx fast even when reconstructing the active formatting elements on space characters. (I still thing we should *also* not reconstruct on space.)

This solution is so unobtrusive that it only makes one html5lib test case fail:
<cite><b><cite><i><cite><i><cite><i><div>X</b>TEST 
Even test cases that test this stuff don't usually contain more than 2 nested identical elements!

It's possible that I'm missing something and this would break things horribly. Also, I realize it's very late in the development cycle of both implementations and the spec itself. However, if I'm not missing some big flaw in this, I think it would be worthwhile to implement this modified scheme for limiting the growth of the list of active formatting elements.

(With this scheme, it's still possible to DoS the HTML5 parser by generating a lot of formatting elements with unique attributes. The purpose of this scheme is to make real-world repetition work faster.)

What do other implementors think?

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 14 September 2010 14:40:00 UTC