[whatwg] parsing nested forms from Tommy Thorsen on 2008-11-06 (public-whatwg-archive@w3.org from November 2008)

From: Tommy Thorsen <tommy@kvaleberg.com>
Date: Thu, 06 Nov 2008 10:28:59 +0100
Message-ID: <4912B8DB.3010706@kvaleberg.no>
Hi again!

Before I get to the real issue, I think I should give you a little bit 
of background. I'm working for a company which makes a web browser. 
We've been having some problems with our algorithm for parsing invalid 
HTML, so we decided to scrap the whole module and implement the 
algorithm exactly as outlined in the HTML5 spec. So far this has been a 
great success. We're already way better than we used to be, but there 
are some situations where the HTML5 parsing algorithm does not quite 
give us the result we expected.

Yesterday I noticed that we were not displaying the site 
http://bankrate.com correctly. The problem we had on that page boils 
down to the following markup:

<div id="firstdiv">
    A
    <div id="seconddiv">
        <form id="firstform">
            <div id="thirddiv">
                <form id="secondform"></form>
            </div>
        </form>
    </div>
    B
</div>

I'll walk you through it. Everything is normal until we reach the start 
tag for "secondform". It is ignored, since we're already in a form 
(the form element pointer points to "firstform"). Then we see the end 
tag which was meant for "secondform". We pop elements from the stack of 
open elements until we find a form element (which is "firstform"), 
popping off "thirddiv" in the process. The next token we get is the end 
div tag which was meant for "thirddiv". Since "thirddiv" is already 
gone, we pop "seconddiv" instead, and now we're sort of off-balance. The 
result is that A and B do not end up as children of the same div.
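
The walkthrough above can be reproduced with a small simulation. This is
only a rough sketch, not the spec's tree builder: it handles just div,
form and text tokens, and it reduces "has an element in scope" to "an
element with that tag name is somewhere on the stack". The names Node,
parse, parent_of_text and TOKENS are made up for this illustration.

```python
class Node:
    def __init__(self, tag, ident=None):
        self.tag = tag
        self.ident = ident
        self.children = []  # child Nodes and plain text strings


def parse(tokens):
    """Very reduced model of the "in body" handling described above."""
    root = Node("body", "body")
    stack = [root]          # stack of open elements
    form_pointer = None     # form element pointer
    for kind, *rest in tokens:
        current = stack[-1]
        if kind == "text":
            current.children.append(rest[0])
        elif kind == "start":
            tag, ident = rest
            if tag == "form" and form_pointer is not None:
                continue    # already in a form: nested start tag is ignored
            node = Node(tag, ident)
            current.children.append(node)
            stack.append(node)
            if tag == "form":
                form_pointer = node
        elif kind == "end":
            tag = rest[0]
            if tag == "form":
                form_pointer = None
            # Simplified scope check (assumption, see lead-in).
            if not any(n.tag == tag for n in stack):
                continue    # parse error; token ignored
            # Pop until an element with the token's tag name has been popped.
            while stack.pop().tag != tag:
                pass
    return root


def parent_of_text(node, text):
    """Return the id of the element that directly contains `text`."""
    for child in node.children:
        if isinstance(child, Node):
            found = parent_of_text(child, text)
            if found is not None:
                return found
        elif child == text:
            return node.ident
    return None


# Token stream for the markup above (whitespace text omitted).
TOKENS = [
    ("start", "div", "firstdiv"),
    ("text", "A"),
    ("start", "div", "seconddiv"),
    ("start", "form", "firstform"),
    ("start", "div", "thirddiv"),
    ("start", "form", "secondform"),
    ("end", "form"),
    ("end", "div"),
    ("end", "form"),
    ("end", "div"),
    ("text", "B"),
    ("end", "div"),
]

root = parse(TOKENS)
print(parent_of_text(root, "A"))
print(parent_of_text(root, "B"))
```

In this model A stays inside "firstdiv", while B is appended after
"firstdiv" has already been popped, so it lands one level too high.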

If any of you would like to see the effect this can have on a real page, 
you can use the parse.py script in html5lib. On a command line, use the 
following commands:

[tommy@tommyslaptop html]$ wget -k -O bankrate.html http://bankrate.com
[tommy@tommyslaptop html]$ /path/to/html5lib/python/parse.py bankrate.html > bankrate_parsed.html
[tommy@tommyslaptop html]$ firefox bankrate_parsed.html

I've applied a fix to our code which makes us handle this particular 
case better. I haven't tested it very thoroughly, but the change is to 
implement the 'An end tag whose tag name is "form"' section of the 
"in body" insertion mode as if it said:

------
An end tag whose tag name is "form"

    Let /node/ be the form element pointer.
    Set the form element pointer to null.

    If the stack of open elements does not have an element in scope with
    the same tag name as that of the token, then this is a parse error;
    ignore the token.

    Otherwise, run these steps:

       1. Generate implied end tags.
       2. If the current node is not an element with the same tag name
          as that of the token, then this is a parse error.
       3. Remove /node/ from the stack of open elements.
------
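
As a rough illustration, the proposed steps could look like this in
Python. It is a sketch under several assumptions, not our actual code:
stack entries are plain (tag, id) tuples, the scope check is reduced to
"a form is somewhere on the stack", IMPLIED_END_TAGS is a shortened
list, and the guard for a null /node/ is my own addition since the text
above does not say what to do in that case. The name end_form_tag is
made up for the example.

```python
IMPLIED_END_TAGS = {"dd", "dt", "li", "p"}  # shortened list (assumption)


def end_form_tag(stack, form_pointer):
    """Apply the proposed handling of an end tag whose tag name is "form".

    `stack` is the stack of open elements as a list of (tag, id) tuples
    and is modified in place. Returns (new_form_pointer, ignored).
    """
    node = form_pointer
    form_pointer = None

    # Simplified "element in scope" check (assumption, see lead-in).
    if not any(tag == "form" for tag, _ in stack):
        return form_pointer, True   # parse error; ignore the token

    # Step 1: generate implied end tags.
    while stack[-1][0] in IMPLIED_END_TAGS:
        stack.pop()

    # Step 2 only reports a parse error, so it does not touch the stack.

    # Step 3: remove /node/ itself, leaving the elements above it open.
    if node is not None and node in stack:
        stack.remove(node)
    return form_pointer, False


# Stack at the point where the end tag meant for "secondform" arrives.
stack = [
    ("html", None),
    ("body", None),
    ("div", "firstdiv"),
    ("div", "seconddiv"),
    ("form", "firstform"),
    ("div", "thirddiv"),
]
pointer, ignored = end_form_tag(stack, ("form", "firstform"))
print([ident for _, ident in stack], pointer, ignored)

# The later end tag meant for "firstform" now finds no form and is ignored.
pointer2, ignored2 = end_form_tag(stack, pointer)
print(ignored2)
```

The point of step 3 is that "thirddiv" stays on the stack, so the
following end div tags close "thirddiv" and "seconddiv" as the author
of the page intended.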

This seems to give us pretty much the same behaviour as Opera for the 
simple example above. Can any of you see any potential problems with 
this approach? In any case, I do believe that the specification needs to 
be changed one way or another, so that it handles this case better.

I think I have a couple of other instances where we've had to deviate 
from the specification in order to tackle problems discovered by our 
testers, and if any of you are interested in this kind of feedback, I'll 
dig them out and post them on this list.

Best regards
Tommy Thorsen
Received on Thursday, 6 November 2008 01:28:59 UTC