Re: [whatwg] HTML parsing, the stack of open elements, and foreign content

From: Adam Klein <adamk@chromium.org>
Date: Tue, 2 Apr 2013 12:19:08 -0700
Message-ID: <CAEvLGcJsXE5e4aFpWt73ks80ORqFAzjg+yrjRkOPGhvYbgLKHg@mail.gmail.com>
To: Rafael Weinstein <rafaelw@google.com>
Cc: William Chen <wchen@mozilla.com>, WHATWG List <whatwg@whatwg.org>, Ian Hickson <ian@hixie.ch>, me@gsnedders.com, Henri Sivonen <hsivonen@gmail.com>

Since I haven't heard any talk on this thread (or on the w3.org bug),
I've landed a patch in WebKit to treat tokens being processed in HTML
as if they had an HTML namespace (http://trac.webkit.org/r147441). My
reason for landing was that we've already seen two crash bugs due to
the WebKit parser getting into a bad state WRT the stack of open
elements, and I'd rather not leave us open to more of the same.

Note that this change passes all existing html5lib tests. I added one
test case, which came (slightly modified) from Rafael's bug:

<body><table><tr><td><svg><td><foreignObject><span></td>Foo

which is now parsed as:

| <html>
|   <head>
|   <body>
|     "Foo"
|     <table>
|       <tbody>
|         <tr>
|           <td>
|             <svg svg>
|               <svg td>
|                 <svg foreignObject>
|                   <span>

where previously (and in current Firefox) it's parsed as:

| <html>
|   <head>
|   <body>
|     <table>
|       <tbody>
|         <tr>
|           <td>
|             <svg svg>
|               <svg td>
|                 <svg foreignObject>
|                   <span>
|               "Foo"

That is, the </td> is being parsed as HTML (thanks to
<foreignObject><span>), so it now searches the stack for an HTML td to
close instead of stopping at the <svg td>. There is probably a whole
set of similar test cases, but they can be tricky to construct thanks,
in part, to the various "escape hatches" from an HTML integration
point (including <p>, <table>, and many more).
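
To make the difference concrete, here is a minimal sketch of the two
matching rules applied to the stack of open elements at the moment the
</td> arrives. The stack is built by hand rather than produced by a
real parser, and the helper names (el, popUntil, show) are made up for
illustration; none of this is taken from the WebKit patch.

```javascript
const HTML_NS = 'http://www.w3.org/1999/xhtml';
const SVG_NS = 'http://www.w3.org/2000/svg';

// Hypothetical helper: a stack entry is just a namespace plus a local name.
const el = (ns, localName) => ({ ns, localName });

// Hand-built stack of open elements just before </td> is processed
// for: <body><table><tr><td><svg><td><foreignObject><span></td>Foo
function buildStack() {
  return [
    el(HTML_NS, 'html'), el(HTML_NS, 'body'), el(HTML_NS, 'table'),
    el(HTML_NS, 'tbody'), el(HTML_NS, 'tr'), el(HTML_NS, 'td'),
    el(SVG_NS, 'svg'), el(SVG_NS, 'td'), el(SVG_NS, 'foreignObject'),
    el(HTML_NS, 'span'),
  ];
}

// Pop entries until one satisfying `matches` has been popped.
function popUntil(stack, matches) {
  while (stack.length > 0) {
    if (matches(stack.pop())) break;
  }
  return stack;
}

// Old rule: compare the local name only, so <svg td> counts as a match.
const byLocalName = (e) => e.localName === 'td';
// New rule: the entry must also be in the HTML namespace.
const byNamespaceAndName = (e) => e.ns === HTML_NS && e.localName === 'td';

const show = (stack) =>
  stack.map((e) => (e.ns === SVG_NS ? 'svg ' : '') + e.localName).join(' > ');

const oldStack = popUntil(buildStack(), byLocalName);
const newStack = popUntil(buildStack(), byNamespaceAndName);

console.log(show(oldStack)); // html > body > table > tbody > tr > td > svg svg
console.log(show(newStack)); // html > body > table > tbody > tr
```

With the local-name rule the current node ends up being <svg svg>, so
"Foo" is inserted there as foreign content. With the namespace-aware
rule the HTML td is closed, and "Foo" is then seen as text inside a
table row and gets foster-parented in front of the table.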

I think the equivalent spec change would be to spell out in detail
what it means for a token or element to match something on the stack
of open elements. The new WebKit behavior seems more proper to me (and
seemed reasonable to those I could raise on #whatwg a few days ago); I
also think it's unlikely to affect much real content, so changing it
to make the parser's internal state more sane is worthwhile.
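
As a rough illustration of what "spelling it out" could look like,
here is a sketch of the "has an element in table scope" check with and
without a namespace test. It uses a hand-built stack for the innerHTML
example quoted below (the root html element that fragment parsing
creates, followed by the three MathML elements). The function name,
the checkNamespace option, and the data layout are assumptions of
mine, not spec text or WebKit code; the scope markers are html, table,
and template as listed in the current spec.

```javascript
const HTML_NS = 'http://www.w3.org/1999/xhtml';
const MATHML_NS = 'http://www.w3.org/1998/Math/MathML';

// HTML elements that terminate a table-scope search.
const TABLE_SCOPE_MARKERS = new Set(['html', 'table', 'template']);

// Walk the stack from the current node downwards. Return true if an
// element named `tagName` is found before a scope marker is reached.
// With checkNamespace set, only HTML elements can match.
function hasElementInTableScope(stack, tagName, { checkNamespace }) {
  for (let i = stack.length - 1; i >= 0; i--) {
    const node = stack[i];
    const isHtml = node.ns === HTML_NS;
    if (node.localName === tagName && (isHtml || !checkNamespace)) return true;
    if (isHtml && TABLE_SCOPE_MARKERS.has(node.localName)) return false;
  }
  return false;
}

// Hand-built stack for: tr.innerHTML = '<math><tr><mo><td>'
const fragmentStack = [
  { ns: HTML_NS, localName: 'html' },
  { ns: MATHML_NS, localName: 'math' },
  { ns: MATHML_NS, localName: 'tr' },
  { ns: MATHML_NS, localName: 'mo' },
];

// Local name only: the <math tr> is mistaken for an HTML <tr>.
console.log(hasElementInTableScope(fragmentStack, 'tr', { checkNamespace: false })); // true
// Namespace-aware: no HTML <tr> is open, so the check fails.
console.log(hasElementInTableScope(fragmentStack, 'tr', { checkNamespace: true })); // false
```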

- Adam

On Fri, Mar 15, 2013 at 10:31 AM, Rafael Weinstein <rafaelw@google.com> wrote:
> I just opened another similar bug:
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=21292 which has a
> similar root cause.
>
> I agree with Adam that it seems wrong that the stack of open elements
> can contain elements in disparate namespaces, but its operation (at
> times) only examines the local name (e.g. checking if an element is in
> a specific scope, popping elements from the stack of open elements
> until an element with the same tag name...)
>
> On Wed, Feb 27, 2013 at 12:39 PM, Adam Klein <adamk@chromium.org> wrote:
>> Consider the following script:
>>
>> tr = document.createElement('tr')
>> tr.innerHTML = '<math><tr><mo><td>';
>>
>> That is, the fragment is parsed with tr as the context element. What
>> should the generated DOM be? Note that <mo> is a "MathML text
>> integration point", which causes the <td> to be processed not as
>> foreign content but as a normal HTML token. This leads to the
>> following DOM in WebKit:
>>
>> <tr>
>>     <math math>
>>         <math tr>
>>             <math mo>
>>     <td>
>>
>> (the "math" prefixes denote that these are elements with the MathML
>> namespace.) In Gecko, I instead get:
>>
>> <tr>
>>     <math math>
>>         <math tr>
>>             <math mo>
>>             <td>
>>
>> Note that the <td> in both cases is an HTML element, even though in
>> Gecko it's in a MathML tree.
>>
>> The spec for what should happen to that <td> is the first step of
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-intr
>>
>> This case clearly seems like a bug in Gecko: it's treating the <math
>> tr> as if it's an HTML <tr>. That is, it's comparing only the local
>> name (or "tag name" as the spec usually refers to it).
>>
>> But this same ambiguity exists elsewhere in the spec. For example, the
>> very next item under "in row" says "If the stack of open elements does
>> not have an element in table scope with the same tag name as the
>> token" (in this case, it's looking for a <tr>).
>>
>> I think the HTML parser ought to specify more precisely how to deal
>> with namespaces in the stack of open elements, given that that stack
>> can contain elements of varying namespaces.
>>
>> - Adam

Received on Tuesday, 2 April 2013 19:19:36 UTC
