[whatwg] Namespaces and tag names in the HTML parser from Ian Hickson on 2013-05-29 (public-whatwg-archive@w3.org from May 2013)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 29 May 2013 22:19:35 +0000 (UTC)
To: Adam Klein <adamk@chromium.org>, Rafael Weinstein <rafaelw@google.com>
Cc: whatwg@whatwg.org, William Chen <wchen@mozilla.com>, me@gsnedders.com, Henri Sivonen <hsivonen@gmail.com>
Message-ID: <Pine.LNX.4.64.1305292028010.2932@ps20323.dreamhostps.com>

On Wed, 27 Feb 2013, Adam Klein wrote:
>
> Consider the following script:
> 
> tr = document.createElement('tr')
> tr.innerHTML = '<math><tr><mo><td>';
> 
> That is, the fragment is parsed with tr as the context element. What 
> should the generated DOM be?

Up to the <td> it's unambiguous and uncontroversial, I hope; and should 
be:

   <html:tr>
    <math:math>
     <math:tr>
      <math:mo>

At the "<td>", you clear the stack back to a table row context, which pops 
all the nodes from the stack except the root one (the <html> one, 
representing the original <tr> element on which innerHTML was invoked).

It thus results in:

   <html:tr>
    <math:math>
     <math:tr>
      <math:mo>
    <html:td>


> Note that <mo> is a "MathML text integration point", which causes the 
> <td> to be processed not as foreign content but as a normal HTML token. 
> This leads to the following DOM in WebKit:
> 
> <tr>
>     <math math>
>         <math tr>
>             <math mo>
>     <td>
> 
> (the "math" prefixes denote that these are elements with the MathML 
> namespace.)

That is correct.


> In Gecko, I instead get:
>
> <tr>
>     <math math>
>         <math tr>
>             <math mo>
>             <td>

That is not.


> The spec for what should happen to that <td> is the first step of 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-intr
> 
> This case clearly seems like a bug in Gecko: it's treating the <math tr> 
> as if it's an HTML <tr>. That is, it's comparing only the local name (or 
> "tag name" as the spec usually refers to it).

Right, that's wrong. The spec isn't ambiguous here, it explicitly says 
that the current node must be a <tr> or <html> element, not an element 
with a "tr" or "html" tag name, and <tr> and <html> elements are in the 
HTML namespace (they're even hyperlinked to their definitions).
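
The difference between the two behaviours can be sketched with a toy model 
of the stack of open elements. This is only an illustration, not code from 
either engine: the helper names (el, insert, clearToTableRowContext, 
parseExample, dump) are made up for this example, elements are plain 
objects rather than DOM nodes, and the root is labelled "html" to stand in 
for the context <tr>.

```javascript
// Real namespace URIs; everything else in this sketch is hypothetical.
const HTML = "http://www.w3.org/1999/xhtml";
const MATHML = "http://www.w3.org/1998/Math/MathML";

// Minimal stand-in for a DOM element.
function el(ns, name) {
  return { ns, name, children: [] };
}

// Append a new element to the current node and push it on the stack.
function insert(stack, ns, name) {
  const node = el(ns, name);
  stack[stack.length - 1].children.push(node);
  stack.push(node);
  return node;
}

// Namespace-aware check: only an HTML <tr> or <html> stops the popping.
const isRowContext = (n) =>
  n.ns === HTML && (n.name === "tr" || n.name === "html");

// Tag-name-only check: any element called "tr" or "html" stops the popping.
const isRowContextByTagName = (n) => n.name === "tr" || n.name === "html";

// "Clear the stack back to a table row context", with a pluggable stop test.
function clearToTableRowContext(stack, stop) {
  while (stack.length > 1 && !stop(stack[stack.length - 1])) stack.pop();
}

// Replays '<math><tr><mo><td>' against a fragment root.
function parseExample(stop) {
  const root = el(HTML, "html"); // stands in for the context <tr>
  const stack = [root];
  insert(stack, MATHML, "math");
  insert(stack, MATHML, "tr");
  insert(stack, MATHML, "mo");
  clearToTableRowContext(stack, stop); // triggered by the <td> start tag
  insert(stack, HTML, "td");
  return root;
}

// Render the tree as indented lines, one element per line.
function dump(node, depth = 0) {
  const prefix = node.ns === HTML ? "html" : "math";
  const line = " ".repeat(depth) + `<${prefix}:${node.name}>`;
  return [line, ...node.children.flatMap((c) => dump(c, depth + 1))];
}

console.log(dump(parseExample(isRowContext)).join("\n"));
console.log(dump(parseExample(isRowContextByTagName)).join("\n"));
```

With the namespace-aware test the <td> ends up as a child of the root; with 
the tag-name-only test the popping stops at the MathML "tr" and the <td> is 
inserted there instead.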


> But this same ambiguity exists elsewhere in the spec. For example, the 
> very next item under "in row" says "If the stack of open elements does 
> not have an element in table scope with the same tag name as the token" 
> (in this case, it's looking for a <tr>).

Yeah, that text is wrong, because part of the rules looks for <*:tr>, and 
part assumes that only <html:tr> was matched. In fact, it means that 
tr.innerHTML = '<math><tr><mo></tr>' has no parse error and pops the root 
<html> off the tree! That's clearly bogus.
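
A small model of that mismatch, assuming the </tr> token is handled by the 
"in row" end tag steps in isolation, as the reasoning above does. The sketch 
does not model the rest of the tree construction dispatcher, and the 
function names and the short "html"/"math" namespace labels are invented for 
the illustration.

```javascript
// Short labels standing in for the HTML and MathML namespaces.
const HTML = "html";
const MATHML = "math";

// HTML elements that end a table-scope search.
const TABLE_SCOPE_BOUNDARIES = new Set(["html", "table", "template"]);

// "Has an element in table scope", comparing only the tag name,
// which is how the quoted spec text reads.
function hasInTableScopeByTagName(stack, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    const node = stack[i];
    if (node.name === tagName) return true;
    if (node.ns === HTML && TABLE_SCOPE_BOUNDARIES.has(node.name)) return false;
  }
  return false;
}

// "Clear the stack back to a table row context", namespace-aware.
function clearToTableRowContext(stack) {
  const isStop = (n) => n.ns === HTML && (n.name === "tr" || n.name === "html");
  while (stack.length > 0 && !isStop(stack[stack.length - 1])) stack.pop();
}

// The "in row" steps for a </tr> end tag, mixing both kinds of comparison.
function endTagTr(stack) {
  if (!hasInTableScopeByTagName(stack, "tr")) {
    return { parseError: true, popped: null };
  }
  clearToTableRowContext(stack);
  const popped = stack.pop(); // the steps assume this is an HTML <tr>
  return { parseError: false, popped };
}

// Stack after '<math><tr><mo>' with a <tr> context element.
const stack = [
  { ns: HTML, name: "html" },
  { ns: MATHML, name: "math" },
  { ns: MATHML, name: "tr" },
  { ns: MATHML, name: "mo" },
];

const result = endTagTr(stack);
console.log(result.parseError, result.popped, stack.length);
```

The scope check is satisfied by the MathML "tr", the clearing step then runs 
all the way back to the root, and the final pop removes the root itself.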


> I think the HTML parser ought to specify more precisely how to deal with 
> namespaces in the stack of open elements, given that that stack can 
> contain elements of varying namespaces.

It's not so much that it has to do it precisely (it does), it's that it 
has to do it accurately...

There's a huge number of places in the spec that do tag name comparisons 
rather than element identity (tag+namespace) comparisons, and it's not at 
all clear to me that they should all change. Consider:

On Fri, 15 Mar 2013, Rafael Weinstein wrote:
>
> I just opened another similar bug: 
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=21292 which has a similar 
> root cause.
> 
> I agree with Adam that it seems wrong that the stack of open elements 
> can contain elements in disparate namespaces, but its operation (at 
> times) only examines the local name (e.g. checking if an element is in a 
> specific scope, popping elements from the stack of open elements until 
> an element with the same tag name...)

Well, as noted in the bug, I don't think we should check the namespace in 
_every_ case. The case in the bug is this:

   <body><table><tr><td><svg><td><foreignObject></td>Foo<foo>

This is clearly invalid; the question is, what <td> did the author mean to 
match, if any? It makes sense to me to match the most recent one. In 
particular, consider these variations:

   <body><table><tr><td><svg><zz><foreignObject></td>Foo<foo>
   <body><table><tr><td><svg><zz><foreignObject></zz>Foo<foo>
   <body><table><tr><zz><svg><zz><foreignObject></zz>Foo<foo>
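
The two matching policies under discussion can be compared side by side. 
This is a sketch under stated assumptions: the stacks below are written out 
by hand for the point where the end tag arrives (including the implied 
<tbody>), the "html"/"svg" labels stand in for namespaces, and the function 
names are made up for the example. Only the original case and the first two 
variations are modelled.

```javascript
// Build a stack from "ns:name" strings, oldest element first.
const toStack = (names) =>
  names.map((s) => {
    const [ns, name] = s.split(":");
    return { ns, name };
  });

// Index of the most recently opened element with this tag name,
// regardless of namespace; -1 if there is none.
function matchByTagName(stack, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].name === tagName) return i;
  }
  return -1;
}

// Same search, but only HTML-namespace elements are allowed to match.
function matchHtmlOnly(stack, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].ns === "html" && stack[i].name === tagName) return i;
  }
  return -1;
}

const outer = ["html:html", "html:body", "html:table", "html:tbody", "html:tr", "html:td"];

// <td><svg><td><foreignObject></td>
const original = toStack([...outer, "svg:svg", "svg:td", "svg:foreignObject"]);
// <td><svg><zz><foreignObject></td> and <td><svg><zz><foreignObject></zz>
const variation = toStack([...outer, "svg:svg", "svg:zz", "svg:foreignObject"]);

console.log(matchByTagName(original, "td"), matchHtmlOnly(original, "td"));
console.log(matchByTagName(variation, "td"), matchHtmlOnly(variation, "td"));
console.log(matchByTagName(variation, "zz"), matchHtmlOnly(variation, "zz"));
```

Matching on tag name alone picks the innermost element with that name in 
every case; restricting the match to HTML elements skips the SVG <td> in the 
original case and finds nothing at all for </zz>.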



The cases in the spec now that are bogus are the cases where I mix the two 
kinds of comparison (tag name only, and tag name plus namespace). Fixing 
that actually means the opposite kind of change from the one being proposed 
above: for example, it would mean changing the "table" end tag steps from 
what they say now (popping an HTML <table> element), to popping any "table" 
element regardless of namespace. This would make the algorithm more 
consistent, and remove the bugs mentioned above.

Is this what people want to do? It's not what you (Adam) implemented, as I 
understand it.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 29 May 2013 22:20:03 UTC