Re: several messages about the tree construction stage of HTML parsing from Ian Hickson on 2008-03-05 (public-html@w3.org from March 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 5 Mar 2008 07:45:57 +0000 (UTC)
To: "whatwg@whatwg.org List" <whatwg@whatwg.org>, "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0803030005200.6407@hixie.dreamhostps.com>

Executive summary:

Hoo boy did these e-mails end up with a lot of complicated changes -- a 
total of 38 different checkins. Some were editorial, but others were quite 
invasive changes. Here are the main ones:

 * Merged insertion modes and phases. Theoretically this didn't change 
   anything, but I'm sure I broke many things, so please let me know what 
   I should fix. Note that I did the changes in steps, to make reviewing 
   the changes easier. (r1308-r1313)

 * Make a text node before an <html> start tag be a parse error. (r1314)

 * Strip spaces between <html> and <head> tags. (r1332)

 * Make <title> not get moved to <head>. (r1328)

 * Fix the processing of </form> to work like </div>. (r1320)

 * Make <listing> parse like <pre> in terms of leading newlines. (r1330)

 * Make spaces in <table> get reparented if there's been other content 
   that got reparented. (r1326)

 * Make <input type="hidden"> in <table> not get reparented unless 
   something else has been reparented. (r1331)

 * Make <style> and <script> in <table> not get reparented. (r1335)

 * Make <select> inside <table> get handled compatibly. (r1342)

 * Fix bugs with the AAA algorithm in tables. (r1343)

For the full list see the diffs from r1308 to r1346.


On Thu, 19 Jan 2006, Simon Pieters wrote:
> 
> I did some more testing[1] on this. It seems that in Mozilla and Opera, 
> the DOM tree for...
> 
>   <table><tbody><form><tr><td><input></td></tr></form></tbody></table>
> 
> ...looks like this:
> 
> TABLE
> - TBODY
> - - FORM
> - - TR
> - - - TD
> - - - - INPUT
> 
> In IE6 the DOM tree looks like this:
> 
> TABLE
> - TBODY
> - - FORM
> - - - TR
> - - - - TD
> - - - - - INPUT
> 
> Even though the FORM is not an ancestor to the INPUT in the DOM tree in 
> Mozilla or Opera, the DOM2 HTML attributes .form and .elements work as 
> if it was. The same holds true when <form> is placed as a child of 
> TABLE, TBODY (presumably also THEAD and TFOOT), and TR.
> 
> [1] http://zcorpan.1go.dk/test/html/table-form/

The problem with IE's tree is that it is incompatible with the CSS table 
model. The problem with the Mozilla and Opera tree is that the form is 
still in the way, but since it's not an ancestor, it doesn't really have 
to be there at all. Hence the HTML5 processing.


On Fri, 29 Jun 2007, Henri Sivonen wrote:
>
> If the spec dealt with the "html" start tag token directly in the root 
> element phase, the parse error in the main phase wouldn't need to be 
> conditional. (Implementations that experience a perf benefit from not 
> mutating the attributes of a node probably want to hoist the "html" node 
> creation to the root element phase for perf reasons, too.)

Done.


On Mon, 11 Feb 2008, Philip Taylor wrote:
> 
> There's also an issue with:
> 
>   <!doctype html>
>   foo
>   <html>
> 
> not producing any parse error, because the <html> is the first start tag 
> token (at least under my interpretation) and therefore is considered 
> valid. Handling <html> specially in the root element phase seems like a 
> reasonable way of fixing this.

Fixed.


On Sat, 30 Jun 2007, Henri Sivonen wrote:
>
> Under "before head" the case 'A start tag token whose tag name is one 
> of: "base", "link", "meta", "script", "style", "title"' is the same as 
> any other start tag token.

Removed.


On Sun, 1 Jul 2007, Henri Sivonen wrote:
> 
> In the tree construction part of the parsing algorithm, void elements 
> that are generally supposed to be children of the <head> element never 
> get popped. (Void elements that are generally supposed to be descendants 
> of the <body> element are appropriately popped immediately.)

Fixed. (Assuming this only affected "meta", "base", and "link" while in 
head, anyway.)


On Sun, 1 Jul 2007, Henri Sivonen wrote:
> 
> Please add a note that says that the "context" concept of the [R]CDATA 
> algorithm causes <title> to be moved to <head> here.

No longer applicable (<title> is no longer reparented).


On Sun, 1 Jul 2007, Henri Sivonen wrote:
> 
> In the tree construction part of the parsing algorithm, the rationale 
> for formulating the generic [R]CDATA parsing algorithm the way it is 
> formulated is not given. The formulation is unusual compared to the rest 
> of the chapter, so it is reasonable to expect that there's a specific 
> reason why it is written the way it is written.

It was written the way that my mind thought about it...


> My practical concern is this:
>
> In my implementation the tokenizer owns the main processing loop. 
> Therefore, the tree builder can only change its state on a per-token 
> basis and cannot pull another token in response to processing one token. 
> (Instead, it can set its own flags, return control to the tokenizer and 
> wait for the tokenizer to call back into the tree builder again.)
>
> I have solved the problem as follows:
> 
> cdataOrRcdataTimesToPop is initialized to 0.
> 
> When the spec invokes the generic [R]CDATA parsing algorithm, instead of
> running it, do the following:
> 1. If the context node is the current node,
>  1a. Create an element for the token.
>  1b. Push the element.
>  1c. Set the content model flag of the tokenizer.
>  1d. Set cdataOrRcdataTimesToPop to 1.
> 2. Otherwise, if the context node is not the current node,
>  2a. Push the context node.
>  2b. Create an element for the token.
>  2c. Push the element.
>  2d. Set the content model flag of the tokenizer.
>  2e. Set cdataOrRcdataTimesToPop to 2.
> 
> Modify the processing of character tokens and end tag tokens as follows:
> 
> 3. If a character token is seen and cdataOrRcdataTimesToPop > 0,
>  3a. Append the character token to the current node.
>  3b. Omit the normal processing of character tokens.
> 4. If an end tag token is seen and cdataOrRcdataTimesToPop > 0,
>  (The token will always be the end tag for the [R]CDATA element.)
>  4a. Pop cdataOrRcdataTimesToPop times.
>  4b. Set cdataOrRcdataTimesToPop to 0.
>  4c. Omit normal end tag token processing.
> 
> I'd like to know if this transformation breaks some important property 
> caused by the formulation of the spec.

As far as I can tell it is equivalent.
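
For illustration, here is a minimal Python sketch of the counter-based 
transformation described above. The class and method names are 
hypothetical, the content model flag is stored on the tree builder rather 
than on a real tokenizer, and appending the new element to the current 
node is an assumption (the quoted steps only say "create" and "push"). It 
is a sketch under those assumptions, not a reference implementation.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Nodes and single-character strings


class TreeBuilder:
    def __init__(self, root):
        self.stack = [root]            # stack of open elements
        self.times_to_pop = 0          # cdataOrRcdataTimesToPop
        self.content_model = "PCDATA"  # stand-in for the tokenizer flag

    @property
    def current(self):
        return self.stack[-1]

    def start_rcdata_or_cdata(self, tag_name, context, model):
        """Steps 1 and 2: run instead of the generic [R]CDATA algorithm."""
        pops = 1
        if context is not self.current:
            self.stack.append(context)  # 2a: push the context node
            pops = 2
        element = Node(tag_name)        # 1a / 2b: create the element
        self.current.children.append(element)  # assumed: attach to parent
        self.stack.append(element)      # 1b / 2c: push the element
        self.content_model = model      # 1c / 2d
        self.times_to_pop = pops        # 1d / 2e

    def character(self, ch):
        """Step 3: returns True when normal processing must be skipped."""
        if self.times_to_pop > 0:
            self.current.children.append(ch)  # 3a
            return True                       # 3b
        return False

    def end_tag(self, tag_name):
        """Step 4: returns True when normal processing must be skipped."""
        if self.times_to_pop > 0:
            for _ in range(self.times_to_pop):  # 4a
                self.stack.pop()
            self.times_to_pop = 0               # 4b
            return True                         # 4c
        return False
```

The tokenizer would call character() and end_tag() for each token and 
fall through to the normal rules only when they return False.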


> Specifically, the spec says:
> > 7. If the next token is an end tag token with the same tag name as the start
> > tag token, ignore it. Otherwise, this is a parse error.
> 
> How could you see any other token but an end tag token with the same tag 
> name as the start tag token, a character token or EOF?

The only way it can't be the end tag token is if it is an EOF token, I 
believe. I've made the spec say that.


On Sun, 1 Jul 2007, Henri Sivonen wrote:
> 
> The cases
> > A start tag whose tag name is one of: "address", "blockquote", "center",
> > "dir", "div", "dl", "fieldset", "listing", "menu", "ol", "p", "ul"
> and
> > A start tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6"
> have the same action and can be unified.

Done.


On Mon, 2 Jul 2007, Henri Sivonen wrote:
> 
> The handling of "select" "in body" does not associate the node with the 
> current form pointer. Is this intentional?

Fixed.


On Sat, 14 Jul 2007, Henri Sivonen wrote:
> 
> The same goes for "button".

Fixed.


On Wed, 4 Jul 2007, Henri Sivonen wrote:
> > 
> > A start tag whose tag name is "frameset"
> > 
> >     Insert a frameset element for the token.
> 
> There seems to be nothing special about inserting a frameset element. I 
> suggest using the normal wording "Insert an HTML element for the token."

Done.


On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> For some elements the spec says:
> > If the stack of open elements has an element in scope with the same 
> > tag name as that of the token, then pop elements from this stack until 
> > an element with that tag name has been popped from the stack.
> 
> But for (at least) the p element it says:
> > If the stack of open elements has a p element in scope, then pop 
> > elements from this stack until the stack no longer has a p element in 
> > scope.
> 
> The different wording implies that the implementation needs to run a 
> different subalgorithm. However, this is not the case. Therefore, it 
> would be better to use the first wording in both cases.

When I use the second wording quoted above, it is sometimes the case that 
the token is not the same as the element being sought.

If there are specific cases where I could make the spec more consistent 
without loss of precision, please let me know.


On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> For some elements the spec says:
> > If the stack of open elements has an element in scope with the same tag name
> > as that of the token, then generate implied end tags.
> > 
> > Now, if the current node is not an element with the same tag name as that of
> > the token, then this is a parse error.
> > 
> > If the stack of open elements has an element in scope with the same tag name
> > as that of the token, then pop elements from this stack until an element
> > with that tag name has been popped from the stack.
> 
> I can't figure out how one might get the stack in such a state that the 
> "generate implied end tags" step changed the situation so that the 
> second "If the stack of open elements has an element in scope" found a 
> different node than the first "If the stack of open elements has an 
> element in scope".
> 
> Am I right? If yes, it would make sense to write this in a way that 
> doesn't suggest that implementors search the stack twice.

Done.


On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > When the steps below require the UA to generate implied end tags, then, if
> > the current node is a dd element, a dt element, an li element, a p element,
> > a tbody element, a td element, a tfoot element, a th element, a thead
> > element, a tr element, the UA must act as if an end tag with the respective
> > tag name had been seen and then generate implied end tags again.
> 
> As far as I and, judging from source comments and IRC remarks, the 
> authors of html5lib can tell, "act as if an end tag with the respective 
> tag name had been seen" simplifies to "pop the current node off the 
> stack of open elements". If this is indeed the case, the spec should 
> just say so. If this is not the case, the difference appears to be too 
> subtle for implementors and should be called out explicitly.

Done.



On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > When the steps below require the UA to generate implied end tags, 
> > then, if the current node is a dd element, a dt element, an li 
> > element, a p element, a tbody element, a td element, a tfoot element, 
> > a th element, a thead element, a tr element, the UA must act as if an 
> > end tag with the respective tag name had been seen and then generate 
> > implied end tags again.
> 
> The table-related elements got in there a while ago: 
> http://html5.org/tools/web-apps-tracker?from=964&to=965
> 
> As far as Anne and I can tell, the implied end tags of the table-related 
> elements are already handled regardless of "generate implied end tags". 
> http://krijnhoetmer.nl/irc-logs/whatwg/20070705#l-106
> 
> If the table-related elements are indeed not needed on the list, they 
> should be taken out.

Done.
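
For illustration, a minimal Python sketch of "generate implied end tags" 
once both simplifications are applied: the step just pops the current 
node, and the table-related elements are off the list. The stack is 
modelled as a plain list of tag names, and the function and constant 
names are hypothetical.

```python
# Elements whose end tags can be implied, after dropping the
# table-related entries (tbody, td, tfoot, th, thead, tr).
IMPLIED_END_TAG_NAMES = {"dd", "dt", "li", "p"}


def generate_implied_end_tags(stack):
    """Pop the current node while it is one of the implied-end-tag elements.

    `stack` is the stack of open elements as a list of tag names, with the
    current node last. The list is modified in place and returned.
    """
    while stack and stack[-1] in IMPLIED_END_TAG_NAMES:
        stack.pop()
    return stack
```

The loop replaces the recursive "and then generate implied end tags 
again" wording from the quoted text.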


On Fri, 6 Jul 2007, Henri Sivonen wrote:
> 
> On Jul 5, 2007, at 10:57, Henri Sivonen wrote:
> 
> > If the table-related elements are indeed not needed on the list, they should
> > be taken out.
> 
> Doing this would simplify the implied end tag generation when a table 
> cell closes, because the exclusion of the table cell itself would be 
> unnecessary.

Removed.


On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > An end tag whose tag name is "p"
> > 
> >     If the stack of open elements has a p element in scope, then generate
> > implied end tags, except for p elements.
> 
> Anne pointed out (correctly, I think) that the "generate implied end 
> tags, except for p elements" is always a no-op. 
> http://krijnhoetmer.nl/irc-logs/whatwg/20070705#l-156

Removed.


On Thu, 5 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > An end tag token not covered by the previous entries
> > 
> >     Run the following algorithm:
> > 
> >        1. Initialise node to be the current node (the bottommost node of the
> >           stack).
> >        2. If node has the same tag name as the end tag token, then:
> >              1. Generate implied end tags.
> >              2. If the tag name of the end tag token does not match the tag
> >                 name of the current node, this is a parse error.
> >              3. Pop all the nodes from the current node up to node,
> >                 including node, then stop this algorithm.
> >        3. Otherwise, if node is in neither the formatting category nor the
> >           phrasing category, then this is a parse error. Stop this 
> >           algorithm. The end tag token is ignored.
> >        4. Set node to the previous entry in the stack of open elements.
> >        5. Return to step 2.
> 
> The sublist doesn't make sense. If the current node has the same tag 
> name as the token, the stack should be popped. Generating implied end 
> tags makes no sense.

It's a loop. You return to step 2 after climbing up the chain.
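
For illustration, a minimal Python sketch of the quoted algorithm written 
as an explicit loop. The stack is a list of tag names with the current 
node last; the function name, the set of formatting/phrasing names passed 
in, and the error strings are hypothetical. The guard that stops implied 
end tag generation from popping the matched node itself is an assumption 
of this sketch, not something the quoted text states.

```python
IMPLIED_END_TAG_NAMES = {"dd", "dt", "li", "p"}


def any_other_end_tag(stack, tag_name, formatting_or_phrasing, errors):
    """Return True if the end tag closed something, False if it was ignored."""
    index = len(stack) - 1                      # step 1: start at current node
    while index >= 0:
        node = stack[index]
        if node == tag_name:                    # step 2
            # 2.1: generate implied end tags (never past the matched node).
            while len(stack) - 1 > index and stack[-1] in IMPLIED_END_TAG_NAMES:
                stack.pop()
            # 2.2: the current node should now be the matched element.
            if stack[-1] != tag_name:
                errors.append("unclosed elements before </%s>" % tag_name)
            # 2.3: pop everything from the current node up to and
            # including the matched node.
            del stack[index:]
            return True
        if node not in formatting_or_phrasing:  # step 3
            errors.append("stray </%s> ignored" % tag_name)
            return False
        index -= 1                              # steps 4 and 5: climb, repeat
    return False
```

Step 2 is only reached for the node currently being examined, which is 
why implied end tags are generated after a match is found further up the 
stack rather than before the search begins.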


> The algorithm should probably read as follows:
> 1. If the current node has the same tag name as the end tag token, pop the
> current node off the stack and then stop this algorithm.
> 2. Generate implied end tags.
> 3. If the current node has the same tag name as the end tag token, pop the
> current node off the stack and then stop this algorithm.
> 4. If the current node is in the formatting category or in the phrasing
> category, then this is a parse error. Pop the current node off the stack and
> then return to step 2.
> 5. Otherwise, stop this algorithm.

This algorithm does something different than what we want if the end tag 
doesn't match anything.


> Note that both formulations seem to make a stray </td> in "in body" not 
> to be silently ignored by as closing open formatting or phrasing 
> elements. Is this right? Should popping phrasing or formatting elements 
> first check if there's an element in scope with the same tag name as the 
> token?

I have no idea what you're trying to say here, sorry.


On Fri, 6 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > An end tag whose tag name is "table"
> > 
> >     If the stack of open elements does not have an element in table scope
> > with the same tag name as the token, this is a parse error. Ignore the
> > token. (fragment case)
> 
> "in table scope" is redundant and could be struck.

How so?

<table><marquee></table>


On Fri, 6 Jul 2007, Henri Sivonen wrote:
> The spec says:
> > An end tag whose tag name is "table"
> > 
> >     If the stack of open elements does not have an element in table scope
> > with the same tag name as the token, this is a parse error. Ignore the
> > token. (fragment case)
> > 
> >     Otherwise:
> > 
> >     Generate implied end tags.
> > 
> >     Now, if the current node is not a table element, then this is a parse
> >     error.
> > 
> >     Pop elements from this stack until a table element has been popped from
> >     the stack.
> > 
> >     Reset the insertion mode appropriately.
> 
> Why have the steps
> >     Generate implied end tags.
> > 
> >     Now, if the current node is not a table element, then this is a parse
> >     error.
> there at all?
> 
> If there are implied end tags to generate, it means that the stack has 
> foster-parented stuff pushed onto it. The foster parenting triggered an 
> error when the nodes got onto the stack. Do we really care about how 
> gracefully they come off the stack?

On Fri, 6 Jul 2007, Henri Sivonen wrote:
> 
> On the other hand, perhaps we do care about <table><div></table> being
> subjectively even worse than <table><div></div></table> but
> <table><dd></table> and <table><dd></dd></table> being equally bad.
> 
> Anyway, this isn't the only case where foster-parented subtrees come off the
> stack.

Yeah, for consistency I guess it makes no sense for <table><p><i><tr> to 
be fewer errors than <table><p><i></table>, and the latter is easier to 
reduce to one error than the former to increase to two...

In fact, I've removed all those errors. I agree with you that it makes 
more sense to simply consider the foster parenting the reportable error, 
and forget the other errors.


On Fri, 6 Jul 2007, Henri Sivonen wrote:
> 
> The spec says:
> > If node is the first node in the stack of open elements, then set last to
> > true. If the context element of the HTML fragment parsing algorithm is
> > neither a td element nor a th element, then set node to the context element.
> > (fragment case)
> 
> The second sentence is also qualified by the first "If", right?

Clarified.


On Tue, 10 Jul 2007, Henri Sivonen wrote:
> 
> The tree construction section uses both "insert" and "append". Most 
> often "insert" means just "append". For clarity, it would be better to 
> use "append" when a node is appended as the new last child of a parent 
> and "insert" when the node is inserted before an existing child.
> 
> I do realize that "append" is just a special form of "insert", but with 
> "insert" one always has to stop and look if there's a reference node 
> mentioned.

The "insert an HTML element" steps turn into an actual insertion when 
there's a <table> involved, so I'd rather not change this. You end up 
having to say things like "except when you would normally append, insert 
instead", which makes the whole thing somewhat of a farce.


On Tue, 10 Jul 2007, Henri Sivonen wrote:
> 
> Since space characters are unconditionally not foster parented, the 
> results of the tree builder are inconsistent with Gecko and WebKit DOM 
> and rendering and with Presto rendering. If you have the character 
> tokens "foo bar" and the current node is "table", "foobar" get foster 
> parented and " " doesn't according to the current draft. WebKit and 
> Gecko foster parent "foo bar". Presto seems to achieve results that 
> render similarly except the space renders as a line break (by foster 
> parenting in the CSS box tree?).

Fixed.


On Thu, 12 Jul 2007, Jonas Sicking wrote:
> Ian Hickson wrote:
> > On Wed, 20 Jun 2007, Anne van Kesteren wrote:
> > > This also applies to the <title> element in Opera 9. Internet 
> > > Explorer 7 always drops the <title> element from the DOM. The first 
> > > <title> in document order (depth-first) is equal to document.title.
> > 
> > IE7's behaviour looks like what the spec says now, and what the spec 
> > says now matches Firefox and Safari, so I'd rather not change it.
> 
> We're actually planning on changing our behavior here to not move 
> <title> elements into the head. I really doubt that there is code out 
> there that depends on the <title> element appearing in the head since 
> there is little reason for current web authors to muck around with the 
> <title> element at all given that it doesn't do anything once it's 
> parsed. (this is IMHO a bug in current implementations).

Done.


On Fri, 13 Jul 2007, Henri Sivonen wrote:
> 
> There's an entry:
> > An end tag whose tag name is one of: "area", "basefont", "bgsound", "br",
> > "embed", "hr", "iframe", "image", "img", "input", "isindex", "noembed",
> > "noframes", "param", "select", "spacer", "table", "textarea", "wbr"
> 
> "br" should be struck there as it is already handled the way it should 
> in:
>
> > An end tag whose tag name is "br"

Fixed.


On Sun, 15 Jul 2007, Simon Pieters wrote:
> 
> Compare:
> 
>    http://www.whatwg.org/specs/web-apps/current-work/#the-initial
>    http://developer.mozilla.org/en/docs/Mozilla's_DOCTYPE_sniffing
> 
> The following cases are missing in the list of conditions that trigger quirks
> mode:
> 
>   * The public identifier is set to: "-//SoftQuad Software//DTD HoTMetaL
>     PRO 6.0::19990601::extensions to HTML 4.0//EN"
>   * The public identifier is set to: "-//SoftQuad//DTD HoTMetaL PRO
>     4.0::19971010::extensions to HTML 4.0//EN"

Fixed this as part of fixes last week.


On Wed, 25 Jul 2007, Simon Pieters wrote:
> 
>    http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C%21DOCTYPE%20html%3E%3Cp%3E%3Ctable%3E
>    http://software.hixie.ch/utilities/js/live-dom-viewer/?%3Cp%3E%3Ctable%3E
> 
> Before Acid2, AFAIK, only Mozilla parsed <p><table> as <p></p><table>, 
> and only in standards mode.
> 
> Now, as a result of Acid2, Opera does the same as Mozilla, and Safari 
> too but also in quirks mode.
> 
> I think having parsing differences between quirks mode and standards 
> mode is a bad thing. If the quirks mode behavior is required for compat 
> (which it probably is), then I think we should always parse it the 
> traditional way. Doing so would also align with IE7.

Well, has Safari run into any problems?


> Thus I suggest that the first paragraph be dropped in:
> 
>    A start tag whose tag name is "table"
>       If the stack of open elements has a p element in scope, then act as
>       if an end tag with the tag name p had been seen.
> 
>       Insert an HTML element for the token.
> 
>       Change the insertion mode to "in table".

I'm reluctant to make this change in no-quirks mode, as <table> really 
isn't a phrasing element, the way we've defined it.

Mind you, <section> right now doesn't imply a </p> either... maybe we 
should keep it that way, and make <table> join it in the weird parsiness 
that is HTML?


On Sat, 18 Aug 2007, Philip Taylor wrote:
> 
> The HTML5 parser ignores a leading line feed character in <pre> and 
> <textarea>. Current browsers do things differently in some cases.
> 
> IE 7: <pre>, <textarea>, <xmp>, <listing>, <plaintext> (and other 
> elements like <div> are totally weird, particularly when comparing the 
> four different views and when adding CSS, and also it differs in quirks 
> vs standards mode, and also it sometimes differs when loading a real 
> page vs loading with document.write)

Ok, let's ignore that then.


> Firefox 2: <pre>, <textarea>, <listing>
> 
> Opera 9.2: <pre>, <listing> (and <textarea> seems to be handled outside 
> the parser)
> 
> Safari 3: <pre>, <listing> (and <textarea> seems to be handled outside 
> the parser)
> 
> HTML5: <pre>, <textarea>
> 
> (I didn't look at any elements other than those in the above example.)

HTML5 seems like the best compromise so far then. :-)


> In particular, <listing> is consistent between all browsers but missing 
> from HTML5.

I've added it to the list for HTML5.


> IE's behaviour seems slightly useful for <xmp> and <plaintext>, since it 
> lets you write
> 
>   <xmp>
>   <!DOCTYPE HTML>
>   <title>An example HTML5 document</title>
>   <p>...
>   </xmp>
> 
> and not get an unexpected blank line at the top.

Since neither of those elements has been part of HTML for years, I'm not 
really that concerned...


On Tue, 21 Aug 2007, Simon Pieters wrote:
> >
> > If we want to be more like IE, then I'd suggest the following spec 
> > text:
> > 
> >     The head element of a document is the first head element that doesn't
> >     have a body [or frameset] element ancestor.
> > 
> >     The title element of a document is the first title element that doesn't
> >     have a body [or frameset] element ancestor.
> 
> However, IE moves title elements found in body to head in the HTML 
> parser. HTML5 says to do this as well, currently. But if we want to 
> change the parser to not move title elements to head [1], then the 
> definition might instead need to be:
> 
>    The title element of a document is the first title element.
> 
> ...because some pages might well have <title> tags in body and expect 
> them to work as titles.

Done.


> If we do this, "the head element" doesn't seem 
> to be needed at all.

It's sadly still needed for when we have elements between </head> and 
<body>.


On Fri, 7 Sep 2007, Henri Sivonen wrote:
> On Sep 5, 2007, at 05:36, Ian Hickson wrote:
> > On Sun, 2 Sep 2007, Boris Zbarsky wrote:
> > > 
> > > What we are considering doing right now is allowing <input 
> > > type="hidden"> (but not other kinds of inputs) to be direct children 
> > > of TABLE, TBODY, and TR.
> > 
> > This would be the first step down the slippery slope of making the 
> > tree construction stage in the parser generate DOMs that depend on 
> > attribute values rather than being exclusively based on tag names.
> 
> I agree on the slope being slippery but...
> 
> > This scares me, as it immediately precludes a significant possible set 
> > of optimisations.
> 
> What kind of optimizations?

Well, for instance, parsers that don't care about attributes could throw 
them entirely on the floor before. With this change, you'd always need to 
parse attributes properly to get the right DOM.


> > > Thoughts?
> > 
> > Would it be possible to instead make the <input> elements remember 
> > their parse order and use that during submission?
> 
> This would add more secret data (data not exposed through the API) to 
> the DOM nodes. Now the only anomalies are the document mode and the form 
> pointer. Do we really want to go down the road of adding more?

I guess not.


On Wed, 12 Sep 2007, Maciej Stachowiak wrote:
> 
> For what it's worth, we also ran into a regression in WebKit when we 
> stopped allowing <input type="hidden"> to be a direct child of table 
> structure elements, on a different site than the one mentioned in the 
> Mozilla bug. We will likely restore that parsing quirk, and I think it 
> would be good to make the spec require it.

Right then. Based on all the above feedback, I've made it basically act 
exactly as the space characters now do. That is, <input type=hidden> 
elements stay inside <table> elements unless some content has been 
foster-reparented, at which point any future <input type=hidden>s for that 
table get reparented too.
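
For illustration, a minimal Python sketch of that rule. Items are 
(kind, value) pairs limited to spaces, hidden inputs, and other non-table 
content; the class name, the "tainted" flag name, and the kind labels are 
all hypothetical, and real table structure (rows, cells) is left out.

```python
class TableState:
    def __init__(self):
        self.tainted = False       # has anything been foster-reparented yet?
        self.in_table = []         # items left inside the <table>
        self.foster_parented = []  # items moved out in front of the <table>


def place_in_table(table, item):
    """Decide where a piece of non-table content found in a table goes."""
    kind, _value = item
    may_stay = kind in ("space", "hidden-input")
    if may_stay and not table.tainted:
        # Spaces and <input type=hidden> stay put in an untainted table.
        table.in_table.append(item)
    else:
        # Anything else is foster-reparented, and from then on so are
        # later spaces and hidden inputs for this table.
        table.tainted = True
        table.foster_parented.append(item)
```

Keeping the flag per table means one misplaced node affects only the 
table it appeared in.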

I'm doing more research on this, but what I really would like is some 
feedback from browser vendors once they have tried implementing this.


On Thu, 11 Oct 2007, Thomas Broyer wrote:
>
> The algorithm says to "insert an HTML element for the token" when 
> encountering a start tag whose tag name is one of "meta", "link" or 
> "base"; but unlike other void elements, it doesn't say to "immediately 
> pop the current node off the stack of open elements", which means 
> following elements should be appended as children of the link, base or 
> meta element, not a sibling.

Fixed.


> For reference, html5lib and Validator.nu's HTML Parser currently don't 
> use the "insert an HTML element" algorithm for these cases, they instead 
> use a special algorithm which doesn't deal with the "stack of open 
> elements" at all. HTML Parser uses its 
> "appendVoidElementToCurrentMayFoster" algorithm for every void element 
> while html5lib follows the "insert then pop" algorithm for these (img, 
> hr, etc)

Either is conforming, yes. The effect is the same.
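
For illustration, a minimal Python sketch of why the two approaches give 
the same result. Nodes are plain dicts, both function names are 
hypothetical (they are not the html5lib or HTML Parser APIs), and foster 
parenting is ignored.

```python
def new_element(name):
    return {"name": name, "children": []}


def insert_then_pop(stack, name):
    """Insert an HTML element for the token, then immediately pop it."""
    element = new_element(name)
    stack[-1]["children"].append(element)
    stack.append(element)
    stack.pop()
    return element


def append_void_element(stack, name):
    """Append the element to the current node without touching the stack."""
    element = new_element(name)
    stack[-1]["children"].append(element)
    return element
```

In both cases the stack of open elements is unchanged afterwards, so the 
next element becomes a sibling of the void element rather than its child.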


On Wed, 14 Nov 2007, Simon Pieters wrote:
> 
> It appears that some sites use document.documentElement.firstChild and 
> expect it to be the head element even if there was whitespace before the 
> <head> tag.
> 
> Safari seems to drop whitespace before the <head> tag. Firefox seems to 
> insert any whitespace before the <head> tag in the head element, even if 
> it was also before the <html> tag or before the doctype, and it also 
> does so with comments. IE seems to drop whitespace and put comments 
> before the <html> tag as siblings to the root element and after as 
> children of the head element. Opera currently does what HTML5 says, and 
> it has caused some compat problems.

Sigh. I really wanted to roundtrip all spaces... Oh well.

Fixed, by dropping spaces before <head>, and encouraging newlines after 
the <html> start tag.


On Mon, 4 Feb 2008, Philip Taylor wrote:
> 
> 'Main page ↪ Anything else ↪ If the insertion mode is "in body" ↪ 
> A start tag whose tag name is "li"' says:
> 
>   "Finally, insert an li element."
> 
> 'Main page ↪ Anything else ↪ If the insertion mode is "in body" ↪ 
> A start tag whose tag name is one of: "dd", "dt"' says:
> 
>   "Finally, insert an HTML element with the same tag name as the token's."
> 
> Both of those sentences link to the definition of "insert an HTML 
> element for a token", but neither mention a token to insert the element 
> for. I assume it's meant to simply insert an element for the current 
> token, and the bits about names are merely to add confusion, in which 
> case both sentences should be changed to:
> 
>   "Finally, insert an HTML element for the token."
> 
> to make it clear that nothing special is happening.

Ironically, the text was originally intended to make it clear that nothing 
special is happening!

Changed as suggested.


On Fri, 8 Feb 2008, Philip Taylor wrote:
> 
> http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/tests1.dat
> has the following test case:
> 
> #data
> <b>Test</i>Test
> #errors
> Line: 1 Col: 3 Unexpected start tag (b). Expected DOCTYPE.
> Line: 1 Col: 11 End tag (i) violates step 1, paragraph 1 of the adoption
> agency algorithm.
> Line: 1 Col: 15 Expected closing tag. Unexpected end of file.
> #document
> | <html>
> |   <head>
> |   <body>
> |     <b>
> |       "TestTest"
> 
> The text-node coalescence is defined in
> http://www.w3.org/html/wg/html5/#append as:
> 
>     "When the steps below require the UA to append a character to a 
> node, the UA must collect it and all subsequent consecutive characters 
> that would be appended to that node, and insert one Text node whose data 
> is the concatenation of all those characters."
> 
> The tokeniser produces tokens [<b>, "T", "e", "s", "t", </i>, "T", "e", 
> "s", "t"]. As I read the spec, the "T" will trigger the "append a 
> character" step, so it will collect the three subsequent consecutive 
> character tokens and append one Text node "Test". Then it will ignore 
> the end tag, and then do "append a character" again and append a new 
> Text node, so the output should be
> 
> | <html>
> |   <head>
> |   <body>
> |     <b>
> |       "Test"
> |       "Test"
> 
> But I could also read the spec as meaning that once "append a character" 
> is first run, "estTest" are the characters that will subsequently be 
> appended consecutively to the <b> node, which will give the output as in 
> tests1.dat. So it would be nice to know what is correct.

The latter (and the test) is what is correct. I've specified this in more 
detail.


> Also, what should happen with:
> 
> <b>Test<script id=s>var s=document.getElementById('s');
> s.parentNode.removeChild(s)</script>Test
> 
> ? I'm not sure how this could be implemented differently to the 
> "<b>Test</i>Test" case while following the general pattern of the HTML5 
> parser algorithm, so it should be parsed the same (whichever way that 
> is).

Yeah, there's only one text node in this case too.

The evil case is:

   <div>a<table>a</table></div>

...which should also end up with just one text node.


> Also, are UAs allowed to insert a Text node before having received all 
> the characters, and append new characters later? (e.g. for incremental 
> display of a long plain-text element). I assume that should be 
> permitted. But the spec says the node must be inserted after all the 
> characters have been collected, and I expect UAs ought not to render 
> text that isn't (yet) in the Document.

Fixed to say the characters are inserted one at a time.


> So, I think it should be defined either like:
> 
>     "When the steps below require the UA to append a character to a 
> node: If the last child of the node is a Text node, then the UA must 
> append the character to that Text node; otherwise it must create a new 
> Text node whose data is the character and append it to the node."
> 
> (which would always give "TestTest"), or like
> 
>     "When the steps below require the UA to append a character to a 
> node, the UA must create one Text node whose data is the character and 
> append it to the node. While the next token is a character token that 
> would be appended in the same insertion mode, that character must 
> instead be appended to this Text node."
> 
> (which would always give "Test","Test").

I've done the first of these, though defined in a way that isn't 
restricted to appending.
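
A minimal sketch of that first rule, in Python. The Node class and the 
append_character function are hypothetical names made up for illustration; 
they are not spec text or any library's API, and the sketch covers only 
the appending case.

```python
class Node:
    """Toy DOM node (assumed for illustration); Text nodes use the name '#text'."""
    def __init__(self, name, data=""):
        self.name = name
        self.data = data
        self.children = []


def append_character(parent, ch):
    """Add ch to parent's last child if that is a Text node, else create one."""
    last = parent.children[-1] if parent.children else None
    if last is not None and last.name == "#text":
        last.data += ch
    else:
        parent.children.append(Node("#text", ch))


# <b>Test</i>Test: the stray </i> is ignored, so nothing ends up between
# the two runs of characters and they land in the same Text node.
b = Node("b")
for ch in "Test":
    append_character(b, ch)
for ch in "Test":
    append_character(b, ch)
```

Because the check is made per character against the node's last child, 
it does not matter how many ignored tokens sit between the two runs.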


On Sat, 9 Feb 2008, Simon Pieters wrote:
> 
> http://krijnhoetmer.nl/irc-logs/whatwg/20080209#l-331
> 
> Consider
> 
>    <table><style></style></table>
> 
>    <table><script></script></table>
> 
> Both html5lib and the Validator.nu HTML parser put <style> and 
> <script> elements in <table> instead of foster parenting them as the 
> spec says they should. Mozilla foster parents <style>. Safari moves 
> <style> to head. Opera and IE don't move <style>. No browser moves 
> <script>.
> 
> <table><style scoped> might be useful for table-specific styling. 
> (Especially when someone comes up with a working CSS solution for e.g. 
> column alignment.)
> 
> <table><script> is probably needed for roundtripping documents that do 
> <table><script>document.write(rows)</script></table>.
> 
> Therefore, I'd suggest that the spec be changed so that style and script 
> are not fosterparented when found in tables.

Done.


On Sun, 10 Feb 2008, Philip Taylor wrote:
>
> Consider the document:
> 
> <!DOCTYPE html>
> <table><td>x</td> <select><option>a<option>b <td>y</td>
> 
> In IE6, Firefox 2, Safari 3 and Opera 9.2, this is rendered as a select box
> with options "a" and "b" followed by a table containing cells "x" and "y".
> 
> HTML5 currently parses this as:
> 
> | <!DOCTYPE html>
> | <html>
> |   <head>
> |   <body>
> |     <select>
> |       <option>
> |         "a"
> |       <option>
> |         "b y"
> |     <table>
> |       <tbody>
> |         <tr>
> |           <td>
> |             "x"
> |           " "
> 
> (<http://parsetree.validator.nu/?doc=http://philip.html5.org/misc/fostered-select.html>)
> and does not handle the second cell in a compatible way.
> 
> (This happens because the <select> changes the insertion mode from "in 
> table" to "in select", and <td> is ignored in that mode.)

Fixed.


On Mon, 11 Feb 2008, Philip Taylor wrote:
> 
> <!DOCTYPE html> <table border> <b><p>a</b> <td> b
> 
> HTML5 instead produces something a bit like
> 
> |   <body>
> |     <b>
> |     <table>
> |       <p>
> |         <b>
> |           "a"
> |       <tbody>
> |         <tr>
> |           <td>
> |             "b"
> 
> because the adoption agency algorithm runs with common ancestor = <table>, and
> step 8 inserts the last node (<p>) into the common ancestor. That is bad since
> it puts the "a" inside the table.
> 
> Ideally the output would be a bit like
> 
> |   <body>
> |     <b>
> |     <p>
> |       <b>
> |         "a"
> |     <table>
> |       <tbody>
> |         <tr>
> |           <td>
> |             "b"
> 
> though I don't have any suggestions as to how to make it work like that.

Fixed. Please let me know if the fix actually works. For various reasons I 
can't test it right now.


On Thu, 14 Feb 2008, Philip Taylor wrote:
> 
> http://www.w3.org/html/wg/html5/#adoptionAgency uses slightly 
> inconsistent terminology for relative positioning, which makes the text 
> slightly harder to read than necessary (especially if you're 
> implementing in a language with lists ordered like [newest; older; ...; 
> oldest] so you have to mentally flip everything backwards and 
> occasionally get it wrong).
> 
> List of active formatting elements: last, end, start, after.
>
> Stack of open elements: topmost, lower, bottom, up to, prior to, after, 
> more deeply nested.
> 
> A related issue is that "list" and "stack" sound like different data 
> structures but actually seem to be very similar (and neither is a list 
> or a stack, since they occasionally require random access). But I like 
> that difference since it provides a clear distinction between the two 
> objects and prevents confusion.

Yeah, that was my reasoning for the arbitrary distinction.


> I think it would be helpful to consistently use horizontal terminology 
> for the list (like what it says already) and vertical terminology for 
> the stack (so change "prior to" to "above", and "after" to "below", and 
> remove the "more deeply nested" parenthetical since it should be clear 
> what "below" means).

Fixed.



On Fri, 15 Feb 2008, Philip Taylor wrote:
> 
> Consider a document like <table><tr><table>
> 
> After the first two tokens, we have:
> 
> Mode: in row
> Current node: <tr>
> 
> then
> 
> Token: <table>
> "in row" "Anything else" -> "Process the token as if the insertion mode was
> "in table""
> New mode, for the duration of the reprocessing step: in table

No, that just means you jump to the "in table" section, but without 
changing the actual mode. I've rephrased the spec to make this much 
clearer. Let me know if it works for you.
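
To illustrate the distinction, here is a rough Python sketch. The 
TreeBuilder class, its method names and the log attribute are all 
hypothetical; a real tree builder does far more than record tokens. The 
only point is that "in row" borrows the "in table" rules with a direct 
call and never assigns to the mode.

```python
class TreeBuilder:
    """Toy dispatcher (assumed for illustration), not a real parser."""

    def __init__(self):
        self.mode = "in row"
        self.log = []  # (token, insertion mode at the time the rules ran)

    def process(self, token):
        # Dispatch on the actual insertion mode.
        handlers = {"in row": self.in_row, "in table": self.in_table}
        handlers[self.mode](token)

    def in_row(self, token):
        # "Anything else": run the "in table" rules, leaving self.mode alone.
        self.in_table(token)

    def in_table(self, token):
        # Stand-in for the real "in table" rules: just record what was seen.
        self.log.append((token, self.mode))


tb = TreeBuilder()
tb.process("<table>")
```

While the "in table" rules run, any step in them that consults the 
insertion mode still sees "in row".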


On Wed, 27 Feb 2008, Simon Pieters wrote:
> 
> A page (a Sun tutorial on <applet> that is now gone) broke in Opera 
> because it had markup like:
> 
>    <p>
>     <applet>
>      <p><center></center></p>
>      Blah
>     </applet>
>    </p>
> 
> ...and we parse it as if it were:
> 
>    <p>
>     <applet>
>      <p></p><center></center></applet></p>
>      Blah
>    <p></p>
> 
> Per HTML5 currently it should be parsed as if it were:
> 
>    <p>
>     <applet>
>      </applet></p><p></p><center></center><p></p>
>       Blah
>    <p></p>
> 
> The net result in both Opera's case and HTML5 is that the applet and the 
> text "Blah" are both shown, which is not what was intended.
> 
> It appears that in IE and Firefox (but not Safari), applet is a "scoping 
> element" just like object.
> 
> I don't know how many pages break because of this, but perhaps HTML5 
> should align with IE and Firefox here.

Done.
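
Roughly, the effect on the "in scope" check looks like the Python sketch 
below. The function name and the SCOPING set are assumptions for 
illustration, and the set is only a subset of the real list of scoping 
elements.

```python
# Illustrative subset of the scoping elements (assumed), now with applet.
SCOPING = {"applet", "object", "table", "html"}


def has_in_scope(stack, target):
    """stack is the stack of open elements, with the current node last."""
    for name in reversed(stack):
        if name == target:
            return True
        if name in SCOPING:
            # A scoping element hides everything above it on the stack.
            return False
    return False


# Stack of open elements when the inner <p> start tag is seen.
stack = ["html", "body", "p", "applet"]
outer_p_visible = has_in_scope(stack, "p")
```

With applet acting as a scoping element the outer <p> is not in scope at 
that point, so the inner <p> no longer closes the <applet> and the outer 
<p>.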

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 5 March 2008 07:46:16 UTC