Re: Tree construction: Coalescing text nodes from Geoffrey Sneddon on 2009-11-18 (public-html@w3.org from November 2009)

From: Geoffrey Sneddon <gsneddon@opera.com>
Date: Wed, 18 Nov 2009 17:41:26 +0100
To: Henri Sivonen <hsivonen@iki.fi>
CC: Ian Hickson <ian@hixie.ch>, public-html@w3.org, pjt47@cam.ac.uk
Message-ID: <4B0423B6.1010202@opera.com>
Henri Sivonen wrote:
> On Nov 13, 2009, at 14:15, Geoffrey Sneddon wrote:
> 
>> Henri Sivonen wrote:
>>> On Nov 13, 2009, at 12:06, Geoffrey Sneddon wrote:
>>>> However, I think that such implementations are probably more important in terms of the structure of the DOM created (because they are more likely to support scripting), and as such it seems silly to have anything apart from a single text node in all cases, especially when such implementations can likely have a single text node backed by multiple strings internally.
>>> It's not necessarily silly not to require browsers to coalesce in all cases. Would you make parser-inserted text nodes coalesce into script-created text nodes or parser-created older-than-previous text nodes that a script has moved around?
>> No, but I would expect the parser (without executing any script) to always create a DOM with no adjacent text nodes. If you start manually manipulating the DOM via scripting I'd expect to end up with the DOM I created (e.g., if I appendChild a text node I would expect a text node to be appended, I wouldn't expect, ever, to get a single text node if there was already a text node as the last child).
> 
> That wasn't quite the case I was asking about. Concretely, I was asking about the following (illustrated here with document.write, but I'm also asking about the case where the document.write boundaries are network buffer boundaries instead):
> document.write("<div id=thediv>");
> document.getElementById('thediv').appendChild(document.createTextNode("foo"));
> document.write("bar");
> 
> One text node with data "foobar" or two text nodes: "foo" followed by "bar"? Does it matter?

I would expect that to create a single text node. I would intuitively 
expect the parser to append if the last child is a text node. If you 
reversed the last two lines I'd expect to get two.
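
To make that expectation concrete, here is a minimal sketch using a toy
model rather than a real DOM. The helper names (makeElement,
scriptAppendText, parserInsertText, textData) are made up for
illustration; the only rule being shown is "the parser appends to the
last child if it is a text node, while script insertion always adds a
new node".

```javascript
// Toy model, not a real DOM: an element is just a list of child nodes.
function makeElement() {
  return { children: [] };
}

// Script-style insertion (like appendChild(createTextNode(data))):
// always adds a new text node.
function scriptAppendText(parent, data) {
  parent.children.push({ type: "text", data });
}

// Parser-style insertion under the rule described above (an assumption of
// this sketch): extend the last child if it is a text node, else create one.
function parserInsertText(parent, data) {
  const last = parent.children[parent.children.length - 1];
  if (last && last.type === "text") {
    last.data += data;
  } else {
    parent.children.push({ type: "text", data });
  }
}

// Lists the data of each child text node, in order.
function textData(parent) {
  return parent.children.map((node) => node.data);
}

// Script appends "foo", then the parser sees "bar".
const scriptFirst = makeElement();
scriptAppendText(scriptFirst, "foo");
parserInsertText(scriptFirst, "bar");
console.log(textData(scriptFirst)); // a single text node

// Last two lines reversed: parser sees "bar", then script appends "foo".
const parserFirst = makeElement();
parserInsertText(parserFirst, "bar");
scriptAppendText(parserFirst, "foo");
console.log(textData(parserFirst)); // two text nodes
```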

> document.write("<div id=thediv>");
> document.write("foo");
> document.write("bar");
> 
> One text node with data "foobar" or two text nodes: "foo" followed by "bar"? Does it matter?

One text node. I think making document.write behave differently than 
just inserting characters at the current position of the input stream is 
a bad idea, _especially_ if you treat network buffer boundaries the same 
way (as they /surely/ should have no effect on parsing).
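
As a rough illustration of why chunk boundaries should not matter, the
sketch below feeds the same characters to a toy coalescing "parser" in
different chunkings. parseTextChunks is a hypothetical helper, it only
handles character data, and it is not how any real parser is
structured.

```javascript
// Toy model: each chunk stands for one document.write call or one
// network buffer. Text is coalesced into the last text node, if any.
function parseTextChunks(chunks) {
  const children = [];
  for (const chunk of chunks) {
    if (chunk === "") continue; // an empty chunk inserts no characters
    const last = children[children.length - 1];
    if (last && last.type === "text") {
      last.data += chunk;
    } else {
      children.push({ type: "text", data: chunk });
    }
  }
  return children.map((node) => node.data);
}

const oneWrite = parseTextChunks(["foobar"]);
const twoWrites = parseTextChunks(["foo", "bar"]);
const oddBuffers = parseTextChunks(["f", "", "oob", "ar"]);

// With coalescing, all three chunkings produce the same child list.
console.log(oneWrite, twoWrites, oddBuffers);
```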

As for the "does it matter" question in both of these examples, I think 
it comes down to whether we want to make another thing non-deterministic 
in the parser (the only such thing currently is aborting parsing on a 
parse error).

Another interesting example:

document.write("<div id=thediv>1");
document.getElementById('thediv').appendChild(document.createTextNode("2"));
document.write("3");

Current behaviour here is interesting:

Firefox gives two text nodes, the first containing "13" and the second 
containing "2" (this seems counter-intuitive, so I hope we can all agree 
that this shouldn't be done).

WebKit gives two text nodes also, the first containing "2" and the 
second containing "13" (i.e., the same as Firefox but in the opposite 
order; this too seems counter-intuitive).

Opera gives three text nodes: "1", "2", and "3". This is the only 
behaviour of the three browsers that seems at all sensible.

I think the behaviour of Firefox and WebKit in this case shows the 
danger of not coalescing text nodes in the document.write and/or 
scripting case: you can very easily end up with quite weird behaviour. I 
think the easiest way to try and avoid implementations introducing such 
bugs is to just require coalescing in all cases from the parser.
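
To show how such results can arise, here is a toy sketch (not a claim
about how any browser is actually implemented) comparing two
hypothetical strategies on the "1", "2", "3" example: one keeps
extending the text node the parser itself created last, and the other
looks at the actual last child. All names below are invented for the
sketch.

```javascript
// Runs the three-step example against a given parser text-insertion strategy.
function runExample(insertParserText) {
  const div = { children: [] };
  const state = { lastParserText: null };
  insertParserText(div, state, "1");              // document.write("<div id=thediv>1")
  div.children.push({ type: "text", data: "2" }); // appendChild(createTextNode("2"))
  insertParserText(div, state, "3");              // document.write("3")
  return div.children.map((node) => node.data);
}

// Strategy A (assumed): extend the node the parser remembers creating,
// without checking whether it is still the last child.
function extendRememberedNode(parent, state, data) {
  if (state.lastParserText) {
    state.lastParserText.data += data;
  } else {
    state.lastParserText = { type: "text", data };
    parent.children.push(state.lastParserText);
  }
}

// Strategy B (assumed): extend whatever text node is currently the last
// child, regardless of who created it.
function extendLastChild(parent, state, data) {
  const last = parent.children[parent.children.length - 1];
  if (last && last.type === "text") {
    last.data += data;
  } else {
    parent.children.push({ type: "text", data });
  }
}

const remembered = runExample(extendRememberedNode);
const lastChild = runExample(extendLastChild);
console.log(remembered); // [ '13', '2' ]
console.log(lastChild);  // [ '1', '23' ]
```

Strategy A reproduces the "13" followed by "2" result described above, 
while strategy B yields "1" followed by "23" under this model.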

> In foster-parenting cases, that's not enough. Consider: <table><tr>f<td>c</td>f
> 
> Here, when the second 'f' is foster-parented, the cell content 'c' is the text node the parser inserted last. Now, if foster-parenting examines the DOM to see if the foster parent already has a text node previous sibling (in order to merely extend that text node), the previous sibling could be script-created.

I don't think that's a real problem (that the previous sibling could be 
script-created).

> Does specifying whether foster-parented text coalesces or not really matter for interop? (I believe coalescing all non-foster-parented parser-inserted text does matter for interop.)

For interop? Yes. For web compat? No. It seems a silly thing to leave as 
a single implementation-specific detail (leaving out error handling for 
now).

> Is it really bad for the parser to extend script-created text nodes?

No. I think, as I said above, that always coalescing is a good idea to 
avoid weird bugs. I also think that even coalescing in such cases 
shouldn't be gratuitously expensive overall.

-- 
Geoffrey Sneddon — Opera Software
<http://gsnedders.com/>
<http://www.opera.com/>
Received on Wednesday, 18 November 2009 16:42:20 UTC