Re: Tree construction: Coalescing text nodes from Jonas Sicking on 2009-11-19 (public-html@w3.org from November 2009)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 19 Nov 2009 00:59:13 -0800
To: Geoffrey Sneddon <gsneddon@opera.com>
Cc: Henri Sivonen <hsivonen@iki.fi>, Ian Hickson <ian@hixie.ch>, public-html@w3.org, pjt47@cam.ac.uk
Message-ID: <63df84f0911190059m1d56a113mfcc7e725795d948a@mail.gmail.com>

On Wed, Nov 18, 2009 at 8:41 AM, Geoffrey Sneddon <gsneddon@opera.com> wrote:
> Henri Sivonen wrote:
>>
>> On Nov 13, 2009, at 14:15, Geoffrey Sneddon wrote:
>>
>>> Henri Sivonen wrote:
>>>>
>>>> On Nov 13, 2009, at 12:06, Geoffrey Sneddon wrote:
>>>>>
>>>>> However, I think that such implementations are probably more important
>>>>> in terms of the structure of the DOM created (because they are more likely
>>>>> to support scripting), and as such it seems silly to have anything apart
>>>>> from a single text node in all cases, especially when such implementations
>>>>> can likely have a single text node backed by multiple strings internally.
>>>>
>>>> It's not necessarily silly not to require browsers to coalesce in all
>>>> cases. Would you make parser-inserted text nodes coalesce into
>>>> script-created text nodes or parser-created older-than-previous text nodes
>>>> that a script has moved around?
>>>
>>> No, but I would expect the parser (without executing any script) to
>>> always create a DOM with no adjacent text nodes. If you start manually
>>> manipulating the DOM via scripting I'd expect to end up with the DOM I
>>> created (e.g., if I appendChild a text node I would expect a text node to be
>>> appended, I wouldn't expect, ever, to get a single text node if there was
>>> already a text node as the last child).
>>
>> That wasn't quite the case I was asking about. Concretely, I was asking
>> about the following (illustrated here as document.write but I'm also asking
>> about the case where the document.write boundaries are network buffer
>> boundaries instead):
>>
>> document.write("<div id=thediv>");
>> document.getElementById('thediv').appendChild(document.createTextNode("foo"));
>> document.write("bar");
>>
>> One text node with data "foobar" or two text nodes: "foo" followed by
>> "bar"? Does it matter?
>
> I would expect that to create a single text node. I would intuitively expect
> the parser to append if the last child is a text node. If you reversed the
> last two lines I'd expect to get two.

I think this adds unnecessary complexity and performance cost to the
parser. The intuitive (to me) implementation is for the parser to keep
a reference to the last textnode it has inserted. Whenever more text
data is parsed, append the text to that textnode. Whenever non-text
data is parsed, drop the reference to the textnode. Whenever text data
comes in and the parser doesn't hold a reference to the textnode,
create a new textnode and append to the end of the current insertion
container.
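
As a rough illustration of the approach just described, here is a toy
model in JavaScript. It is only a sketch under stated assumptions:
ToyParser, ElementNode and TextNode are made-up names, not any browser's
internals, and end tags and foster parenting are left out. It replays the
appendChild("foo") / document.write("bar") case quoted above.

```javascript
// Toy DOM: just enough structure to show the coalescing rule.
// All class and method names here are hypothetical.
class TextNode {
  constructor(data) { this.data = data; }
}

class ElementNode {
  constructor(name) { this.name = name; this.children = []; }
  appendChild(node) { this.children.push(node); return node; }
}

class ToyParser {
  constructor(root) {
    this.current = root;   // current insertion container
    this.lastText = null;  // last text node the parser itself inserted
  }

  // Non-text data: drop the text node reference, then insert the element.
  startTag(name) {
    this.lastText = null;
    this.current = this.current.appendChild(new ElementNode(name));
    return this.current;
  }

  // Text data: extend our own node if we still hold it, else create one.
  // Note that the DOM is never inspected for an existing last child.
  text(data) {
    if (this.lastText) {
      this.lastText.data += data;
    } else {
      this.lastText = this.current.appendChild(new TextNode(data));
    }
  }
}

const body = new ElementNode("body");
const parser = new ToyParser(body);

const div = parser.startTag("div");     // document.write("<div id=thediv>")
div.appendChild(new TextNode("foo"));   // script-created, unknown to the parser
parser.text("bar");                     // document.write("bar")

const result = div.children.map((node) => node.data);
console.log(result);
```

Under this model the run yields two text nodes, "foo" followed by "bar":
the parser holds no reference after the start tag, so it creates a fresh
node rather than looking at what a script put there.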

The only case where this breaks down is if someone mixes DOM insertion
with document.write or network-parser inserted content. Mixing
document.write with DOM insertions seems like a very odd coding
pattern to me, so I don't see a reason to optimize for it. Mixing
network-parser and DOM insertion is inherently racy, so here it's
arguably even beneficial if parser-created nodes and DOM created nodes
are never merged.

>> document.write("<div id=thediv>");
>> document.write("foo");
>> document.write("bar");
>>
>> One text node with data "foobar" or two text nodes: "foo" followed by
>> "bar"? Does it matter?
>
> One text node.

I agree. Though note that after the second call to document.write the
textnode must exist in the DOM.

> Another interesting example:
>
> document.write("<div id=thediv>1");
> document.getElementById('thediv').appendChild(document.createTextNode("2"));
> document.write("3");
>
> Current behaviour here is interesting:
>
> Firefox gives two text nodes, the first containing "13" and the second
> containing "2" (this seems counter-intuitive, so I hope we can all agree
> that this shouldn't be done).

I disagree, see above.

> WebKit gives two text nodes also, the first containing "2" and the second
> containing "13" (i.e., the same as Firefox but in the opposite order; this
> too seems counter-intuitive).

The reverse order here is surprising to me. Is there a textnode for
the "1" in the DOM after the first document.write?

>> In foster-parenting cases, that's not enough. Consider:
>> <table><tr>f<td>c</td>f
>>
>> Here, when the second 'f' is foster-parented, the cell content 'c' is the
>> text node the parser inserted last. Now, if foster-parenting examines the
>> DOM to see if the foster parent already has a text node previous sibling (in
>> order to merely extend that text node), the previous sibling could be
>> script-created.

Or even worse:

document.write("<table>   ");
document.write("x");
document.write("<td></table>");

After the first document.write, is there a textnode in the DOM for the
whitespace? Where does it appear? Is it foster-parented or not?
After the second document.write, where is the "x" inserted? In its own
textnode or somehow coalesced with the whitespace node?

/ Jonas
Received on Thursday, 19 November 2009 09:00:10 UTC