Coalescing characters in Text nodes from Philip Taylor on 2008-02-08 (public-html@w3.org from February 2008)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Fri, 08 Feb 2008 14:45:58 +0000
To: HTML WG <public-html@w3.org>
Message-ID: <47AC6B26.3080000@cam.ac.uk>
http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/tests1.dat 
has the following test case:

#data
<b>Test</i>Test
#errors
Line: 1 Col: 3 Unexpected start tag (b). Expected DOCTYPE.
Line: 1 Col: 11 End tag (i) violates step 1, paragraph 1 of the adoption 
agency algorithm.
Line: 1 Col: 15 Expected closing tag. Unexpected end of file.
#document
| <html>
|   <head>
|   <body>
|     <b>
|       "TestTest"

The text-node coalescence is defined in 
http://www.w3.org/html/wg/html5/#append as:

     "When the steps below require the UA to append a character to a 
node, the UA must collect it and all subsequent consecutive characters 
that would be appended to that node, and insert one Text node whose data 
is the concatenation of all those characters."

The tokeniser produces tokens [<b>, "T", "e", "s", "t", </i>, "T", "e", 
"s", "t"]. As I read the spec, the "T" will trigger the "append a 
character" step, so it will collect the three subsequent consecutive 
character tokens and append one Text node "Test". Then it will ignore 
the end tag, and then do "append a character" again and append a new 
Text node, so the output should be

| <html>
|   <head>
|   <body>
|     <b>
|       "Test"
|       "Test"

But I could also read the spec as meaning that once "append a character" 
is first run, "estTest" are the characters that would subsequently be 
appended consecutively to the <b> node, which gives a single Text node 
"TestTest", as in tests1.dat. It would be good to know which reading is 
correct.


Also, what should happen with:

<b>Test<script id=s>var s=document.getElementById('s'); 
s.parentNode.removeChild(s)</script>Test

? I'm not sure how this could be implemented differently to the 
"<b>Test</i>Test" case while following the general pattern of the HTML5 
parser algorithm, so it should be parsed the same (whichever way that is).


Firefox 2, Opera 9.5 and Safari 3 create two adjacent text nodes in the 
<script> case, and IE6 can't be tested since it doesn't delete the <script>.

In the </i> case, Firefox produces one text node, Opera and Safari 
produce two, and IE can't be tested since it makes an element named "/I".

Using "<b>Test</li>Test" instead, IE6 produces one text node, and the 
others behave the same as with </i>.


Also, are UAs allowed to insert a Text node before having received all 
the characters, and append new characters later? (e.g. for incremental 
display of a long plain-text element). I assume that should be 
permitted. But the spec says the node must be inserted after all the 
characters have been collected, and I expect UAs ought not to render 
text that isn't (yet) in the Document.


So I think it should be defined either as:

     "When the steps below require the UA to append a character to a 
node: If the last child of the node is a Text node, then the UA must 
append the character to that Text node; otherwise it must create a new 
Text node whose data is the character and append it to the node."

(which would always give "TestTest"), or as

     "When the steps below require the UA to append a character to a 
node, the UA must create one Text node whose data is the character and 
append it to the node. While the next token is a character token that 
would be appended in the same insertion mode, that character must 
instead be appended to this Text node."

(which would always give "Test","Test").
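
To make the difference concrete, here is a rough Python sketch of the two 
wordings applied to the token stream for "<b>Test</i>Test". It is a toy 
model, not html5lib code: the token tuples, the TOKENS list and both 
function names are made up for illustration, only the children of <b> are 
tracked, the stray </i> is simply ignored, and insertion modes are not 
modelled at all.

```python
# Hypothetical token stream for "<b>Test</i>Test" (not html5lib's token format).
TOKENS = (
    [("start", "b")]
    + [("char", c) for c in "Test"]
    + [("end", "i")]
    + [("char", c) for c in "Test"]
)


def coalesce_by_last_child(tokens):
    """First wording: append to the last child if it is a Text node."""
    children = []  # children of <b>; every child in this model is a Text node (str)
    for kind, value in tokens:
        if kind != "char":
            continue  # <b> opens the parent; the stray </i> is ignored
        if children:
            children[-1] += value  # last child is a Text node: extend it
        else:
            children.append(value)  # no Text node yet: create one
    return children


def coalesce_by_consecutive_tokens(tokens):
    """Second wording: only directly consecutive character tokens share a node."""
    children = []
    previous_was_char = False
    for kind, value in tokens:
        if kind == "char":
            if previous_was_char:
                children[-1] += value  # continue the current run
            else:
                children.append(value)  # any other token ended the run
        previous_was_char = kind == "char"
    return children


print(coalesce_by_last_child(TOKENS))          # ['TestTest']
print(coalesce_by_consecutive_tokens(TOKENS))  # ['Test', 'Test']
```

Under the first wording the ignored </i> leaves no trace, because the last 
child of <b> is still a Text node when the second run of characters 
arrives. Under the second wording any intervening token, even an ignored 
one, ends the run.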

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Friday, 8 February 2008 14:46:41 UTC