Motivation of the generic [R]CDATA parsing algorithm (detailed review of parsing algorithm) from Henri Sivonen on 2007-07-01 (public-html@w3.org from July 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 1 Jul 2007 21:33:14 +0300
To: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <57CE4D71-DD48-4FD2-ACD5-45653D404C6B@iki.fi>

(This is part of my detailed review of the parsing algorithm.)

In the tree construction part of the parsing algorithm, no rationale  
is given for the way the generic [R]CDATA parsing algorithm is  
formulated. The formulation is unusual compared to the rest of the  
chapter, so it is reasonable to expect that there is a specific  
reason for writing it this way.

My practical concern is this:
In my implementation the tokenizer owns the main processing loop.  
Therefore, the tree builder can only change its state on a per-token  
basis and cannot pull another token in response to processing one  
token. (Instead, it can set its own flags, return control to the  
tokenizer and wait for the tokenizer to call back into the tree  
builder again.)
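
The control flow described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the implementation discussed in this message; the class names, the method names and the token tuples are all hypothetical.

```python
class TreeBuilder:
    """Hypothetical tree builder: it is only ever called, it never pulls."""

    def __init__(self):
        self.seen = []
        self.expecting_more = False  # example of a flag kept between callbacks

    def process_token(self, token):
        # React to exactly one token, record any state, and return.
        self.seen.append(token)
        if token == ("start", "title"):
            # The builder cannot fetch the next token here; it can only
            # remember that it is waiting for one.
            self.expecting_more = True
        elif token == ("end", "title"):
            self.expecting_more = False


class Tokenizer:
    """Hypothetical tokenizer that owns the main processing loop."""

    def __init__(self, tokens, tree_builder):
        self.tokens = tokens  # stand-in for tokenizing real input
        self.tree_builder = tree_builder

    def run(self):
        for token in self.tokens:
            # Push one token to the builder, then regain control.
            self.tree_builder.process_token(token)


builder = TreeBuilder()
Tokenizer([("start", "title"), ("char", "x"), ("end", "title")], builder).run()
```

The point of the sketch is that process_token() has no way to ask for the next token, so any multi-token step in the spec has to be spread over several callbacks.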

I have solved the problem as follows:

cdataOrRcdataTimesToPop is initialized to 0.

When the spec invokes the generic [R]CDATA parsing algorithm, instead  
of running it, do the following:
1. If the context node is the current node,
  1a. Create an element for the token.
  1b. Push the element.
  1c. Set the content model flag of the tokenizer.
  1d. Set cdataOrRcdataTimesToPop to 1.
2. Otherwise, if the context node is not the current node,
  2a. Push the context node.
  2b. Create an element for the token.
  2c. Push the element.
  2d. Set the content model flag of the tokenizer.
  2e. Set cdataOrRcdataTimesToPop to 2.

Modify the processing of character tokens and end tag tokens as follows:

3. If a character token is seen and cdataOrRcdataTimesToPop > 0,
  3a. Append the character token to the current node.
  3b. Omit the normal processing of character tokens.
4. If an end tag token is seen and cdataOrRcdataTimesToPop > 0,
  (The token will always be the end tag for the [R]CDATA element.)
  4a. Pop cdataOrRcdataTimesToPop times.
  4b. Set cdataOrRcdataTimesToPop to 0.
  4c. Omit normal end tag token processing.
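
Steps 1 to 4 above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual implementation: Element, TreeBuilder, the tokenizer_flags dict and the (kind, data) token shape are hypothetical, and appending the new element to the context node is an assumption that the steps above do not spell out.

```python
class Element:
    """Minimal stand-in for a DOM element (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.children = []  # child Elements and text strings


class TreeBuilder:
    def __init__(self, stack, tokenizer_flags):
        self.stack = stack                      # open elements; last is the current node
        self.tokenizer_flags = tokenizer_flags  # hypothetical dict read by the tokenizer
        self.cdata_or_rcdata_times_to_pop = 0

    def start_cdata_or_rcdata(self, tag_name, context_node, content_model):
        """Runs instead of the generic [R]CDATA parsing algorithm (steps 1 and 2)."""
        if context_node is self.stack[-1]:
            times_to_pop = 1                    # step 1: only the new element is pushed
        else:
            self.stack.append(context_node)     # step 2a
            times_to_pop = 2
        element = Element(tag_name)             # steps 1a / 2b
        context_node.children.append(element)   # assumption: not stated in the steps
        self.stack.append(element)              # steps 1b / 2c
        self.tokenizer_flags["content_model"] = content_model  # steps 1c / 2d
        self.cdata_or_rcdata_times_to_pop = times_to_pop       # steps 1d / 2e

    def process_token(self, kind, data):
        """Returns True if the token was consumed here, False for normal processing."""
        if self.cdata_or_rcdata_times_to_pop == 0:
            return False
        if kind == "char":                      # step 3
            children = self.stack[-1].children
            if children and isinstance(children[-1], str):
                children[-1] += data            # extend the existing text
            else:
                children.append(data)
            return True
        if kind == "end":                       # step 4
            for _ in range(self.cdata_or_rcdata_times_to_pop):
                self.stack.pop()
            self.cdata_or_rcdata_times_to_pop = 0
            return True
        return False


html, head, body = Element("html"), Element("head"), Element("body")
flags = {"content_model": "PCDATA"}
builder = TreeBuilder([html, body], flags)
# The context node (head) is not the current node (body), so two pops are scheduled.
builder.start_cdata_or_rcdata("title", head, "RCDATA")
for kind, data in [("char", "H"), ("char", "i"), ("end", "title")]:
    builder.process_token(kind, data)
```

The sketch does not model EOF or the tokenizer switching its content model flag back, since the steps above do not cover either.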

I'd like to know if this transformation breaks some important  
property that follows from the formulation in the spec.

Specifically, the spec says:
> 7. If the next token is an end tag token with the same tag name as  
> the start tag token, ignore it. Otherwise, this is a parse error.

How could you see any other token but an end tag token with the same  
tag name as the start tag token, a character token or EOF?

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 1 July 2007 18:33:25 UTC