[whatwg] On tag inference

On Mon, 29 Aug 2005, Henri Sivonen wrote:
> What kind of approach to tag inference can HTML5 be expected to take?

It uses a specific algorithm modelled on a finite state machine.

> Example:
> <p><foo>
> Is 'foo' an element that is not allowed as a child of 'p' and, therefore,
> implicitly closes the 'p'? Or is 'foo' not on the list of elements that close
> 'p' and, therefore, does not implicitly close it? Which way are the inference
> rules going to be defined?

<foo> is a phrasing element (as are all unknown elements) and is therefore 
treated like <span>, and never closes a block.

The inference rules are defined basically on a case-by-case basis.

> * I am assuming an implementation maintains a stack of open elements or works
> directly on a parser tree in which case the path from the current node to the
> root has the same role as the stack.

Agreed. The spec uses this model too.

> As far as I can tell, there are four kinds of inference needed when parsing
> *conforming* documents (ie. no second stack for residual style):
> 1) Element end causes the end of the elements that [are] on the top of the
> stack*.

These cases are common and vary in the details, but as a whole are those 
that say "generate implied end tags".
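
As a rough illustration of "generate implied end tags" on a stack of
open elements (a sketch only, not the spec's algorithm; the IMPLIED_END
set and both function names are assumptions made up for this example):

```python
# Hypothetical set of elements whose end tag may be implied.
# Illustrative only; the real list is defined by the spec.
IMPLIED_END = {"p", "li", "dt", "dd"}


def generate_implied_end_tags(stack, except_for=None):
    """Pop elements with implied end tags off the top of the stack."""
    while stack and stack[-1] in IMPLIED_END and stack[-1] != except_for:
        stack.pop()
    return stack


def end_tag(stack, name):
    """Handle an end tag: close implied elements above it, then the match."""
    if name not in stack:
        return stack  # stray end tag: simply ignored in this sketch
    generate_implied_end_tags(stack, except_for=name)
    # Pop up to and including the matching element.
    while stack:
        if stack.pop() == name:
            break
    return stack


# <ul><li><p> ... </ul>  closes the open <p> and <li> as well as the <ul>.
open_elements = ["html", "body", "ul", "li", "p"]
end_tag(open_elements, "ul")
print(open_elements)
```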

> 2) End of the data stream causes the end of the element that is on the top of
> the stack.

The end of the data stream is handled just as a special token, and ends 
processing (so the state of end tags is largely irrelevant, though as 
defined it does close some of the tags, yes).

> 3) Element start causes the end of the element that is on the top of the
> stack.

This only happens for <li>, <p>, <dt> and <dd> but is indeed one of the 
cases. Each of those four elements has special ways of handling the 
closing of previous elements of that type and of the other three types.
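
A minimal sketch of that start-tag case, assuming a simple lookup from
each of the four start tags to the open elements it closes. The contents
of the CLOSES table are guesses for illustration (the text above only
says that each element has its own rules), and the sketch looks only at
the top of the stack:

```python
# Hypothetical table: start tag name -> open elements it implicitly closes.
CLOSES = {
    "p": {"p"},
    "li": {"li", "p"},
    "dt": {"dt", "dd", "p"},
    "dd": {"dt", "dd", "p"},
}


def start_tag(stack, name):
    """Push a start tag, first closing whatever it implicitly ends."""
    closable = CLOSES.get(name, set())
    # Repeatedly close the element on top of the stack while it is one
    # that this start tag ends. Any other start tag closes nothing.
    while stack and stack[-1] in closable:
        stack.pop()
    stack.append(name)
    return stack


# <ul><li><p> followed by <li>: the open <p> and <li> are closed first.
open_elements = ["ul", "li", "p"]
start_tag(open_elements, "li")
print(open_elements)
```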

> 4) Element start causes another element start before itself.

These are handled by the state machine model.
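
To make that concrete, here is a toy state machine covering only the
'title' example quoted further down in this message (null, then 'html',
then 'head'). The mode names and the feed_start_tag function are
assumptions for the sketch, not the spec's definitions:

```python
def feed_start_tag(state, name):
    """Process one start tag; return every start tag emitted, implied or real.

    `state` is a dict with a 'mode' string and a 'stack' list.
    """
    emitted = []
    while True:
        mode = state["mode"]
        if mode == "initial":
            tag = "html"  # nothing open yet: an <html> start tag comes first
            state["mode"] = "before head"
        elif mode == "before head":
            tag = "head"  # <html> is open: a <head> start tag comes next
            state["mode"] = "in head"
        else:
            tag = name  # in head: the tag goes through unchanged
        state["stack"].append(tag)
        emitted.append(tag)
        if tag == name:
            return emitted
        # Otherwise the tag just emitted was implied; reprocess `name`
        # in the new mode.


state = {"mode": "initial", "stack": []}
print(feed_start_tag(state, "title"))
```

The rules are applied repeatedly, as in the quoted example: each pass
either emits an implied start tag and changes mode, or lets the real
start tag through.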

> Is this list complete?

Some end tags can cause other elements to close even though they don't 
match, e.g. </h1> closing <h2>. There's also the complex case of residual 
style inlines ("formatting" elements). As a whole the spec handles each of 
these cases separately. Since the rules vary so much from element to 
element it's hard to be specific about which cases are "end tag 

> I am assuming that for *conforming* documents, the above-mentioned 
> inference decisions can be taken by observing the top of the stack and 
> the element name associated with the current end or start element event. 
> Correct? (I am assuming the rules may be applied repeatedly. Ie. null on 
> stack and start 'title' implies 'html' start. 'html' on stack and start 
> 'title' implies 'head' start. 'head' on stack and start 'title' implies 
> nothing and the start 'title' goes through.)

I don't know, I haven't considered conforming documents, they are an edge 
case which the spec handles by virtue of handling the common case.

> It seems to me that #3 is the tricky case in terms of interaction with 
> unknown element names. #1 and #2 require a list of elements whose end 
> tag is optional. #4 requires a map of top of stack plus current start 
> pairs to inferred start tags.

Unknown elements turn out to be near-trivial to cater for, because they 
are the simplest kind of tag -- you treat them as inlines that are closed 
by any end tag that isn't correctly nested, and you make their start tags 
have no effect on tag inference.
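
A small sketch of that treatment, using the <p><foo> example from the
top of this message. The function names are made up, and the only known
inference rule modelled here is a new <p> closing an open <p>:

```python
def start_tag(stack, name):
    """Push a start tag; only the known <p> rule does any inference."""
    if name == "p" and stack and stack[-1] == "p":
        stack.pop()  # known element: a new <p> closes the open one
    # Unknown names such as 'foo' reach this point directly: their
    # start tags have no effect on tag inference.
    stack.append(name)
    return stack


def end_tag(stack, name):
    """Close the named element along with anything left open inside it."""
    if name in stack:
        # Pop up to and including the match. Unknown elements still open
        # above it are closed here by an end tag that is not their own.
        while stack.pop() != name:
            pass
    return stack


open_elements = []
start_tag(open_elements, "p")
start_tag(open_elements, "foo")
print(open_elements)  # <foo> did not close <p>
end_tag(open_elements, "p")
print(open_elements)  # </p> closed both <foo> and <p>
```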

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 10 March 2006 13:12:53 UTC