[whatwg] Tag Soup: Blocks-in-inlines from Henri Sivonen on 2006-01-25 (public-whatwg-archive@w3.org from January 2006)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 25 Jan 2006 18:21:55 +0200
Message-ID: <75914E68-E052-46B6-AE87-861B7A8E2D46@iki.fi>
On Jan 25, 2006, at 12:09, Lachlan Hunt wrote:

> This is in response to Hixie's article [1].

I had had such a strong intuitive assumption of what Gecko and  
WebCore were doing that I was surprised to learn their behavior is  
indeed much hairier. (I hadn't even verified my assumption by  
checking the sources, because it seemed so obvious to me that Gecko &  
WebCore were doing what I thought they were doing...)

Anyway, here's what I thought they were doing:

There's low-level parser is kind of like a tag-level lexer and emits  
a (non-well-formed) sequence of SAX-like events like startTag,  
characters, endTag and comment (in my parser* HtmlParser.java). These  
events don't go to the DOM builder / content sink directly. Instead,  
there's a filter layer that takes care of tag inference and emits a  
well-formed event stream (TagInferenceFilter.java and  
EmptyElementFilter.java in my parser). Additionally, there's a filter  
(not present in my parser, which is designed for conformance  
checking; this may need to be integrated into the tag inference  
filter) that performs the "residual style" fixups. It works like this  
(assuming that there is no need for legitimate tag inference at the  
same time):

A stack is used for keeping track of the open elements. When startTag  
is seen, the topmost element of the stack and the name of the new  
element are compared to a static table to see if the new element can  
occur as a child of the topmost element on the stack. If it can, the  
new element is pushed on the stack and echoed forward in the pipeline.

If the element start was for an inline element, a second residual  
style stack is inspected. This also happens when characters are  
reported. If there are items in the residual style stack, the stack  
is popped and the popped element is echoed forward in the pipeline  
and pushed onto the open element stack. The items on the stacks  
include not only element names but attributes as well.

When the residual style stack is empty, the inline content (startTag  
of an inline element or characters) from the lower layer is echoed  
forward in the pipeline (pushing the element on the open element  
stack if it was startTag and not characters).

When an endTag is seen, if it matches the topmost item of the open  
element stack, the stack is popped end the endTag event (now actually  
an endElement event) is echoed forward in the pipeline.

If, however, the endTag and the open element stack do not match, the  
open element stack is searched until the first non-inline element. If  
a matching start for the endTag is found before or at the first non- 
inline element, the stack is popped and the popped item echoed  
forward in the pipeline and pushed onto the residual style stack  
until the matching start is found (at which point the element is  
close as above). If the matching start is not found before or at the  
first non-inline element on the stack, the endTag event is discarded.

Whenever items are pushed onto the residual style stack, it is  
considered an easy parse error.

Perhaps this model is a simple enough model to be deterministically  
specified but still good enough an approximation of Gecko's and  
WebCore's behavior. All decisions are local to the parse event being  
observed and do not involve reshuffling the parts of the DOM that  
have already been built.

* http://hsivonen.iki.fi/validator-about/htmlparser.jar (with source)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 25 January 2006 08:21:55 UTC