- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 25 Jan 2006 18:21:55 +0200
On Jan 25, 2006, at 12:09, Lachlan Hunt wrote:
> This is in response to Hixie's article [1].

I had had such a strong intuitive assumption of what Gecko and WebCore were doing that I was surprised to learn their behavior is indeed much hairier. (I hadn't even verified my assumption by checking the sources, because it seemed so obvious to me that Gecko & WebCore were doing what I thought they were doing...)

Anyway, here's what I thought they were doing:

The low-level parser is kind of like a tag-level lexer and emits a (non-well-formed) sequence of SAX-like events like startTag, characters, endTag and comment (in my parser*, HtmlParser.java).

These events don't go to the DOM builder / content sink directly. Instead, there's a filter layer that takes care of tag inference and emits a well-formed event stream (TagInferenceFilter.java and EmptyElementFilter.java in my parser).

Additionally, there's a filter (not present in my parser, which is designed for conformance checking; this may need to be integrated into the tag inference filter) that performs the "residual style" fixups. It works like this (assuming that there is no need for legitimate tag inference at the same time):

A stack is used for keeping track of the open elements. When a startTag is seen, the topmost element of the stack and the name of the new element are compared to a static table to see if the new element can occur as a child of the topmost element on the stack. If it can, the new element is pushed on the stack and echoed forward in the pipeline.

If the startTag was for an inline element, a second stack, the residual style stack, is inspected first. This also happens when characters are reported. If there are items in the residual style stack, the stack is popped and the popped element is echoed forward in the pipeline and pushed onto the open element stack, and this is repeated until the residual style stack is empty. The items on the stacks include not only element names but attributes as well.
When the residual style stack is empty, the inline content (startTag of an inline element or characters) from the lower layer is echoed forward in the pipeline (pushing the element on the open element stack if it was a startTag and not characters).

When an endTag is seen, if it matches the topmost item of the open element stack, the stack is popped and the endTag event (now actually an endElement event) is echoed forward in the pipeline. If, however, the endTag and the topmost item do not match, the open element stack is searched from the top down to the first non-inline element. If a matching start for the endTag is found before or at the first non-inline element, the stack is popped and the popped item echoed forward in the pipeline and pushed onto the residual style stack until the matching start is found (at which point the element is closed as above). If the matching start is not found before or at the first non-inline element on the stack, the endTag event is discarded.

Whenever items are pushed onto the residual style stack, it is considered an easy parse error.

Perhaps this model is simple enough to be specified deterministically but still a good enough approximation of Gecko's and WebCore's behavior. All decisions are local to the parse event being observed and do not involve reshuffling the parts of the DOM that have already been built.

* http://hsivonen.iki.fi/validator-about/htmlparser.jar (with source)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
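For illustration, the residual style filter described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not code from htmlparser.jar: the class name ResidualStyleFilter, its method names and the INLINE set are hypothetical, attributes are left out, the static child table is skipped (every startTag is assumed to be allowed), and the events are serialized into a string instead of being passed to a real downstream handler.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the residual style filter; not part of htmlparser.jar.
final class ResidualStyleFilter {
    // Assumed set of inline (style) element names, for illustration only.
    private static final Set<String> INLINE = new HashSet<>(
            Arrays.asList("a", "b", "i", "u", "em", "strong", "font", "span"));

    private final Deque<String> open = new ArrayDeque<>();      // open element stack
    private final Deque<String> residual = new ArrayDeque<>();  // residual style stack
    private final StringBuilder out = new StringBuilder();      // stand-in for the pipeline
    private int parseErrors = 0;

    void startTag(String name) {
        // Inline content first reopens whatever is waiting on the residual style stack.
        if (INLINE.contains(name)) {
            reopenResidual();
        }
        // Simplification: the static "can be a child of" table is not consulted here.
        open.push(name);
        out.append('<').append(name).append('>');
    }

    void characters(String text) {
        reopenResidual();
        out.append(text);
    }

    void endTag(String name) {
        if (!canClose(name)) {
            return; // no matching start before or at the first non-inline element: discard
        }
        // Close intervening inline elements and remember them as residual style.
        while (!open.peek().equals(name)) {
            String popped = open.pop();
            out.append("</").append(popped).append('>');
            residual.push(popped);
            parseErrors++; // each push onto the residual style stack counts as a parse error
        }
        open.pop();
        out.append("</").append(name).append('>');
    }

    // Searches from the top of the open element stack down to the first non-inline element.
    private boolean canClose(String name) {
        for (String candidate : open) {
            if (candidate.equals(name)) {
                return true;
            }
            if (!INLINE.contains(candidate)) {
                return false;
            }
        }
        return false;
    }

    // Pops the residual style stack until empty, reopening each element.
    private void reopenResidual() {
        while (!residual.isEmpty()) {
            String name = residual.pop();
            open.push(name);
            out.append('<').append(name).append('>');
        }
    }

    String output() {
        return out.toString();
    }

    int parseErrors() {
        return parseErrors;
    }
}
```

Feeding this sketch the events for <b><i>x</b>y</i> yields <b><i>x</i></b><i>y</i> with one parse error recorded, and a stray </b> after <p>x is simply dropped. Each decision looks only at the two stacks and the current event, which is the locality property mentioned above.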
Received on Wednesday, 25 January 2006 08:21:55 UTC