Re: Streamable non-fatal non-conforming HTML parser error recovery strategy

Hi Simon,

I'd be game for taking a shot at implementing this as an additional parser
mode, if Henri thinks it's a good idea.

  --Mike

Simon Pieters <simonp@opera.com>, 2013-10-03 16:25 +0200:

> Some rough notes on an alternative error recovery strategy in the HTML
> parser for validators that is streamable and non-fatal and hopefully enables
> more useful error messages.
> 
> Needs more tweaking around frameset if checking frameset documents is
> desired.
> 
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inhead
> 
>   An end tag whose tag name is "head"
> - Pop the current node (which will be the head element) off the stack of
> - open elements.
> 
>   Anything else
> - Pop the current node (which will be the head element) off the stack of
> - open elements.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-head-insertion-mode
> 
>   A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>   LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
>   or U+0020 SPACE
> - Insert the character.
> + Ignore the token.
> 
>   A comment token
> - Insert a comment.
> + Ignore the token.
> 
>   A start tag whose tag name is "html"
> - Process the token using the rules for the "in body" insertion mode.
> + Ignore the token.
> 
>   A start tag whose tag name is "body"
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
> 
>   A start tag whose tag name is "frameset"
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
> 
>   A start tag whose tag name is one of: "base", "basefont", "bgsound",
>   "link", "meta", "noframes", "script", "style", "template", "title"
>   Parse error.
> - Push the node pointed to by the head element pointer onto the stack of
> - open elements.
>   Process the token using the rules for the "in head" insertion mode.
> - Remove the node pointed to by the head element pointer from the stack
> - of open elements. (It might not be the current node at this point.)
> 
>   Anything else
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reconstruct-the-active-formatting-elements
> 
> - 1. If there are no entries in the list of active formatting elements,
> -    then there is nothing to reconstruct; stop this algorithm.
> + 1. Stop this algorithm.
> 
> (This isn't necessary for streaming, but it avoids a flood of errors
> caused by a single mistyped formatting end tag.)
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inbody
> 
>   A start tag whose tag name is "html"
>   Parse error.
> - If there is a template element on the stack of open elements, then
> - ignore the token.
> - Otherwise, for each attribute on the token, check to see if the
> - attribute is already present on the top element of the stack of open
> - elements. If it is not, add the attribute and its corresponding value
> - to that element.
> + Ignore the token.
> 
>   A start tag whose tag name is "body"
>   Parse error.
>   If the second element on the stack of open elements is not a body
>   element, if the stack of open elements has only one node on it, or if
>   there is a template element on the stack of open elements, then ignore
>   the token. (fragment case)
> - Otherwise, set the frameset-ok flag to "not ok"; then, for each
> - attribute on the token, check to see if the attribute is already
> - present on the body element (the second element) on the stack of open
> - elements, and if it is not, add the attribute and its corresponding
> - value to that element.
> + Ignore the token.
> 
>   A start tag whose tag name is "frameset"
>   Parse error.
> - If the stack of open elements has only one node on it, or if the
> - second element on the stack of open elements is not a body element,
> - then ignore the token. (fragment case)
> - If the frameset-ok flag is set to "not ok", ignore the token.
> - Otherwise, run the following steps:
> - Remove the second element on the stack of open elements from its
> - parent node, if it has one.
> - Pop all the nodes from the bottom of the stack of open elements, from
> - the current node up to, but not including, the root html element.
> - Insert an HTML element for the token.
> - Switch the insertion mode to "in frameset".
> + Ignore the token.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#adoption-agency-algorithm
> 
> - 2. Let outer loop counter be zero.
> + 2. Stop this algorithm.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#foster-parent
> 
> - 7. Let adjusted insertion location be inside previous element, after
> - its last child (if any).
> + 7. Let adjusted insertion location be inside target, after its last
> + child (if any).
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-afterbody
> 
>   A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>   LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or
>   U+0020 SPACE
> - Process the token using the rules for the "in body" insertion mode.
> + Ignore the token.
> 
>   A comment token
> - Insert a comment as the last child of the first element in the stack
> - of open elements (the html element).
> + Ignore the token.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-after-body-insertion-mode
> 
>   A comment token
> - Insert a comment as the last child of the Document object.
> + Ignore the token.
> 
>   A DOCTYPE token
> - A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> - LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
> - or U+0020 SPACE
>   A start tag whose tag name is "html"
>   Process the token using the rules for the "in body" insertion mode.
> 
> + A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> + LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
> + or U+0020 SPACE
> + Ignore the token.
> 
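
For illustration only, here is a minimal Python sketch of how the modified
"after head" rules in the notes above might look in a tree builder. The
function name after_head, the (kind, name) token tuples, and the "inserted"
list standing in for a DOM are all hypothetical, not taken from any existing
parser. Cases the notes leave untouched are omitted, and the "in head"
processing is reduced to recording the insertion.

```python
# Start tags that the notes route through the "in head" rules.
HEAD_CONTENT = frozenset({
    "base", "basefont", "bgsound", "link", "meta",
    "noframes", "script", "style", "template", "title",
})


def after_head(stack, token, inserted, errors):
    """Apply the modified "after head" rules to one token (sketch).

    stack    -- stack of open elements as names, current node last
    token    -- hypothetical (kind, name) pair; kind is "space",
                "comment", "start" or "end"
    inserted -- list collecting (parent, child) pairs, a stand-in for a DOM
    errors   -- list collecting parse error messages
    Returns the name of the next insertion mode.
    """
    kind, name = token

    # Whitespace, comments and a stray <html> are ignored, not inserted.
    if kind in ("space", "comment") or token == ("start", "html"):
        return "after head"

    # Late head content: head was never popped, so it is still the current
    # node and nothing has to be pushed back or removed afterwards.
    if kind == "start" and name in HEAD_CONTENT:
        errors.append("parse error: <%s> after </head>" % name)
        inserted.append((stack[-1], name))
        return "after head"

    # Cases not touched by the notes (stray end tags, a second <head>).
    if kind != "start" or name == "head":
        raise NotImplementedError("not covered by this sketch")

    # <body>, <frameset> and anything else: only now is head popped.
    popped = stack.pop()
    assert popped == "head"

    if name == "frameset":
        inserted.append((stack[-1], "frameset"))
        stack.append("frameset")
        return "in frameset"

    # <body>, or an implied body for any other start tag (reprocessing
    # the token in "in body" is left out here).
    inserted.append((stack[-1], "body"))
    stack.append("body")
    return "in body"
```

The point of the sketch is that nothing already closed ever has to be
reopened: head stays on the stack until the body actually starts, so a late
<meta> or <title> is still appended to the current node.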

-- 
Michael[tm] Smith http://people.w3.org/mike

Received on Friday, 4 October 2013 06:31:34 UTC