- From: Michael[tm] Smith <mike@w3.org>
- Date: Fri, 4 Oct 2013 15:31:15 +0900
- To: Simon Pieters <simonp@opera.com>
- Cc: Henri Sivonen <hsivonen@iki.fi>, "www-archive@w3.org" <www-archive@w3.org>
Hi Simon,

I'd be game for taking a shot at implementing this as an additional parser mode, if Henri thinks it's a good idea.

--Mike

Simon Pieters <simonp@opera.com>, 2013-10-03 16:25 +0200:

> Some rough notes on an alternative error recovery strategy in the HTML
> parser for validators that is streamable and non-fatal and hopefully enables
> more useful error messages.
>
> Needs more tweaking around frameset if checking frameset documents is
> desired.
>
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inhead
>
> An end tag whose tag name is "head"
> - Pop the current node (which will be the head element) off the stack of
> - open elements.
>
> Anything else
> - Pop the current node (which will be the head element) off the stack of
> - open elements.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-head-insertion-mode
>
> A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
> or U+0020 SPACE
> - Insert the character.
> + Ignore the token.
>
> A comment token
> - Insert a comment.
> + Ignore the token.
>
> A start tag whose tag name is "html"
> - Process the token using the rules for the "in body" insertion mode.
> + Ignore the token.
>
> A start tag whose tag name is "body"
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
>
> A start tag whose tag name is "frameset"
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
>
> A start tag whose tag name is one of: "base", "basefont", "bgsound",
> "link", "meta", "noframes", "script", "style", "template", "title"
> Parse error.
> - Push the node pointed to by the head element pointer onto the stack of
> - open elements.
> Process the token using the rules for the "in head" insertion mode.
> - Remove the node pointed to by the head element pointer from the stack
> - of open elements. (It might not be the current node at this point.)
>
> Anything else
> + Pop the current node (which will be the head element) off the stack of
> + open elements.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reconstruct-the-active-formatting-elements
>
> - 1. If there are no entries in the list of active formatting elements,
> - then there is nothing to reconstruct; stop this algorithm.
> + 1. Stop this algorithm.
>
> (This isn't necessary for streaming, but is nice for not flooding errors
> about a typoed formatting end tag.)
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inbody
>
> A start tag whose tag name is "html"
> Parse error.
> - If there is a template element on the stack of open elements, then
> - ignore the token.
> - Otherwise, for each attribute on the token, check to see if the
> - attribute is already present on the top element of the stack of open
> - elements. If it is not, add the attribute and its corresponding value
> - to that element.
> + Ignore the token.
>
> A start tag whose tag name is "body"
> Parse error.
> If the second element on the stack of open elements is not a body
> element, if the stack of open elements has only one node on it, or if
> there is a template element on the stack of open elements, then ignore
> the token. (fragment case)
> - Otherwise, set the frameset-ok flag to "not ok"; then, for each
> - attribute on the token, check to see if the attribute is already
> - present on the body element (the second element) on the stack of open
> - elements, and if it is not, add the attribute and its corresponding
> - value to that element.
> + Ignore the token.
>
> A start tag whose tag name is "frameset"
> Parse error.
> - If the stack of open elements has only one node on it, or if the
> - second element on the stack of open elements is not a body element,
> - then ignore the token. (fragment case)
> - If the frameset-ok flag is set to "not ok", ignore the token.
> - Otherwise, run the following steps:
> - Remove the second element on the stack of open elements from its
> - parent node, if it has one.
> - Pop all the nodes from the bottom of the stack of open elements, from
> - the current node up to, but not including, the root html element.
> - Insert an HTML element for the token.
> - Switch the insertion mode to "in frameset".
> + Ignore the token.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#adoption-agency-algorithm
>
> - 2. Let outer loop counter be zero.
> + 2. Stop this algorithm.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#foster-parent
>
> - 7. Let adjusted insertion location be inside previous element, after
> - its last child (if any).
> + 7. Let adjusted insertion location be inside target, after its last
> + child (if any).
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-afterbody
>
> A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or
> U+0020 SPACE
> - Process the token using the rules for the "in body" insertion mode.
> + Ignore the token.
>
> A comment token
> - Insert a comment as the last child of the first element in the stack
> - of open elements (the html element).
> + Ignore the token.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-after-body-insertion-mode
>
> A comment token
> - Insert a comment as the last child of the Document object.
> + Ignore the token.
>
> A DOCTYPE token
> - A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> - LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
> - or U+0020 SPACE
> A start tag whose tag name is "html"
> Process the token using the rules for the "in body" insertion mode.
>
> + A character token that is one of U+0009 CHARACTER TABULATION, U+000A
> + LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
> + or U+0020 SPACE
> + Ignore the token.

-- 
Michael[tm] Smith http://people.w3.org/mike
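
To make the "after head" part of the notes above concrete, here is a rough Python sketch of how that insertion mode might look with the proposed changes applied. It is only an illustration under several assumptions: the Token and Parser classes, the after_head method and the returned labels are all hypothetical names invented for this sketch, the tree itself is not built (only the stack of open elements is tracked, as a list of tag names), and the rules the notes leave unchanged (for example stray end tags) are omitted.

```python
from dataclasses import dataclass, field

WHITESPACE = {"\t", "\n", "\f", "\r", " "}
HEAD_CONTENT = {"base", "basefont", "bgsound", "link", "meta",
                "noframes", "script", "style", "template", "title"}


@dataclass
class Token:
    kind: str       # "char", "comment" or "start" (hypothetical token model)
    data: str = ""  # the character, the comment text or the tag name


@dataclass
class Parser:
    # Under the proposal the "in head" mode no longer pops head, so head is
    # still the current node when "after head" starts.
    stack: list = field(default_factory=lambda: ["html", "head"])
    mode: str = "after head"
    errors: list = field(default_factory=list)

    def after_head(self, token):
        """Apply the proposed "after head" rules; return a label for the action."""
        if token.kind == "char" and token.data in WHITESPACE:
            return "ignored"
        if token.kind == "comment":
            return "ignored"
        if token.kind == "start" and token.data == "html":
            return "ignored"
        if token.kind == "start" and token.data in ("body", "frameset"):
            self.stack.pop()               # proposed: pop head first
            self.stack.append(token.data)  # then insert the element as before
            self.mode = "in body" if token.data == "body" else "in frameset"
            return "inserted"
        if token.kind == "start" and token.data in HEAD_CONTENT:
            self.errors.append("parse error: <%s> after </head>" % token.data)
            # head is still on the stack, so no push/remove of the head
            # element pointer is needed; the "in head" rules would run here.
            return "in head"
        # Anything else (unchanged rules are omitted in this sketch, so every
        # other token falls through here): pop head, imply <body>, and let
        # the caller reprocess the token in "in body".
        self.stack.pop()
        self.stack.append("body")
        self.mode = "in body"
        return "reprocess"


p = Parser()
p.after_head(Token("char", " "))     # ignored, stack stays ["html", "head"]
p.after_head(Token("start", "meta")) # parse error, handled as head content
p.after_head(Token("start", "body")) # head popped, body pushed
```

Because head is never pushed back onto the stack after having been popped, a late meta or link can be reported and handled in one pass, which seems to be what makes this variant streamable.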
Received on Friday, 4 October 2013 06:31:34 UTC