Re: Streamable non-fatal non-conforming HTML parser error recovery strategy

On Fri, 04 Oct 2013 08:31:15 +0200, Michael[tm] Smith <mike@w3.org> wrote:

> Hi Simon,
>
> I'd be game for taking a shot at implementing this as an additional parser
> mode, if Henri thinks it's a good idea.

Cool.

I noticed that some of the changes below are not necessary.

It seems that v.nu in streaming mode already violates the spec when it  
comes to comments after </body>.

http://qa-dev.w3.org:8888/parsetree/?parser=html5&content=<%21doctype+html><body><%2Fbody><%21---->+&submit=Print+Tree

We can take the same approach with a comment after </head> -- just insert
it in head.


>> Needs more tweaking around frameset if checking frameset documents is
>> desired.

Looks like the missing piece is just handling a comment in the "after
after frameset" insertion mode.

>> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-head-insertion-mode
>>
>>   A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>>   LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
>>   or U+0020 SPACE
>> - Insert the character.
>> + Ignore the token.

Revert this (the character would be inserted into head, which is fine).

>>   A comment token
>> - Insert a comment.
>> + Ignore the token.

Revert this (the comment would be inserted into head, which is fine).

>>   A start tag whose tag name is "html"
>> - Process the token using the rules for the "in body" insertion mode.
>> + Ignore the token.

Revert this (in body would also ignore the token).

>>   A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>>   LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or
>>   U+0020 SPACE
>> - Process the token using the rules for the "in body" insertion mode.
>> + Ignore the token.

Revert this.

>>   A comment token
>> - Insert a comment as the last child of the first element in the stack
>> - of open elements (the html element).
>> + Ignore the token.

Instead:

+ Process the token using the rules for the "in body" insertion mode.

>> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-after-body-insertion-mode
>>
>>   A comment token
>> - Insert a comment as the last child of the Document object.
>> + Ignore the token.

Instead:

+ Process the token using the rules for the "in body" insertion mode.

>>   A DOCTYPE token
>> - A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>> - LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
>> - or U+0020 SPACE
>>   A start tag whose tag name is "html"
>>   Process the token using the rules for the "in body" insertion mode.
>>
>> + A character token that is one of U+0009 CHARACTER TABULATION, U+000A
>> + LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR),
>> + or U+0020 SPACE
>> + Ignore the token.

Revert this.


So, new version, doing the above and fixing frameset:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inhead

   An end tag whose tag name is "head"
- Pop the current node (which will be the head element) off the stack of
- open elements.

   Anything else
- Pop the current node (which will be the head element) off the stack of
- open elements.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-head-insertion-mode

   A start tag whose tag name is "body"
+ Pop the current node (which will be the head element) off the stack of
+ open elements.

   A start tag whose tag name is "frameset"
+ Pop the current node (which will be the head element) off the stack of
+ open elements.

   A start tag whose tag name is one of: "base", "basefont", "bgsound",
   "link", "meta", "noframes", "script", "style", "template", "title"
   Parse error.
- Push the node pointed to by the head element pointer onto the stack of
- open elements.
   Process the token using the rules for the "in head" insertion mode.
- Remove the node pointed to by the head element pointer from the stack
- of open elements. (It might not be the current node at this point.)

   Anything else
+ Pop the current node (which will be the head element) off the stack of
+ open elements.
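
To illustrate the "in head" / "after head" changes above, here is a minimal
sketch in Python. It is not v.nu code: the function name, the token tuples
and the event tuples are all made up for the example, and only the handful
of tokens discussed here are modelled ("Anything else" is left out).

```python
# Hypothetical sketch, not v.nu code. Tokens are (kind, data) pairs and
# events are (kind, data, parent) triples emitted in document order.

def stream_events(tokens):
    """Model only the proposed "in head" / "after head" handling."""
    events = []
    stack = ["html", "head"]  # assume <html><head> were already opened
    mode = "in head"
    for kind, data in tokens:
        if mode == "in head":
            if kind == "end" and data == "head":
                # Proposed change: do not pop head here, only switch modes.
                mode = "after head"
        elif mode == "after head":
            if kind in ("space", "comment"):
                # head is still the current node, so this lands in head.
                events.append((kind, data, stack[-1]))
            elif kind == "start" and data in ("body", "frameset"):
                # Proposed change: pop head only now, then open the element.
                closed = stack.pop()
                events.append(("end", closed, stack[-1]))
                events.append(("start", data, stack[-1]))
                stack.append(data)
                mode = "in body" if data == "body" else "in frameset"
    return events
```

The point is that the end event for head is emitted only once body (or
frameset) starts, so a streaming consumer never has to append to an element
that was already closed.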

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reconstruct-the-active-formatting-elements

- 1. If there are no entries in the list of active formatting elements,
-    then there is nothing to reconstruct; stop this algorithm.
+ 1. Stop this algorithm.

(This isn't necessary for streaming, but it avoids a flood of errors
caused by a single mistyped formatting end tag.)

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inbody

   A start tag whose tag name is "html"
   Parse error.
- If there is a template element on the stack of open elements, then
- ignore the token.
- Otherwise, for each attribute on the token, check to see if the
- attribute is already present on the top element of the stack of open
- elements. If it is not, add the attribute and its corresponding value
- to that element.
+ Ignore the token.

   A start tag whose tag name is "body"
   Parse error.
   If the second element on the stack of open elements is not a body
   element, if the stack of open elements has only one node on it, or if
   there is a template element on the stack of open elements, then ignore
   the token. (fragment case)
- Otherwise, set the frameset-ok flag to "not ok"; then, for each
- attribute on the token, check to see if the attribute is already
- present on the body element (the second element) on the stack of open
- elements, and if it is not, add the attribute and its corresponding
- value to that element.
+ Ignore the token.

   A start tag whose tag name is "frameset"
   Parse error.
- If the stack of open elements has only one node on it, or if the
- second element on the stack of open elements is not a body element,
- then ignore the token. (fragment case)
- If the frameset-ok flag is set to "not ok", ignore the token.
- Otherwise, run the following steps:
- Remove the second element on the stack of open elements from its
- parent node, if it has one.
- Pop all the nodes from the bottom of the stack of open elements, from
- the current node up to, but not including, the root html element.
- Insert an HTML element for the token.
- Switch the insertion mode to "in frameset".
+ Ignore the token.
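
A rough sketch of why the "in body" changes above boil down to "Ignore the
token" when streaming: the start-tag event for html or body has already
been handed to the consumer, so there is nothing left to merge attributes
into. The function and the event shape below are assumptions made for the
example, not part of any real parser API.

```python
# Hypothetical sketch: "emitted" stands for events already sent downstream.

def in_body_start_tag(name, attrs, emitted, errors):
    """Handle a start tag in "in body" under the proposed streaming rules."""
    if name in ("html", "body", "frameset"):
        errors.append("parse error: unexpected <%s> start tag" % name)
        return  # ignore: earlier start-tag events cannot be amended
    emitted.append(("start", name, dict(attrs)))
```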

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#adoption-agency-algorithm

- 2. Let outer loop counter be zero.
+ 2. Stop this algorithm.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#foster-parent

- 7. Let adjusted insertion location be inside previous element, after
- its last child (if any).
+ 7. Let adjusted insertion location be inside target, after its last
+ child (if any).

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-afterbody

   A comment token
- Insert a comment as the last child of the first element in the stack
- of open elements (the html element).
+ Process the token using the rules for the "in body" insertion mode.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-after-body-insertion-mode

   A comment token
- Insert a comment as the last child of the Document object.
+ Process the token using the rules for the "in body" insertion mode.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#the-after-after-frameset-insertion-mode

   A comment token
- Insert a comment as the last child of the Document object.
+ Process the token using the rules for the "in body" insertion mode.
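
The effect of the three comment changes above can be sketched like this.
The function name and the "streaming" flag are invented for the example,
and the stack is just a list of element names with the current node last.

```python
# Hypothetical sketch: where a comment token ends up, per insertion mode.

def comment_parent(mode, stack, streaming):
    """Return the name of the node that receives a comment token."""
    if streaming:
        # Proposed: use the "in body" rules, i.e. insert at the current node.
        return stack[-1]
    if mode == "after body":
        # Spec: last child of the first element in the stack (html).
        return stack[0]
    if mode in ("after after body", "after after frameset"):
        # Spec: last child of the Document object.
        return "#document"
    raise ValueError("mode not modelled in this sketch: " + mode)
```

In the spec version the comment goes to a node whose content has already
been streamed out; in the proposed version it always goes to a node that
is still open.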

-- 
Simon Pieters
Opera Software

Received on Friday, 4 October 2013 11:14:48 UTC