[whatwg] Parser-related feedback from Ian Hickson on 2010-02-11 (public-whatwg-archive@w3.org from February 2010)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 11 Feb 2010 02:40:28 +0000 (UTC)
Message-ID: <Pine.LNX.4.64.1002100719470.27124@ps20323.dreamhostps.com>
On Thu, 29 Oct 2009, Matt Hall wrote:
>
> Prior to r4177, the matching of tag names for exiting the RCDATA/RAWTEXT 
> states was done as follows:
>
> "...and the next few characters do not match the tag name of the last 
> start tag token emitted (compared in an ASCII case-insensitive manner)"
> 
> However, the current revision doesn't include any comment on character 
> casing in its discussion of "Appropriate End Tags."  Similarly, certain 
> tokenizer states require that you check the contents of the "temporary 
> buffer" against the string "script" but there is no indication of 
> whether or not to do this in a case-insensitive manner.
>
> In both cases, should this comparison be done in an ASCII 
> case-insensitive manner or not? It might be helpful to clarify the spec 
> in both places in either case.

On Thu, 29 Oct 2009, Geoffrey Sneddon wrote:
> 
> It is already case-insensitive as you lowercase the characters when 
> creating the token name and when adding them to the buffer.

Indeed.
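
As a rough illustration of why no separate case-insensitive comparison is 
needed, here is a minimal Python sketch. The function names are 
hypothetical and this is not the spec's algorithm text; it only assumes 
that characters are ASCII-lowercased as they are appended to the tag name 
or the temporary buffer, so a later exact comparison is enough.

```python
import string

# Map only A-Z to a-z and leave every other character untouched
# (ASCII lowercasing, unlike str.lower(), which also changes non-ASCII).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def append_lowercased(buffer: str, char: str) -> str:
    """Append one input character, ASCII-lowercased, to a name or buffer."""
    return buffer + char.translate(_ASCII_LOWER)


def build_name(raw: str) -> str:
    """Feed characters one at a time, as a tokenizer would."""
    name = ""
    for ch in raw:
        name = append_lowercased(name, ch)
    return name


def is_appropriate_end_tag(last_start_tag: str, end_tag_raw: str) -> bool:
    # Both names were lowercased on the way in, so an exact match suffices.
    return build_name(end_tag_raw) == last_start_tag


def buffer_is_script(raw_buffer_input: str) -> bool:
    # Same idea for the temporary buffer compared against "script".
    return build_name(raw_buffer_input) == "script"
```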


On Fri, 30 Oct 2009, Matt Hall wrote:
>
> When the "script data" state was added to the tokenizer, the tree 
> construction algorithm was updated to switch the tokenizer into this 
> state upon finding a start tag named "script" while in the "in head" 
> insertion mode (9.2.5.7). I see that a corresponding change was not made 
> to 9.5 about "Parsing HTML Fragments" as it still says to switch into 
> the RAWTEXT state upon finding a "script" tag. Does anyone know if this 
> difference is intentional, or did someone just forget to update the 
> fragment parsing case?

There's a comment now mentioning this explicitly. Is it ok?


On Tue, 10 Nov 2009, Kartikaya Gupta wrote:
>
> If you have a page like this:
> 
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
> 
> according to the HTML5 parser rules, I believe this should create a DOM with 3 font elements that looks something like this:
> 
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
> 
> </FONT></P></FONT></BODY></HTML>
> 
> However, if you extend the original source with another font/p combination, like so:
> 
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
> 
> You end up with a DOM which has 6 font elements:
> 
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
> <FONT face="Verdana" size="2">
> </FONT></FONT></P><P align="left"><FONT face="Verdana" size="2"><FONT size="2" face="Verdana">Some text
> 
> </FONT></FONT></P></FONT></BODY></HTML>
> 
> .. and so on. In general the number of font elements in the DOM grows 
> polynomially, with the result that pages like [1] and [2] end up with 
> hundreds of thousands of font elements. I haven't even been able to 
> successfully parse [3] with either our own HTML5 parser or the one at 
> validator.nu, it just gobbles up all available memory and asks for more.
> 
> [1] http://www.miprepzone.com/past.asp?Category=%27news%27
> [2] http://info4.juridicas.unam.mx/ijure/tcfed/8.htm?s=
> [3] http://info4.juridicas.unam.mx/ijure/tcfed/1.htm?s=
> 
> Is this behavior expected, or is it a bug in the spec? Obviously 
> shipping browsers don't demonstrate this behavior (nor does Firefox's 
> HTML5 parser - see bugzilla 525960) so I'm wondering if the spec could 
> be modified to not have this polynomial-growth behavior.

I haven't checked if the exact behaviour you describe is what the spec 
currently requires, but in general, there will always be cases where input 
has a disproportionate effect on output, because backwards-compatible fixup 
is basically constrained to very few possibilities, all of which have this 
behaviour in certain cases.

In practice, it's not a huge issue, because you have to cope with these 
cases even just to handle regular valid documents -- consider for example 
an infinite document whose body is just <font><font><font><font>... with 
no close tags. There are a number of pages on the Web that approximate 
this, for example:

   http://www.frikis.org/images/ascii/tux.html
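
The growth Kartikaya describes can be approximated with a small counting 
model. This is a simplified sketch, not the parser algorithm: it assumes 
that each <p> implicitly closes the previous <p>, that any <font> opened 
inside that <p> stays remembered as a formatting element, and that the 
next text recreates every remembered <font>. The function name is 
hypothetical.

```python
def count_font_elements(pairs: int) -> int:
    """Count <font> elements created for `pairs` repetitions of
    '<font ...><p ...>Some text' under the simplified model above."""
    created = 0
    remembered = 0  # fonts that will be recreated inside each new <p>
    inside_p = False
    for _ in range(pairs):
        # <font> start tag: always creates one new element.
        created += 1
        if inside_p:
            # Opened inside a <p>, so the next <p> pops it off the stack
            # while it stays remembered for reconstruction.
            remembered += 1
        # <p> start tag: closes the current <p> (if any) and opens a new one.
        inside_p = True
        # Text: every remembered font is recreated inside the new <p>.
        created += remembered
    return created


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100):
        print(n, count_font_elements(n))
```

Under these assumptions the count is n(n+1)/2 for n font/p pairs, which 
agrees with the 3 and 6 elements in the two examples quoted above and 
grows quadratically with the length of the input.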


On Tue, 24 Nov 2009, Daniel Glazman wrote:
> 
> I think that insertAdjacentHTML as defined in current section 3.5.7 [1] 
> could be much cleaner and clearer if
> 
> 1 - "Adjacent" was dropped. It's useless. The name could be insertHTML.
> 
> 2. if the values were "before", "firstchild", "lastchild", "after"
>    instead of the current "beforebegin", "afterbegin", "beforeend" and
>    "afterend" that seem to me visually related to start and end tags
>    and not the element itself. Consistency with the existing DOM
>    phraseology seems to me useful.
> 
> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/apis-in-html-documents.html#insertadjacenthtml%28%29

On Tue, 24 Nov 2009, Anne van Kesteren wrote:
> 
> The problem is that it is a legacy feature, much like innerHTML.

On Tue, 24 Nov 2009, Daniel Glazman wrote:
> 
> That's not a problem. Make insertHTML with the new values and make 
> insertAdjacentHTML with the old values just an alias to the new ones. Or 
> the contrary. Or whatever. But it's not because it's shipped by MS that 
> way that we must stick forever to such a horrible definition...

On Tue, 24 Nov 2009, Anne van Kesteren wrote:
> 
> That is actually pretty much how we do it for every feature (consider 
> e.g. XMLHttpRequest) because otherwise we have to duplicate too much 
> functionality which only increases complexity. I.e. more tests, more 
> APIs floating around, more documentation, backwards compatibility issues 
> (new duplicate APIs won't work, but the ones they reflect do), etc.

On Wed, 25 Nov 2009, Dean Edwards wrote:
> 
> Adding aliases does not reduce the horribleness of an API.

On Wed, 25 Nov 2009, Daniel Glazman wrote:
> 
> Correct. But at least it eases the pain a bit and allows future 
> deprecation.

We can basically never drop anything, so aliases in practice don't really 
help. I haven't added an alias here, as I don't see much advantage to 
doing so. The proposed alternative names aren't much better.
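
For readers unfamiliar with the four existing position values, here is a 
toy Python model of where each one inserts relative to the target 
element. It is only a sketch: the Node class and insert_adjacent function 
are hypothetical, and it inserts an already-built node, whereas the real 
insertAdjacentHTML() parses a string of markup first. The comments note 
the names proposed earlier in the thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison, so list.index() finds the exact node
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def append(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def insert_adjacent(target: Node, position: str, new: Node) -> None:
    """Insert `new` relative to `target` according to `position`."""
    if position in ("beforebegin", "afterend"):
        # Sibling positions (proposed names: "before" and "after").
        parent = target.parent
        if parent is None:
            raise ValueError("target has no parent")
        index = parent.children.index(target)
        if position == "afterend":
            index += 1
        parent.children.insert(index, new)
        new.parent = parent
    elif position == "afterbegin":
        # First child of the target (proposed name: "firstchild").
        target.children.insert(0, new)
        new.parent = target
    elif position == "beforeend":
        # Last child of the target (proposed name: "lastchild").
        target.children.append(new)
        new.parent = target
    else:
        raise ValueError(f"unknown position: {position!r}")
```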


On Wed, 2 Dec 2009, ATSUSHI TAKAYAMA wrote:
> 
> This was posted by Akatsuki Kitamura on the W3C Japanese Interest Group 
> Mailing List.
> 
> [...quoting the syntax section...]
> 
> As far as I understand, if I want to write a void element with no 
> attribute, such as the br, I do steps 1 ("<" character) and 2 (tag 
> name), then ignore 3 and 4. In the step 5, since I don't have any 
> attributes, the "after the attribute" situation does not apply here, so 
> I ignore it too. Then I close the tag by going through step 6 ("/" 
> character) and step 7 (">" character).
> 
> Akatsuki's question was whether tags are still valid if you write space
> characters before closing them, as in the following:
> 
> <br >
> <br />
> 
> I think step 5 should be written as:
> 
> After the attributes, or after the element's tag name if there are no
> attributes, then there may be one or more space characters.

Done.
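
The effect of the reworded step 5 on attribute-less tags can be sketched 
as a regular expression. This is a simplified illustration only: it 
ignores attributes entirely, the function name is hypothetical, and the 
set of void elements below is a partial list chosen for the example.

```python
import re

# Partial list of void elements, enough for the example.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# "<", tag name, optional space characters (space, tab, LF, FF, CR),
# optional "/", then ">".
_START_TAG_NO_ATTRS = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[ \t\n\f\r]*/?>")


def is_attributeless_void_start_tag(text: str) -> bool:
    """Return True if `text` is a void-element start tag with no attributes."""
    match = _START_TAG_NO_ATTRS.fullmatch(text)
    return match is not None and match.group(1).lower() in VOID_ELEMENTS


if __name__ == "__main__":
    for candidate in ("<br>", "<br >", "<br />", "<br/>", "< br>", "<br / >"):
        print(repr(candidate), is_attributeless_void_start_tag(candidate))
```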


On Tue, 9 Feb 2010, Biju wrote:
>
> What should a user agent display when html content is...
> 
> <html><body>
> <%@ page language="java" %>
> </body></html>
>
> [...]

On Tue, 9 Feb 2010, Tab Atkins Jr. wrote:
>
> All of these cases appear to be an ASP or PHP page that is accidentally 
> being sent as ordinary html.  You shouldn't be seeing these tags at all 
> in the source of the page unless a server is misconfigured.
> 
> That said, given that you *are* seeing them, I'm not certain what the 
> correct behavior is, but it's definitely strictly defined in HTML5. Can 
> someone else with more familiarity with the parser algorithm help out 
> here?

On Wed, 10 Feb 2010, Boris Zbarsky wrote:
> 
> For the "<%@" case, it looks like the state machine will go through the 
> following states:
> 
>   Data state -> Tag open state
> 
> When encountering a '%' in the "Tag open" state, the specification says:
> 
>     Parse error. Emit a U+003C LESS-THAN SIGN character token
>     and reconsume the current input character in the data state.
> 
> So the state will then remain "Data state" until the next '&' or '<' or EOF is
> seen, so the entire string up to the </body> will be treated as literal text.
> 
> For the "<?" case, the state transitions will be:
> 
>   Data state -> Tag open state -> Bogus comment state
> 
> Then the specification says to:
> 
>   Consume every character up to and including the first U+003E
>   GREATER-THAN SIGN character (>) or the end of the file (EOF),
>   whichever comes first. Emit a comment token whose data is the
>   concatenation of all the characters starting from and including
>   the character that caused the state machine to switch into the bogus
>   comment state, up to and including the character immediately before
>   the last consumed character (i.e. up to the character just before the
>   U+003E or EOF character). (If the comment was started by the end of
>   the file (EOF), the token is empty.)
> 
>   Switch to the data state.
> 
> Or in other words, stop the bogus comment at the first '>' you see and 
> then start parsing normally again.  In this case, that means treating 
> everything up to the next '<' or '&' or EOF as literal text.
> 
> So the currently-specified behavior in fact matches the observed Firefox 
> behavior (with either parser) on these simple testcases.

Sounds right.
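
The two paths Boris describes can be sketched in a few lines of Python. 
This is a deliberately tiny model, not a tokenizer: it covers only the 
data state, the "<" followed by "?" path into the bogus comment state, 
and the parse-error path for any other character after "<". Real tags, 
"<!" and character references are not handled, and the function and 
token names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, data); kind is "text" or "comment"


def tokenize(source: str) -> List[Token]:
    """Tokenize `source` using only the states discussed above."""
    tokens: List[Token] = []
    text = ""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch != "<":
            # Data state: ordinary character.
            text += ch
            i += 1
            continue
        # Tag open state: look at the character after "<".
        nxt = source[i + 1] if i + 1 < len(source) else ""
        if nxt == "?":
            # Bogus comment state: consume up to and including the first
            # ">" (or EOF). The comment data starts at the "?".
            if text:
                tokens.append(("text", text))
                text = ""
            end = source.find(">", i + 1)
            if end == -1:
                end = len(source)
            tokens.append(("comment", source[i + 1:end]))
            i = end + 1
        else:
            # Parse error: emit "<" as a character and reconsume the
            # next character in the data state.
            text += "<"
            i += 1
    if text:
        tokens.append(("text", text))
    return tokens


if __name__ == "__main__":
    print(tokenize('<%@ page language="java" %>'))
    print(tokenize('<?php if ($a > 1) echo "x"; ?>'))
```

In the second call the comment ends at the first ">", which here is the 
comparison operator inside the PHP code, so the rest of the script comes 
out as literal text.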


On Wed, 10 Feb 2010, Biju wrote:
> 
> I saw at least one page like Case 1; it was originally a JSP or ASP 
> template that was later modified and saved as a *.html file.

I recommend fixing the page. :-)


> So will IE and Safari (maybe Chrome also, I have not tested it) follow 
> the Firefox way?

Hard to say. You'd have to ask Microsoft.


> Personally I prefer the IE way as I think one may be able to make a simple 
> PHP or JSP editor just using the contentEditable feature.

Unfortunately the <%...%> stuff wouldn't round-trip correctly, since 
there's no way to represent it in the DOM. So you couldn't really make a 
PHP or JSP editor using contentEditable that way.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 10 February 2010 18:40:28 UTC