- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 11 Feb 2010 02:40:28 +0000 (UTC)
On Thu, 29 Oct 2009, Matt Hall wrote:
>
> Prior to r4177, the matching of tag names for exiting the RCDATA/RAWTEXT
> states was done as follows:
>
>   "...and the next few characters do not match the tag name of the last
>   start tag token emitted (compared in an ASCII case-insensitive manner)"
>
> However, the current revision doesn't include any comment on character
> casing in its discussion of "Appropriate End Tags." Similarly, certain
> tokenizer states require that you check the contents of the "temporary
> buffer" against the string "script" but there is no indication of
> whether or not to do this in a case-insensitive manner.
>
> In both cases, should this comparison be done in an ASCII
> case-insensitive manner or not? It might be helpful to clarify the spec
> in both places in either case.

On Thu, 29 Oct 2009, Geoffrey Sneddon wrote:
>
> It is already case-insensitive as you lowercase the characters when
> creating the token name and when adding them to the buffer.

Indeed.

On Fri, 30 Oct 2009, Matt Hall wrote:
>
> When the "script data" state was added to the tokenizer, the tree
> construction algorithm was updated to switch the tokenizer into this
> state upon finding a start tag named "script" while in the "in head"
> insertion mode (9.2.5.7). I see that a corresponding change was not made
> to 9.5 about "Parsing HTML Fragments" as it still says to switch into
> the RAWTEXT state upon finding a "script" tag. Does anyone know if this
> difference is intentional, or did someone just forget to update the
> fragment parsing case?

There's a comment now mentioning this explicitly. Is it ok?

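To illustrate Geoffrey's point about case: because ASCII uppercase letters
are lowercased as they are appended to the tag name and to the temporary
buffer, a plain string comparison afterwards is already ASCII
case-insensitive. The following is only a rough sketch of that idea, not
spec text; the function names are hypothetical and it models nothing else
about the tokenizer.

```python
import string

ASCII_UPPER = set(string.ascii_uppercase)


def append_lowercased(buffer: list[str], ch: str) -> None:
    """Append ch to a tag name or temporary buffer (hypothetical helper).

    Only ASCII uppercase letters are lowercased, by adding 0x20 to the
    code point; every other character is appended unchanged.
    """
    if ch in ASCII_UPPER:
        ch = chr(ord(ch) + 0x20)
    buffer.append(ch)


def build_name(raw: str) -> str:
    """Build a name the way the sketch assumes the tokenizer does."""
    buffer: list[str] = []
    for ch in raw:
        append_lowercased(buffer, ch)
    return "".join(buffer)


def is_appropriate_end_tag(last_start_tag_raw: str, end_tag_raw: str) -> bool:
    """Compare an end tag name with the last start tag name.

    Both names went through the same lowercasing on the way in, so an
    exact comparison is enough here.
    """
    return build_name(last_start_tag_raw) == build_name(end_tag_raw)
```

For example, is_appropriate_end_tag("script", "SCRIPT") is true in this
sketch, and build_name("ScRiPt") == "script" is the same check that would
be applied to the temporary buffer.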
On Tue, 10 Nov 2009, Kartikaya Gupta wrote:
>
> If you have a page like this:
>
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
>
> according to the HTML5 parser rules, I believe this should create a DOM
> with 3 font elements that looks something like this:
>
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
>
> </FONT></P></FONT></BODY></HTML>
>
> However, if you extend the original source with another font/p
> combination, like so:
>
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
>
> You end up with a DOM which has 6 font elements:
>
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
> <FONT face="Verdana" size="2">
> </FONT></FONT></P><P align="left"><FONT face="Verdana" size="2"><FONT size="2" face="Verdana">Some text
>
> </FONT></FONT></P></FONT></BODY></HTML>
>
> .. and so on. In general the number of font elements in the DOM grows
> polynomially, with the result that pages like [1] and [2] end up with
> hundreds of thousands of font elements. I haven't even been able to
> successfully parse [3] with either our own HTML5 parser or the one at
> validator.nu, it just gobbles up all available memory and asks for more.
>
> [1] http://www.miprepzone.com/past.asp?Category=%27news%27
> [2] http://info4.juridicas.unam.mx/ijure/tcfed/8.htm?s=
> [3] http://info4.juridicas.unam.mx/ijure/tcfed/1.htm?s=
>
> Is this behavior expected, or is it a bug in the spec?
> Obviously shipping browsers don't demonstrate this behavior (nor does
> Firefox's HTML5 parser - see bugzilla 525960) so I'm wondering if the
> spec could be modified to not have this polynomial-growth behavior.

I haven't checked if the exact behaviour you describe is what the spec
currently requires, but in general, there will always be cases where input
has a disproportionate result on output, because backwards-compatible
fixup is basically constrained to very few possibilities, all of which
have this behaviour in certain cases.

In practice, it's not a huge issue, because you have to cope with these
cases even just to handle regular valid documents -- consider for example
an infinite document whose body is just <font><font><font><font>... with
no close tags. There are a number of pages on the Web that approximate
this, for example:

   http://www.frikis.org/images/ascii/tux.html

On Tue, 24 Nov 2009, Daniel Glazman wrote:
>
> I think that insertAdjacentHTML as defined in current section 3.5.7 [1]
> could be much cleaner and clearer if
>
> 1 - "Adjacent" was dropped. It's useless. The name could be insertHTML.
>
> 2. if the values were "before", "firstchild", "lastchild", "after"
>    instead of the current "beforebegin", "afterbegin", "beforeend" and
>    "afterend" that seem to me visually related to start and end tags
>    and not the element itself. Consistency with the existing DOM
>    phraseology seems to me useful.
>
> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/apis-in-html-documents.html#insertadjacenthtml%28%29

On Tue, 24 Nov 2009, Anne van Kesteren wrote:
>
> The problem is that it is a legacy feature, much like innerHTML.

On Tue, 24 Nov 2009, Daniel Glazman wrote:
>
> That's not a problem. Make insertHTML with the new values and make
> insertAdjacentHTML with the old values just an alias to the new ones. Or
> the contrary. Or whatever.
> But it's not because it's shipped by MS that way that we must stick
> forever to such a horrible definition...

On Tue, 24 Nov 2009, Anne van Kesteren wrote:
>
> That is actually pretty much how we do it for every feature (consider
> e.g. XMLHttpRequest) because otherwise we have to duplicate too much
> functionality which only increases complexity. I.e. more tests, more
> APIs floating around, more documentation, backwards compatibility issues
> (new duplicate APIs won't work, but the ones they reflect do), etc.

On Wed, 25 Nov 2009, Dean Edwards wrote:
>
> Adding aliases does not reduce the horribleness of an API.

On Wed, 25 Nov 2009, Daniel Glazman wrote:
>
> Correct. But at least it eases a bit the pain and allows future
> deprecation.

We can basically never drop anything, so aliases in practice don't really
help.

I haven't added an alias here, as I don't see much advantage to doing so.
The proposed alternative names aren't much better.

On Wed, 2 Dec 2009, ATSUSHI TAKAYAMA wrote:
>
> This was posted by Akatsuki Kitamura on the W3C Japanese Interest Group
> Mailing List.
>
> [...quoting the syntax section...]
>
> As far as I understand, if I want to write a void element with no
> attribute, such as the br, I do steps 1 ("<" character) and 2 (tag
> name), then ignore 3 and 4. In the step 5, since I don't have any
> attributes, the "after the attribute" situation does not apply here, so
> I ignore it too. Then I close the tag by going through step 6 ("/"
> character) and step 7 (">" character).
>
> Akatsuki's question was that if you write space characters before
> closing the tag like the following, if they are still valid or not.
>
> <br >
> <br />
>
> I think the step 5 should be written as;
>
>   After the attributes, or after the element's tag name if there are no
>   attributes, then there may be one or more space characters.

Done.

On Tue, 9 Feb 2010, Biju wrote:
>
> What should a user agent display when html content is...
>
> <html><body>
> <%@ page language="java" %>
> </body></html>
>
> [...]

On Tue, 9 Feb 2010, Tab Atkins Jr. wrote:
>
> All of these cases appear to be an ASP or PHP page that is accidentally
> being sent as ordinary html. You shouldn't be seeing these tags at all
> in the source of the page unless a server is misconfigured.
>
> That said, given that you *are* seeing them, I'm not certain what the
> correct behavior is, but it's definitely strictly defined in HTML5. Can
> someone else with more familiarity with the parser algorithm help out
> here?

On Wed, 10 Feb 2010, Boris Zbarsky wrote:
>
> For the "<%@" case, it looks like the state machine will go through the
> following states:
>
>   Data state -> Tag open state
>
> When encountering a '%' in the "Tag open" state, the specification says:
>
>   Parse error. Emit a U+003C LESS-THAN SIGN character token
>   and reconsume the current input character in the data state.
>
> So the state will then remain "Data state" until the next '&' or '<' or
> EOF is seen, so the entire string up to the </body> will be treated as
> literal text.
>
> For the "<?" case, the state transitions will be:
>
>   Data state -> Tag open state -> Bogus comment state
>
> Then the specification says to:
>
>   Consume every character up to and including the first U+003E
>   GREATER-THAN SIGN character (>) or the end of the file (EOF),
>   whichever comes first. Emit a comment token whose data is the
>   concatenation of all the characters starting from and including
>   the character that caused the state machine to switch into the bogus
>   comment state, up to and including the character immediately before
>   the last consumed character (i.e. up to the character just before the
>   U+003E or EOF character). (If the comment was started by the end of
>   the file (EOF), the token is empty.)
>
>   Switch to the data state.
>
> Or in other words, stop the bogus comment at the first '>' you see and
> then start parsing normally again.
> In this case, that means treating everything up to the next '<' or '&'
> or EOF as literal text.
>
> So the currently-specified behavior in fact matches the observed Firefox
> behavior (with either parser) on these simple testcases.

Sounds right.

On Wed, 10 Feb 2010, Biju wrote:
>
> At least in one page I saw, which was Case 1, the page was originally
> from a JSP or ASP template, later modified and saved as a *.html

I recommend fixing the page. :-)

> So will IE and Safari (maybe Chrome also, I have not tested it) follow
> the Firefox way?

Hard to say. You'd have to ask Microsoft.

> Personally I prefer the IE way as I think one may be able to make a
> simple PHP or JSP editor just using the contentEditable feature.

Unfortunately the <%...%> stuff wouldn't round-trip correctly, since
there's no way to represent it in the DOM. So you couldn't really make a
PHP or JSP editor using contentEditable that way.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 10 February 2010 18:40:28 UTC
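
The two state walks Boris describes for "<%@" and "<?" can be sketched in a
few lines. This is a deliberately tiny illustration under stated
assumptions, not the spec's tokenizer: the tokenize() function and the
(kind, data) tuples are hypothetical, '&' is treated as literal text
(character references are ignored), and real tags, end tags and "<!"
constructs are out of scope and raise NotImplementedError.

```python
def tokenize(text: str) -> list[tuple[str, str]]:
    """Tokenize text covering only the data, tag open and bogus comment states."""
    tokens: list[tuple[str, str]] = []
    pending: list[str] = []  # character tokens collected in the data state

    def flush() -> None:
        # Emit any collected character data as a single token.
        if pending:
            tokens.append(("characters", "".join(pending)))
            pending.clear()

    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch != "<":
            # Data state: ordinary character.
            pending.append(ch)
            i += 1
            continue
        # Tag open state: look at the character after '<' ("" means EOF).
        nxt = text[i + 1] if i + 1 < n else ""
        if nxt == "?":
            # Bogus comment state: consume up to and including the first '>'
            # or EOF. The comment data starts at the '?' that caused the
            # switch and stops just before the '>' (or at EOF).
            end = text.find(">", i + 1)
            stop = n if end == -1 else end
            flush()
            tokens.append(("comment", text[i + 1:stop]))
            i = stop + 1
        elif (nxt.isascii() and nxt.isalpha()) or nxt in ("/", "!"):
            raise NotImplementedError("tags and '<!' are outside this sketch")
        else:
            # Parse error: emit '<' as a character token and reconsume the
            # current character in the data state.
            pending.append("<")
            i += 1
    flush()
    return tokens
```

With this sketch, tokenize('<%@ page language="java" %>') yields one
character token containing the whole string, while
tokenize('<?php echo 1; ?>after') yields a comment token with data
'?php echo 1; ?' followed by the character token 'after'.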