- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 8 Sep 2005 23:39:54 +0300
On Sep 8, 2005, at 19:03, Ian Hickson wrote:

> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>> On Sep 8, 2005, at 17:26, Ian Hickson wrote:
>>
>>> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>>>> * tagc omission i.e. <foo<bar>...</bar</foo>
>>>
>>> Well we have to define what that does, and the most obvious error
>>> handling behaviour here is to start the new tag. So effectively, I
>>> would say we should have TAGC omission.
>>
>> But it would still be an error as far as a conformance checker is
>> concerned, right?
>
> I don't have an opinion on that either way. I guess it seems
> reasonable to make it an error. At this point I'm more worried about
> getting the UA rules down before worrying about what the author can
> or can't do.

I view conformance checking as an authoring aid that is supposed to
help authors make pages that work. Therefore, if there is syntactic
sugar that is known to cause problems in real browsers, it would be
helpful if a conformance checker flagged it as an error. To help
conformance checker developers avoid having to endlessly defend their
subjective judgment against people who want to keep their errors but
argue them right (http://diveintomark.org/archives/2004/08/16/specs),
it would be nice if such bad syntactic sugar were proclaimed
non-conforming in the spec (even if unambiguous error handling were
defined). Tagc omission breaks in current Opera, which makes tagc
omission bad syntactic sugar from the practical point of view.

>> I think the HTML5 spec should allow TagSoup to be updated for HTML5
>> or an equivalent of TagSoup for HTML5 to be written. TagSoup
>> guarantees to the application that it acts as if it was an XML
>> parser parsing XHTML. Therefore, XML and, by extension, the SAX2 API
>> contract restrict the attribute names to legal XML attribute names.
>> If HTML5 required "/bar/" to be reported as an attribute name,
>> TagSoup would have to violate that constraint and could not claim
>> conformance.
>
> I think it's pretty much guaranteed that HTML5's parsing model will be
> able to generate DOMs that can't be serialised to conformant XML
> syntax without dataloss.

I am assuming that those situations do not arise if the document is
conforming, and that the loss of details dropped by XML c14n does not
count as data loss. It would be very nice if you defined conformance in
such a way that this assumption held true. :-)

> For example, the list of characters that must be recognised as part
> of an element or attribute name when hitting an unknown element or
> attribute is bigger than the list of characters XML allows.

For the purpose of conformance checking, I've gone the other way and
limited names to ASCII. I think that's OK, because conforming names are
ASCII. However, I expect that I will have to polish the code that looks
for unquoted attribute values. (But I think conforming unquoted
attribute values should not include values that weren't SGML-valid in
HTML 4.)

> Similarly, a comment in HTML can contain the string "--" (assuming it
> comes in pairs), while an XML comment cannot. This latter example
> even affects conforming documents.

From the HTML-as-SGML point of view, there are two comments in
<!-- foo -- -- bar -- >, so it would be quite appropriate to convert it
into XML as <!-- foo --><!-- bar -->. This reasoning does not quite
work for faithfully converting HTML-as-soup.

I am dodging this issue by parsing comments syntactically as if the
document were HTML-as-SGML and not reporting comment parse events at
all. Reporting comments to the app is optional in XML, and Jing
wouldn't want to listen to comment parse events anyway. (In fact, I
think there'd be an architectural bug if it wanted to.)

FWIW, Opera, Deer Park and Safari all represent this case differently
in the DOM. Opera includes the "--" after "bar" in the value. Deer Park
does not. Safari does not include comments in the DOM at all.
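
The HTML-as-SGML reading of comments described above can be sketched
roughly as follows. This is only an illustration, not code from any
existing checker: the function names split_sgml_comments and
to_xml_comments are hypothetical, and the sketch assumes a comment
declaration has the shape "<!", then zero or more "--text--" comments
separated by optional whitespace, then ">".

```python
import re

# One SGML comment: "--", the shortest possible text, then "--".
_COMMENT = re.compile(r"--(.*?)--", re.DOTALL)


def split_sgml_comments(decl: str) -> list[str]:
    """Return the text of each comment in an SGML comment declaration.

    Raises ValueError if anything other than whitespace appears
    between the individual "--...--" comments.
    """
    if not (decl.startswith("<!") and decl.endswith(">")):
        raise ValueError("not a markup declaration")
    body = decl[2:-1]  # strip the leading "<!" and the trailing ">"
    comments = []
    pos = 0
    while pos < len(body):
        if body[pos].isspace():
            pos += 1  # whitespace is allowed between comments
            continue
        match = _COMMENT.match(body, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos + 2}")
        comments.append(match.group(1))
        pos = match.end()
    return comments


def to_xml_comments(decl: str) -> str:
    """Rewrite one SGML comment declaration as a run of XML comments."""
    return "".join(f"<!--{text}-->" for text in split_sgml_comments(decl))


print(to_xml_comments("<!-- foo -- -- bar -- >"))
```

Under these assumptions, the example from the message yields two XML
comments, while input such as <!-- foo -- bar --> is rejected because
"bar" falls outside any comment.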
>>>> * attribute name omission (except for the well-known "boolean
>>>> attributes")
>>>
>>> Again, we have to define error handling. <foo bar baz> will
>>> probably just be equivalent to <foo bar="" baz="">.
>>
>> I have previously argued for <foo bar="bar" baz="baz"> in the
>> TagSoup-like scenario, because that would be the same as the
>> treatment required for the "boolean attributes".
>
> That wouldn't be backwards compatible, IIRC.

OK. I intend to just throw an error on non-boolean minimized
attributes.

> I've been looking at misnested tags recently (hence my replying to
> this e-mail despite normally archiving the e-mails about HTML parsing
> so that I can get back to them when I start work on that part of the
> spec). I assume, based on the line of reasoning that you've been
> describing above, that you would agree with me that we should forego
> compatibility with IE in the DOM it forms in response to markup such
> as:
>
>    <body> <form> <div> </form> TEXT NODE </div> </body>
>
> What IE does in this case is make the TEXT NODE's parent be the <div>
> and its previous sibling be the <form>.
>
> What browsers do tends to vary; but with markup such as the above
> Firefox and Safari interoperate on saying that the </form> is ignored
> and the form instead continues up to the </body>. However, the exact
> opposite:
>
>    <body> <div> <form> </div> TEXT NODE </form> </body>
>
> ...does not do the opposite in those browsers, despite (in IE) the
> DOM being equivalent to the previous case. Here, the </div> is not
> ignored, it implies the </form> and the TEXT NODE ends up a child of
> <body>.

I think it is reasonable to force the DOM into a tree, which
necessarily means not doing what IE does in some cases.

Also, I think a conformance checker should only have to observe the top
of the open element stack when deciding what to do with an end tag.
That is, popping due to a non-matching end tag would always be
opportunistic (possibly leading to an error if a matching start tag is
not found). However, I assume there may be non-conforming cases where
browsers would want to peek deeper in the stack before deciding whether
to discard a misnested end tag or pop until the start tag is found
(i.e. only pop if the start tag was actually found when peeking deeper
in the stack). Additional testing and/or reading of source would be
needed to determine whether such deep peeking is happening here or
whether popping the 'form' on </div> is opportunistic. (But </form>
apparently causes neither deep peeking nor opportunistic popping.)

> Trying to work out all the various cases is giving me a headache...

Then I hope you sympathize with my selfish desire to get conformance
checkers exempt from error recovery (i.e. allowing them to stop upon
finding an error).

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
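
The two end-tag policies contrasted in the message, opportunistic
popping and peeking deeper before popping, might be sketched as below.
This is one possible reading of the description, not the behaviour of
any particular browser or checker; the function names and error
strings are hypothetical, and the open element stack is modelled as a
plain list of element names with the innermost element last.

```python
def close_opportunistic(stack: list[str], name: str) -> list[str]:
    """Pop until <name> is found, without checking first that it is open.

    Returns the list of errors noticed along the way. If no matching
    start tag exists, the whole stack has been popped by the time the
    problem is discovered.
    """
    errors = []
    while stack:
        top = stack.pop()
        if top == name:
            return errors
        errors.append(f"<{top}> closed early by </{name}>")
    errors.append(f"no open <{name}> for </{name}>")
    return errors


def close_after_peek(stack: list[str], name: str) -> list[str]:
    """Look through the whole stack first; pop only if <name> is open.

    A stray end tag is discarded and the stack is left untouched.
    """
    if name not in stack:
        return [f"</{name}> discarded: no open <{name}>"]
    errors = []
    while (top := stack.pop()) != name:
        errors.append(f"<{top}> closed early by </{name}>")
    return errors


# <body> <div> <form> </div> ... : both policies close the form too.
open_elements = ["body", "div", "form"]
print(close_opportunistic(open_elements, "div"), open_elements)
```

The two functions differ only when the end tag has no open start tag:
the opportunistic version has already emptied the stack by then, while
the peeking version leaves it intact. Neither reproduces the reported
Firefox and Safari handling of </form>, which the message notes does
not fit either policy.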
Received on Thursday, 8 September 2005 13:39:54 UTC