- From: Ian Hickson <ian@hixie.ch>
- Date: Sat, 11 Mar 2006 00:08:09 +0000 (UTC)

On Thu, 8 Sep 2005, Henri Sivonen wrote:
> > I think it's pretty much guaranteed that HTML5's parsing model will be
> > able to generate DOMs that can't be serialised to conformant XML
> > syntax without dataloss.
>
> I am assuming that those situations do not arise if the document is
> conforming and the loss of details that are lost in XML c14n does not
> count as data loss. It would be very nice if you defined conformance in
> such a way that this assumption held true. :-)

Yes, conformant documents will be such that a conformant HTML5 document
can always be serialised to a conforming XHTML5 document, I think. If that
ever turns out not to be the case, please raise the issue! I think this is
important because people use XML tools then serialise to HTML, and vice
versa (e.g. with CMSes that store data in custom formats).

> > For example, the list of characters that must be recognised as part of
> > an element or attribute name when hitting an unknown element or
> > attribute is bigger than the list of characters XML allows.
>
> For the purpose of conformance checking, I've gone the other way and
> limited names to ASCII. I think that's OK, because conforming names are
> ASCII. However, I expect that I will have to polish the code that looks
> for unquoted attribute values. (But I think conforming unquoted
> attribute values should not include values that weren't SGML-valid in
> HTML 4.)

As specced, unquoted values can contain pretty much anything.

> > Similarly, a comment in HTML can contain the string "--" (assuming it
> > comes in pairs), while an XML comment cannot. This latter example even
> > affects conforming documents.
>
> From the HTML-as-SGML point of view, there are two comments in
> <!-- foo -- -- bar -- >, so it would be quite appropriate to convert it
> into XML as <!-- foo --><!-- bar -->. This reasoning does not quite work
> for faithfully converting HTML-as-soup.

That's certainly one way to handle it.
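
The HTML-as-SGML reading of comments described above can be sketched in Python. This is an illustration only, not something taken from the spec or from Henri's checker: the function name `sgml_comments_to_xml` is made up, and the sketch assumes its input is a single comment declaration in which the "--" delimiters come in pairs and only whitespace separates one comment from the next.

```python
def sgml_comments_to_xml(declaration: str) -> str:
    """Rewrite one SGML comment declaration as a run of XML comments.

    Hypothetical helper: '<!-- foo -- -- bar -- >' holds two SGML
    comments, each delimited by a pair of '--' markers.
    """
    if not (declaration.startswith("<!") and declaration.endswith(">")):
        raise ValueError("not a comment declaration")

    # Text between '<!' and '>', cut at every '--' delimiter.
    parts = declaration[2:-1].split("--")

    # Paired delimiters always leave an odd number of pieces.
    if len(parts) % 2 == 0:
        raise ValueError("unpaired '--' delimiter")

    gaps = parts[0::2]    # text outside the comments
    bodies = parts[1::2]  # text inside each comment

    # The first comment must follow '<!' directly, and only whitespace
    # may appear between or after comments.
    if parts[0] != "" or any(gap.strip() for gap in gaps):
        raise ValueError("unexpected text outside comments")

    return "".join(f"<!--{body}-->" for body in bodies)


print(sgml_comments_to_xml("<!-- foo -- -- bar -- >"))
# prints: <!-- foo --><!-- bar -->
```

As noted above, this only covers the SGML view; it says nothing about how a tag-soup parser would treat the same bytes.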

> > I've been looking at misnested tags recently (hence my replying to
> > this e-mail despite normally archiving the e-mails about HTML parsing
> > so that I can get back to them when I start work on that part of the
> > spec). I assume, based on the line of reasoning that you've been
> > describing above, that you would agree with me that we should forego
> > compatibility with IE in the DOM it forms in response to markup such
> > as:
> >
> >    <body> <form> <div> </form> TEXT NODE </div> </body>
> >
> > What IE does in this case is make the TEXT NODE's parent be the <div>
> > and its previous sibling be the <form>.
> >
> > What browsers do tends to vary; but with markup such as the above
> > Firefox and Safari interoperate on saying that the </form> is ignored
> > and the form instead continues up to the </body>. However, the exact
> > opposite:
> >
> >    <body> <div> <form> </div> TEXT NODE </form> </body>
> >
> > ...does not do the opposite in those browsers, despite (in IE) the DOM
> > being equivalent to the previous case. Here, the </div> is not
> > ignored, it implies the </form> and the TEXT NODE ends up a child of
> > <body>.
>
> I think it is reasonable to force the DOM into a tree, which necessarily
> means not doing what IE does in some cases.

Agreed. In the case above, I've gone with IE's closing of <form>, so the
rendering would be more IE-compatible, but the DOM is a tree.

> Also, I think a conformance checker should only have to observe the top
> of the open element stack when deciding what to do with an end tag. That
> is, popping due to non-matching end tag would always be opportunistic
> (possibly leading to an error if a matching start is not found).

Yeah, I think the way the spec is defined you can implement a conformance
checker without looking anywhere but the end of the stack. But you'd only
be able to catch one error at a time.
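
A checker that only ever looks at the end of the stack could be sketched like this. It is a rough illustration under stated assumptions, not the spec's algorithm: the token format (pairs of "start" or "end" plus an element name) and the function name `first_nesting_error` are invented here, every element is assumed to need an explicit end tag, and the checker gives up at the first problem, which is why it can report only one error per run.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical token shape: ("start" | "end", element name).
Token = Tuple[str, str]


def first_nesting_error(tokens: Iterable[Token]) -> Optional[str]:
    """Return a message for the first nesting error, or None if there is none."""
    stack: List[str] = []  # open elements, innermost last
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            # Only the top of the stack is consulted; no deep peeking.
            if not stack:
                return f"</{name}> with no open element"
            if stack[-1] != name:
                return f"</{name}> while <{stack[-1]}> is still open"
            stack.pop()
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    if stack:
        return f"<{stack[-1]}> was never closed"
    return None


# <body> <form> <div> </form> TEXT NODE </div> </body>
misnested = [
    ("start", "body"), ("start", "form"), ("start", "div"),
    ("end", "form"), ("end", "div"), ("end", "body"),
]
print(first_nesting_error(misnested))
# prints: </form> while <div> is still open
```

Because it stops at </form>, the later </div> is never examined; reporting more than one error would need some recovery rule for the stack.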

> However, I assume there may be non-conforming cases where browsers would
> want to peek deeper in the stack before deciding whether to discard a
> misnested end tag or pop until the start tag is found (ie. only pop if
> the start was actually found when peeking deeper in the stack).
> Additional testing and/or reading of source would be needed for
> determining if such deep peeking is happening here or if popping the
> 'form' on </div> is opportunistic. (But </form> apparently causes
> neither deep peeking nor opportunistic popping.)

There are cases where you have to do surgery to the middle of the stack.
So yeah, full implementations would have a lot more work to do.

> > Trying to work out all the various cases is giving me a headache...
>
> Then I hope you sympathize with my selfish desire to get conformance
> checkers exempt from error recovery (ie. allowing them to stop upon
> finding an error).

Hey, now that I've done the work, I want y'all to suffer too. :-P

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 10 March 2006 16:08:09 UTC