[whatwg] Unsafe SGML minimizations from Ian Hickson on 2006-03-11 (public-whatwg-archive@w3.org from March 2006)

From: Ian Hickson <ian@hixie.ch>
Date: Sat, 11 Mar 2006 00:08:09 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0603110000200.315@dhalsim.dreamhost.com>
On Thu, 8 Sep 2005, Henri Sivonen wrote:
> > 
> > I think it's pretty much guaranteed that HTML5's parsing model will be 
> > able to generate DOMs that can't be serialised to conformant XML 
> > syntax without data loss.
> 
> I am assuming that those situations do not arise if the document is 
> conforming and the loss of details that are lost in XML c14n does not 
> count as data loss. It would be very nice if you defined conformance in 
> such a way that this assumption held true. :-)

Yes, I think conformance will be defined such that a conforming HTML5 
document can always be serialised to a conforming XHTML5 document. If that 
ever turns out not to be the case, please raise the issue! I think this is 
important because people use XML tools and then serialise to HTML, and 
vice versa (e.g. with CMSes that store data in custom formats).


> > For example, the list of characters that must be recognised as part of 
> > an element or attribute name when hitting an unknown element or 
> > attribute is bigger than the list of characters XML allows.
> 
> For the purpose of conformance checking, I've gone the other way and 
> limited names to ASCII. I think that's OK, because conforming names are 
> ASCII. However, I expect that I will have to polish the code that looks 
> for unquoted attribute values. (But I think conforming unquoted 
> attribute values should not include values that weren't SGML-valid in 
> HTML 4.)

As specced, unquoted values can contain pretty much anything.


> > Similarly, a comment in HTML can contain the string "--" (assuming it 
> > comes in pairs), while an XML comment cannot. This latter example even 
> > affects conforming documents.
> 
> From the HTML-as-SGML point of view, there are two comments in <!-- foo 
> -- -- bar -- >, so it would be quite appropriate to convert it into XML 
> as <!-- foo --><!-- bar -->. This reasoning does not quite work for 
> faithfully converting HTML-as-soup.

That's certainly one way to handle it.
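
A minimal sketch of that conversion, assuming the input is a single 
well-formed SGML comment declaration ("<!", then zero or more "--...--" 
comments separated by optional whitespace, then ">"). The function names 
are made up for illustration and are not taken from any spec or tool:

```python
def split_sgml_comments(decl):
    """Return the bodies of the comments inside one SGML comment declaration."""
    if not (decl.startswith("<!") and decl.endswith(">")):
        raise ValueError("not a comment declaration")
    inner = decl[2:-1]          # text between "<!" and ">"
    bodies = []
    pos = 0
    while pos < len(inner):
        if inner[pos].isspace():
            pos += 1            # whitespace is allowed between comments
            continue
        if not inner.startswith("--", pos):
            raise ValueError("expected '--' at offset %d" % pos)
        end = inner.find("--", pos + 2)
        if end == -1:
            raise ValueError("unterminated comment")
        bodies.append(inner[pos + 2:end])
        pos = end + 2
    return bodies


def to_xml_comments(decl):
    """Emit one XML comment per SGML comment found in the declaration."""
    return "".join("<!--%s-->" % body for body in split_sgml_comments(decl))


print(to_xml_comments("<!-- foo -- -- bar -- >"))
```

This only covers the HTML-as-SGML reading; anything that does not fit 
that grammar raises an error here rather than being recovered as soup.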


> > I've been looking at misnested tags recently (hence my replying to 
> > this e-mail despite normally archiving the e-mails about HTML parsing 
> > so that I can get back to them when I start work on that part of the 
> > spec). I assume, based on the line of reasoning that you've been 
> > describing above, that you would agree with me that we should forgo 
> > compatibility with IE in the DOM it forms in response to markup such 
> > as:
> > 
> >    <body> <form> <div> </form> TEXT NODE </div> </body>
> > 
> > What IE does in this case is make the TEXT NODE's parent be the <div> 
> > and its previous sibling be the <form>.
> > 
> > What browsers do tends to vary; but with markup such as the above 
> > Firefox and Safari interoperate on saying that the </form> is ignored 
> > and the form instead continues up to the </body>. However, the exact 
> > opposite:
> > 
> >    <body> <div> <form> </div> TEXT NODE </form> </body>
> > 
> > ...does not do the opposite in those browsers, despite (in IE) the DOM 
> > being equivalent to the previous case. Here, the </div> is not 
> > ignored, it implies the </form> and the TEXT NODE ends up a child of 
> > <body>.
> 
> I think it is reasonable to force the DOM into a tree, which necessarily 
> means not doing what IE does in some cases.

Agreed. In the case above, I've gone with IE's closing of <form>, so the 
rendering would be more IE-compatible, but the DOM is a tree.


> Also, I think a conformance checker should only have to observe the top 
> of the open element stack when deciding what to do with an end tag. That 
> is, popping due to non-matching end tag would always be opportunistic 
> (possibly leading to an error if a matching start is not found).

Yeah, I think the way the spec is defined you can implement a conformance 
checker without looking anywhere but the end of the stack. But you'd only 
be able to catch one error at a time.
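
To illustrate, here is a rough sketch of a checker that only ever looks 
at the top of the stack and stops at the first problem. The token format 
(a list of ("start", name) / ("end", name) pairs) and the function name 
are assumptions for this example; optional tags, void elements, and 
elements left open at the end of input are deliberately ignored:

```python
def first_nesting_error(tokens):
    """Return the index of the first misnested end tag, or None if there is none.

    Only the top of the open element stack is ever inspected, so the
    checker cannot recover and must stop at the first error it finds.
    """
    stack = []
    for index, (kind, name) in enumerate(tokens):
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            if not stack or stack[-1] != name:
                return index        # end tag does not match the top of the stack
            stack.pop()
    return None


# <body> <form> <div> </form> TEXT NODE </div> </body>
misnested = [
    ("start", "body"), ("start", "form"), ("start", "div"),
    ("end", "form"), ("end", "div"), ("end", "body"),
]
print(first_nesting_error(misnested))
```

For the misnested example the checker reports the </form> token and 
never sees whatever follows it.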


> However, I assume there may be non-conforming cases where browsers would 
> want to peek deeper in the stack before deciding whether to discard a 
> misnested end tag or pop until the start tag is found (i.e. only pop if 
> the start was actually found when peeking deeper in the stack). 
> Additional testing and/or reading of source would be needed for 
> determining if such deep peeking is happening here or if popping the 
> 'form' on </div> is opportunistic. (But </form> apparently causes 
> neither deep peeking nor opportunistic popping.)

There are cases where you have to do surgery to the middle of the stack. 
So yeah, full implementations would have a lot more work to do.
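
As a rough illustration of the difference (not the spec's algorithm), 
here are two hypothetical end tag handlers over a plain list used as the 
open element stack: one peeks deeper and pops through the matching 
element only if it is actually there, the other removes the matching 
element from the middle and leaves everything above it open:

```python
def end_tag_with_peek(stack, name):
    """Pop through the matching element, but only if it is on the stack.

    Returns False (and leaves the stack alone) when the end tag matches
    nothing, i.e. the misnested end tag is discarded.
    """
    if name not in stack:
        return False
    while stack.pop() != name:
        pass                    # implicitly close everything above the match
    return True


def end_tag_middle_removal(stack, name):
    """Remove the most recently opened matching element from mid-stack."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            del stack[i]        # elements above it stay open
            return True
    return False


# <body> <div> <form> </div>: the </div> implies the </form>
stack = ["body", "div", "form"]
end_tag_with_peek(stack, "div")
print(stack)

# <body> <form> <div> </form>: take the form out, keep the div open
stack = ["body", "form", "div"]
end_tag_middle_removal(stack, "form")
print(stack)
```

Which handler applies to which element is exactly the part that takes 
the testing; the sketch only shows the two stack operations themselves.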


> > Trying to work out all the various cases is giving me a headache...
> 
> Then I hope you sympathize with my selfish desire to get conformance 
> checkers exempt from error recovery (i.e. allowing them to stop upon 
> finding an error).

Hey, now that I've done the work, I want y'all to suffer too. :-P

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 10 March 2006 16:08:09 UTC