- From: David Carlisle <davidc@nag.co.uk>
- Date: Mon, 20 Dec 2010 15:50:55 +0000
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-html-xml@w3.org
On 20/12/2010 14:53, Henri Sivonen wrote:
> On Dec 18, 2010, at 19:39, David Carlisle wrote:
>
>> a well formed fragment such as:
>>
>> aaa<math><b>aaa</b><mtext>bbb</mtext></math>
>>
>> parses as
>>
>> aaa<math></math><b>aaa</b><mtext>bbb</mtext>
>>
>> with the math element being forced closed, and the tree completely re-arranged.
>>
>> no previous version of html specified this, and no browser did this until very recently
>> as gecko and webkit started following the html5 algorithm.
>
> I don't recall this being a common complaint, but I recall you
> mentioning this before. The parsing algorithm is designed not to
> break weird stuff that exists on the Web, such as the content
> depicted in http://junkyard.damowmow.com/339 . The idea is to make
> implementing foreign content as low-risk as possible in terms of
> impact on the rendering of existing content. Hixie searched Google's
> index for HTML content that already contained an <svg> tag or a <math>
> tag and designed the algorithm not to significantly break the
> rendering of those pages. So far this has been a success in the sense
> that I haven't seen a single bug report about Firefox 4 breaking a
> pre-existing site because of the introduction of the foreign content
> feature.

It is not likely to be a common complaint yet, as mathml-in-html isn't yet implemented in any full release browser, so the only people who would have been likely to complain are people who have an interest in mathml or svg and have read the html5 parsing spec in some detail. This cuts down the audience somewhat.

This behaviour really has no justification. If someone was using a (previously undefined) <math>...</math> wrapper around html, they were presumably using it for a reason, in particular to style the math using css, so having the html moved out of the math would not work for those cases either. So it neither preserves any existing behaviour nor produces a desirable behaviour going forward.
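To make the re-arrangement quoted above concrete, here is a rough toy model in Python. It is only a sketch and not the html5 tree builder: it uses the standard library's html.parser as a tokenizer, the names BreakoutModel and reserialise are made up for this illustration, BREAKOUT_TAGS is a small assumed subset of the real list of html tag names that end foreign content, and "inside foreign content" is simplified to "a math or svg element is open".

```python
from html.parser import HTMLParser

# Assumed subset of the html tag names that end foreign content;
# the list in the html5 parsing algorithm is longer.
BREAKOUT_TAGS = {"b", "i", "p", "div", "span", "em", "strong", "table"}
FOREIGN_ROOTS = {"math", "svg"}


class BreakoutModel(HTMLParser):
    """Toy model: re-serialises markup, forcing foreign content closed
    when one of the BREAKOUT_TAGS start tags appears inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements
        self.out = []    # pieces of the re-serialised result

    def in_foreign(self):
        # Simplification: any open math/svg ancestor counts as foreign content.
        return any(tag in FOREIGN_ROOTS for tag in self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in BREAKOUT_TAGS:
            # Pop (and close) elements until no foreign root is open.
            while self.in_foreign():
                self.out.append("</%s>" % self.stack.pop())
        self.stack.append(tag)
        self.out.append("<%s>" % tag)  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag (e.g. the original </math>) is ignored
        while True:
            closed = self.stack.pop()
            self.out.append("</%s>" % closed)
            if closed == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def reserialise(markup):
    model = BreakoutModel()
    model.feed(markup)
    model.close()  # flush any buffered trailing text
    return "".join(model.out)


print(reserialise("aaa<math><b>aaa</b><mtext>bbb</mtext></math>"))
```

Under these assumptions the <b> start tag forces <math> closed, the later </math> no longer matches anything, and <mtext> ends up as a sibling outside the math element.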
I am sure you can find sites that wrapped html in <math> but still work if they are not wrapped, but this isn't really any justification.

>> The other problem has been more widely discussed (and the issues are more complex) but
>>
>> aaa<div/>bbb
>>
>> being parsed as a start tag with bbb inside the div is going to cause confusion forever.
>>
>> HTML4 and XML specified different parsing rules, so your above argument might have been used
>> to say that the html parsing shouldn't change. However HTML5 has changed the parsing here
>> (to be bug compatible with common browsers)
>
> HTML5 hasn't changed parsing here compared to how browsers have behaved since before XML existed.

as I noted.

>
>> but being incompatible with editors and validators
>> using nsgmls or other parsers that did implement HTML4 as specified.
>
> Compatibility with SGML parsers doesn't really matter. The only
> notable SGML parser-based HTML consumer is the W3C Validator and it is
> made obsolete by HTML5 due to other reasons anyway.

Editing tools also use nsgmls (perhaps just in the background). It isn't really true to say it is "just the w3c validator".

>> To introduce new parsing rules for /> at this stage but to make it so incompatible with XML is very hard to understand.
>
> HTML5 doesn't introduce new parsing rules in this case (except for
> foreign content). It documents how things have always been in
> reality. (Previous HTML specs that pretended HTML was an application
> of SGML were out of touch with reality. HTML has always been a
> standalone SGML-inspired language but not an application of SGML for
> practical purposes.)
Anyone (or more to the point any tool) that is using /> is almost certainly generating an empty element (because the syntax was not used until xml introduced it for that). Because people have produced <p/> or whatever and found it didn't work as expected, I am sure you can find cases where the document is then "corrected", ending up with <p/>...</p> or something with <p/> acting as a start tag, so browser behaviour would change, but it would be better for everyone if a way could be found to make this syntax work without breaking old content.

The use of xml syntax in text/html is very common: many content management systems do it, the W3C home page does it, and this behaviour makes the practice incredibly fragile. (As the W3C found out to its cost when it attempted to restyle its existing Recommendations as xml served as text/html and found they all broke, for (only) this reason, with <a id="foo"/> being parsed as a start tag and then being closed and repeatedly re-opened, resulting in whole paragraphs being styled as links and ids being repeated.)

The html5 parser already has a flag to turn off this behaviour: "foreign content". I think it would be good to be able to have a flag to allow "foreign content" style parsing for the html parts as well. Personally I'd use the new doctype <!doctype html> as that flag, but there are other possibilities.

> I think it's possible (even probable) that we will arrive at the
> conclusion that both HTML and XML are too widely deployed to change
> either.

As you say, that is a possibility, in which case the end result is something like the current polyglot spec, which tries to document the regrettably small areas of overlap. But with some goodwill hopefully a more functional overlap could be found.
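As a rough illustration of the /> issue and of what a flag might change, here is a toy model in Python. It is a sketch only, not the html5 algorithm: it uses the standard library's html.parser as a tokenizer, the class OpenElementTracker, the helper text_contexts and the honour_self_closing switch are hypothetical names invented for this example, and it does not model the re-opening of formatting elements that made the <a id="foo"/> case so much worse in practice.

```python
from html.parser import HTMLParser

# html void elements never have content, with or without a trailing "/".
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class OpenElementTracker(HTMLParser):
    """Toy model: records which elements are open when each text run is seen."""

    def __init__(self, honour_self_closing):
        super().__init__()
        # Hypothetical switch standing in for the proposed parser flag.
        self.honour_self_closing = honour_self_closing
        self.open_elements = []
        self.text_runs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        if self.honour_self_closing:
            return  # xml-style: an empty element, nothing is left open
        self.handle_starttag(tag, attrs)  # text/html-style: the "/" is ignored

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Close everything up to and including the nearest matching element.
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.text_runs.append((data.strip(), tuple(self.open_elements)))


def text_contexts(markup, honour_self_closing):
    parser = OpenElementTracker(honour_self_closing)
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text_runs


for flag in (False, True):
    print(flag, text_contexts("aaa<div/>bbb", flag))
    print(flag, text_contexts('<p><a id="foo"/>some text</p>', flag))
```

With the switch off, bbb is recorded inside the div and "some text" inside the a element; with it on, both land where an xml author would expect.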
> On Dec 19, 2010, at 20:20, David Carlisle wrote:
>
>> but it was a very tortuous process that got us to a state where it was possible to have mathml annotation-xml that could contain html (basically as finally specced the parsing of annotation-xml as html or "foreign content" depends on the value of an attribute, which is workable but less than ideal.
>
> How was the process tortuous? I thought the interactions with the
> Math WG went very nicely. As for the <annotation-xml> change in
> particular, think the pushback from Hixie and me was much milder than
> one could have expected for a change of that kind to the parsing
> algorithm.

I hope we/I have a reasonable working relationship with the html group, but that doesn't mean we don't think that you are wrong on some issues. (I know you think I'm wrong on lots of issues.)

The fact that the parser got specified that way in the first place was fairly shocking, and the fact that it wasn't just immediately accepted as a bug was fairly shocking too. The fact that you describe this as "mild pushback" for a "change of that kind" is, I think, indicative of the different world views that are in play here. That is the kind of language one would use for a late request for enhancement that was weighed up and allowed to go in at the last minute, not language you would use to describe fixing a critical bug.

The final resolution chosen, to add another special case parse rule based on a special value in one specified attribute on one specified mathml element, is I think indicative of the problems that lead to the lack of html/xml convergence. The fact that html elements in foreign content abort the foreign content is a generic problem with the html5 parsing algorithm: it will bite any use of xml in html. The resolution of the bug just adds a workaround for the special case of mathml.
In the html5 world view that isn't a problem because xml shouldn't be let loose on the web except for the special cases of mathml and svg (and the issues with svg are a bit different). That is a position that is defensible (and I'm sure that you will defend it with some force :-) but it is I think the root cause of the perceived divergence of xml and html.

David
Received on Monday, 20 December 2010 15:51:31 UTC