
Re: What problem is this task force trying to solve and why?

From: David Carlisle <davidc@nag.co.uk>
Date: Mon, 20 Dec 2010 15:50:55 +0000
Message-ID: <4D0F7B5F.4090201@nag.co.uk>
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-html-xml@w3.org
On 20/12/2010 14:53, Henri Sivonen wrote:
> On Dec 18, 2010, at 19:39, David Carlisle wrote:
>> a well formed fragment such as:
>>    aaa<math><b>aaa</b><mtext>bbb</mtext></math>
>> parses as
>>    aaa<math></math><b>aaa</b><mtext>bbb</mtext>
>> with the math element being forced closed, and the tree completely re-arranged.
>> no previous version of html specified this, and no browser did this until very recently
>> as gecko and webkit started following the html5 algorithm.

> I don't recall this being a common complaint, but I recall you
> mentioning this before. The parsing algorithm is designed not to
> break weird stuff that exists on the Web, such as the content
> depicted in http://junkyard.damowmow.com/339 . The idea is to make
> implementing foreign content as low-risk as possible in terms of
> impact on the rendering of existing content. Hixie searched Google's
> index for HTML content that already contained an <svg> tag or a <math>
> tag and designed the algorithm not to significantly break the
> rendering of those pages. So far this has been a success in the sense
> that I haven't seen a single bug report about Firefox 4 breaking a
> pre-existing site because of the introduction of the foreign content
> feature.

It is not likely to be a common complaint yet, as mathml-in-html isn't 
yet implemented in any full release browser, so the only people who 
would have been likely to complain are people who have an interest in 
mathml or svg and have read the html5 parsing spec in some detail.
This cuts down the audience somewhat.

This behaviour really has no justification. If someone was using a 
(previously undefined) <math>...</math> wrapper around html, they were 
presumably using it for a reason, in particular to style the math using 
css, so having the html moved out of the math would not work for 
those cases either. So it neither preserves any existing behaviour nor
produces a desirable behaviour going forward. I am sure you can find 
sites that wrapped html in <math> but still work if they are not wrapped, 
but this isn't really any justification.

>> The other problem has been more widely discussed (and the issues are more complex) but
>> aaa<div/>bbb
>> being parsed as a start tag with bbb inside the div is going to cause confusion forever.
>> HTML4 and XML specified different parsing rules, so your above argument might have been used
>> to say that the html parsing shouldn't change. However HTML5 has changed the parsing here
>> (to be bug compatible with common browsers)
> HTML5 hasn't changed parsing here compared to how browsers have behaved since before XML existed.

As I noted.

>> but being incompatible with editors and validators
>> using nsgmls or other parsers that did implement HTML4 as specified.
> Compatibility with SGML parsers doesn't really matter. The only
> notable SGML parser-based HTML consumer is the W3C Validator and it is
> made obsolete by HTML5 due to other reasons anyway.

Editing tools also use nsgmls (perhaps just in the background). It isn't 
really true to say it is "just the w3c validator".

>> To introduce new parsing rules for /> at this stage but to make it so incompatible with XML is very hard to understand.

> HTML5 doesn't introduce new parsing rules in this case (except for
> foreign content). It documents how things have always been in
> reality. (Previous HTML specs that pretended HTML was an application
> of SGML were out of touch with reality. HTML has always been a
> standalone SGML-inspired language but not an application of SGML for
> practical purposes.)

Anyone (or more to the point any tool) that is using /> is almost 
certainly generating an empty element (because the syntax was not used 
until xml introduced it for that). Because people have produced <p/> or 
whatever and found it didn't work as expected, I am sure you can find 
cases where the document was then "corrected", ending up with <p/>...</p>
or something similar with <p/> acting as a start tag, so browser 
behaviour would change for those. But it would be better for everyone if 
a way could be found to make this syntax work without breaking old 
content. The use of xml syntax in text/html is very common: many content 
management systems do it, the W3C home page does it, and this behaviour 
makes the practice incredibly fragile. (As the W3C found out to its cost 
when it attempted to restyle its existing Recommendations as xml served 
as text/html and found they all broke, for (only) this reason, with 
<a id="foo"/> being parsed as a start tag and then being closed and 
repeatedly re-opened, resulting in whole paragraphs being styled as 
links and ids being repeated.)

The html5 parser already has a flag to turn off this behaviour: "foreign 
content". I think it would be good to be able to have a flag to allow 
"foreign content" style parsing for the html parts as well.
Personally I'd use the new doctype <!doctype html> as that flag, but 
there are other possibilities.
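
To make the /> point concrete, here is a rough Python sketch. It is a toy 
tree builder, not the HTML5 algorithm: it only models how "/>" is treated 
on non-void html elements. The names ToyTreeBuilder, parse and 
honor_self_closing are made up for this illustration, and the flag stands 
in for whatever switch (doctype or otherwise) might be chosen.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never have content, with or without "/>".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ToyTreeBuilder(HTMLParser):
    """Toy builder: models only the handling of "/>", nothing else."""

    def __init__(self, honor_self_closing=False):
        super().__init__()
        # Hypothetical flag: True means "/>" closes the element, as in XML.
        self.honor_self_closing = honor_self_closing
        self.root = ET.Element("root")
        self.stack = [self.root]

    def _insert(self, tag, attrs):
        return ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})

    def handle_starttag(self, tag, attrs):
        el = self._insert(tag, attrs)
        if tag not in VOID:
            self.stack.append(el)

    def handle_startendtag(self, tag, attrs):
        el = self._insert(tag, attrs)
        # Browser-style behaviour: the "/" is ignored, so the element
        # stays open and swallows whatever follows.
        if tag not in VOID and not self.honor_self_closing:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse(src, honor_self_closing=False):
    builder = ToyTreeBuilder(honor_self_closing)
    builder.feed(src)
    builder.close()
    return builder.root


browser_like = parse("aaa<div/>bbb")
xml_like = parse("aaa<div/>bbb", honor_self_closing=True)
real_xml = ET.fromstring("<root>aaa<div/>bbb</root>")

print(browser_like.find("div").text)  # bbb ends up inside the div
print(xml_like.find("div").tail)      # bbb follows an empty div
print(real_xml.find("div").tail)      # same as a real XML parser
```

With the flag off, "bbb" becomes the content of the div; with it on, the 
toy builder agrees with the XML parser and "bbb" follows an empty div.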

> I think it's possible (even probable) that we will arrive at the
> conclusion that both HTML and XML are too widely deployed to change
> either.

As you say that is a possibility, in which case the end result is 
something like the current polyglot spec which tries to document the 
regrettably small areas of overlap. But with some goodwill hopefully a 
more functional overlap could be found.

> On Dec 19, 2010, at 20:20, David Carlisle wrote:

>> but it was a very tortuous process that got us to a state where it
>> was possible to have mathml annotation-xml that could contain html
>> (basically as finally specced the parsing of annotation-xml as html or
>> "foreign content" depends on the value of an attribute, which is
>> workable but less than ideal).

> How was the process tortuous? I thought the interactions with the
> Math WG went very nicely. As for the <annotation-xml> change in
> particular, I think the pushback from Hixie and me was much milder than
> one could have expected for a change of that kind to the parsing
> algorithm.

I hope we/I have a reasonable working relationship with the html group, 
but that doesn't mean we don't think that you are wrong on some issues.
(I know you think I'm wrong on lots of issues)

The fact that the parser got specified that way in the first place was 
fairly shocking, and the fact that it wasn't just immediately accepted 
as a bug was fairly shocking too. The fact that you describe this as 
"mild pushback" for a "change of that kind" I think is indicative of the 
different world views that are in play here. That is the kind of 
language one would use for a late request for enhancement, that was 
weighed up and allowed to go in at the last minute, not language you 
would use to describe fixing a critical bug. The fact that the final 
resolution chosen, to add another special case parse rule based on a 
special value in one specified attribute on one specified mathml element 
is I think indicative of the problems that lead to the lack of html/xml 

The fact that html elements in foreign content abort the foreign content 
is a generic problem with the html5 parsing algorithm: it will bite any 
use of xml in html. The resolution of the bug just adds a workaround for 
the special case of mathml. In the html5 world view that isn't a problem 
because xml shouldn't be let loose on the web except for the special 
cases of mathml and svg (and the issues with svg are a bit different). 
That is a position that is defensible (and I'm sure that you will defend 
it with some force:-) but it is I think the root cause of the perceived 
divergence of xml and html.
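
As a rough illustration of the "html elements abort foreign content" 
behaviour, here is a toy Python sketch. It is not the html5 tree 
construction algorithm; the BREAKOUT set is only an illustrative subset 
of the html element names involved, and ToyForeignBuilder and 
parse_fragment are made-up names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Illustrative subset only; the real list in the spec is longer.
BREAKOUT = {"b", "i", "p", "div", "span", "table"}
FOREIGN_ROOTS = {"math", "svg"}


class ToyForeignBuilder(HTMLParser):
    """Toy builder: models only the break-out from foreign content."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self.stack = [self.root]

    def in_foreign(self):
        return any(el.tag in FOREIGN_ROOTS for el in self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in BREAKOUT and self.in_foreign():
            # Force-close everything up to and including the math/svg
            # root, then insert the html element outside it.
            while self.in_foreign():
                self.stack.pop()
        el = ET.SubElement(self.stack[-1], tag)
        self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; a stray end tag
        # (such as the now unmatched </math>) is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_fragment(src):
    builder = ToyForeignBuilder()
    builder.feed(src)
    builder.close()
    return builder.root


root = parse_fragment("aaa<math><b>aaa</b><mtext>bbb</mtext></math>")
print([child.tag for child in root])  # math, b and mtext are siblings
print(len(root.find("math")))         # the math element is left empty
```

Under these assumptions the math element is closed as soon as <b> is 
seen, and both <b> and <mtext> end up as its siblings rather than its 
children, which is the re-arrangement described earlier in the thread.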


Received on Monday, 20 December 2010 15:51:31 GMT
