
Re: What problem is this task force trying to solve and why?

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 20 Dec 2010 16:53:42 +0200
Message-Id: <576E3D41-FD01-4D94-8B45-3E79A0C76A13@iki.fi>
To: public-html-xml@w3.org
On Dec 18, 2010, at 19:39, David Carlisle wrote:

> a well formed fragment such as:
>   aaa<math><b>aaa</b><mtext>bbb</mtext></math>
> parses as
>   aaa<math></math><b>aaa</b><mtext>bbb</mtext>
> with the math element being forced closed, and the tree completely re-arranged.
> No previous version of HTML specified this, and no browser did this until very recently,
> when Gecko and WebKit started following the HTML5 algorithm.

I don't recall this being a common complaint, but I recall you mentioning this before. The parsing algorithm is designed not to break weird stuff that exists on the Web, such as the content depicted in http://junkyard.damowmow.com/339 . The idea is to make implementing foreign content as low-risk as possible in terms of impact on the rendering of existing content. Hixie searched Google's index for HTML content that already contained an <svg> tag or a <math> tag and designed the algorithm not to significantly break the rendering of those pages. So far this has been a success in the sense that I haven't seen a single bug report about Firefox 4 breaking a pre-existing site because of the introduction of the foreign content feature.

> The other problem has been more widely discussed (and the issues are more complex) but
> aaa<div/>bbb
> being parsed as a start tag with bbb inside the div is going to cause confusion forever.
> HTML4 and XML specified different parsing rules, so your above argument might have been used 
> to say that the html parsing shouldn't change. However HTML5 has changed the parsing here
> (to be bug compatible with common browsers)

HTML5 hasn't changed parsing here compared to how browsers have behaved since before XML existed.
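To see what actually happens at the tokenizer level, here is a minimal sketch using Python's stdlib html.parser (which, like the HTML5 tokenizer, reports the trailing slash as a self-closing flag; the HTML5 tree builder then ignores that flag for non-void HTML elements such as div, so "bbb" ends up inside the div):

```python
from html.parser import HTMLParser

class Events(HTMLParser):
    """Records the event stream the tokenizer emits."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_startendtag(self, tag, attrs):
        # Fired for "<div/>": the slash sets the self-closing flag,
        # but an HTML5 tree builder ignores it for non-void elements.
        self.events.append(("startend", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        self.events.append(("data", data))

p = Events()
p.feed("aaa<div/>bbb")
p.close()
print(p.events)
# [('data', 'aaa'), ('startend', 'div'), ('data', 'bbb')]
```

Note there is no ("end", "div") event for the slash: the tree builder, not the tokenizer, decides that "bbb" goes inside the div.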

> but being incompatible with editors and validators
> using nsgmls or other parsers that did implement HTML4 as specified. 

Compatibility with SGML parsers doesn't really matter. The only notable SGML parser-based HTML consumer is the W3C Validator, and it is made obsolete by HTML5 for other reasons anyway.

> To introduce new parsing rules for /> at this stage but to make it so incompatible with XML is very hard to understand.

HTML5 doesn't introduce new parsing rules in this case (except for foreign content). It documents how things have always been in reality. (Previous HTML specs that pretended HTML was an application of SGML were out of touch with reality. HTML has always been a standalone SGML-inspired language but not an application of SGML for practical purposes.)

On Dec 18, 2010, at 19:46, Michael Champion wrote:

> So, I'm
> definitely here wearing an HTML hat, and agree with Henri that it's not
> going to be productive to suggest large changes to HTML5, especially to
> make it more XML-like.

I'm glad you agree.

> At TPAC I heard use cases about problems caused by
> some of the *details* of the HTML5 parsing algorithm that create very
> different infosets than an XHTML parser would.  Cataloging such problems
> and brainstorming solutions seems very much in scope for this TF.

For cataloguing, previous work includes

> During my time on the XML team at Microsoft, I learned that XML is very
> widely used, but it is *infrastructure*, figuratively buried under the
> floor.  Very few people are aware that what happens when they plug an
> external device into their computer or start their car depends heavily on
> XML...and that's just fine.  The last thing on earth the XML community
> should want to do is make them aware of that dependence by breaking the
> infrastructure. 

I think it's possible (even probable) that we will arrive at the conclusion that both HTML and XML are too widely deployed to change either.

On Dec 18, 2010, at 20:58, Michael Kay wrote:

> Incorrect XML can come from two places: hand-authored XML, or XML generated by buggy software. I wouldn't expect to see much hand-authored XML, and most of what there is, I would expect to be generated by editing tools that get it right.

It seems to me that a notable source of the YSoD is content management systems whose templates and output snippets are hand-written. I.e. systems that don't run an XML serializer.
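As an illustration (a made-up minimal snippet, not taken from any particular CMS): a hand-written template that interpolates a URL without escaping the ampersand produces markup that HTML parsers recover from silently, but that an XML parser rejects outright, and that hard failure is exactly what surfaces as the YSoD:

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written template output: the '&' in the query
# string is not escaped as '&amp;', so this is not well-formed XML,
# even though every HTML browser renders it without complaint.
snippet = '<p><a href="?page=2&sort=asc">next</a></p>'

try:
    ET.fromstring(snippet)
except ET.ParseError as err:
    print("YSoD territory:", err)  # an XML parser hard-fails here
```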

> I would have said a much bigger factor in "XML's failure" (on the client) was that it's only been since about 2008 that there's been reasonably adequate support for XML processing across all the browsers, and by then the window of opportunity had passed by.

I think "all browsers" is still to come only once IE9 is released. Non-IE browsers reached a usable state earlier than 2008.

On Dec 19, 2010, at 20:20, David Carlisle wrote:

> but it was a very tortuous process that got us to a state where it was possible to have mathml annotation-xml that could contain html (basically as finally specced the parsing of annotation-xml as html or "foreign content" depends on the value of an attribute, which is workable but less than ideal).

How was the process tortuous? I thought the interactions with the Math WG went very nicely. As for the <annotation-xml> change in particular, I think the pushback from Hixie and me was much milder than one could have expected for a change of that kind to the parsing algorithm.

On Dec 19, 2010, at 20:43, Michael Champion wrote:

> Agree in principle but the "DOM sux" argument applies to HTML as well.
> Clearly Dynamic HTML / AJAX / HTML5 Web Apps weren't crippled by the pain
> of Javascript + DOM, so I'm not sure why that would have hurt XML on the
> web worse than it hurt HTML.

I agree. The DOM is flawed in many ways, but it doesn't make sense to attribute XML's lack of success in the browser space to the DOM.

On Dec 20, 2010, at 05:56, James Clark wrote:

> I understand our goal is "convergence" of HTML and XML.  What would constitute convergence?

But why should HTML and XML converge? Who is expected to benefit and how? (I guess a discussion about balancing that benefit with the cost is for later.)

HTML5 already unified the data models as much as was feasible.

> The idea is to make polyglot documents a solid, reliable, workable approach.  HTML5 in the HTML syntax could be processed by XML tools like a normal XML vocabulary, provided only that the XML tools know about the extra constraints of convergent well-formedness.

I think finding out what the "polyglot" intersection is amounts to a spec-lawyering exercise with puzzle appeal, but I totally fail to see how polyglot documents would be a better solution for any practical problem I've seen than HTML parsers and serializers that plug into XML tools by exposing XML APIs.

If one is writing a program for consuming arbitrary Web content, the program needs to contain an HTML parser, because most Web content is HTML and it's not possible to make the authors of all Web content produce polyglot markup going forward, let alone get them to change existing content. Thus, to be able to use an XML parser to consume HTML content, one has to be in control of the content (to make it polyglot) in addition to being in control of the program consuming it. However, if one controls both the content and the program consuming it, one might as well choose to use HTML syntax and an HTML parser or XML syntax and an XML parser. Why self-impose a mix? (I don't buy the learnability argument. Staying within Appendix C was hard for text/html authors who tried to self-impose it.)

It seems to me that the fundamental underlying assumption of the value of polyglot markup is that HTML parsers are scarce and at the same time the author is going to want to maintain content in a format that can be delivered as text/html without preprocessing (because of IE < 9?). I believe HTML parsers that implement the algorithm specified in HTML5 will be available off the shelf for various programming languages just like XML parsers are today. Thus, I think it doesn't make sense to design for the assumption that people won't have both HTML and XML parsers available. (This will probably happen sooner than IE < 9 will have dropped in market share so much that authors no longer care about the old versions, but eventually polyglot will be less interesting also because authors who want to write XML will be able to serve XHTML to browsers across the board.)

As for serialization, a generic XML serializer most likely won't produce polyglot results by chance. People will need a serializer that has been designed for text/html output. However, once you are targeting a serializer to text/html, it is enough to make the output conforming HTML5. There doesn't seem to be an additional benefit from restricting output to being polyglot, since, as noted above, everyone consuming the output of the serializer will have an HTML parser anyway.
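For instance, here is a sketch using Python's stdlib ElementTree as a stand-in for "a generic XML serializer": an empty script element serializes to a self-closing tag, which is perfectly well-formed XML but, fed to an HTML parser, leaves the script element open and swallows the rest of the page:

```python
import xml.etree.ElementTree as ET

root = ET.Element("div")
ET.SubElement(root, "script", src="app.js")  # script element with no body

out = ET.tostring(root, encoding="unicode")
print(out)
# An HTML parser ignores the trailing slash on <script ... />, so the
# script stays open and consumes everything after it as script text:
# well-formed XML, broken HTML, hence not polyglot output by chance.
```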

> Also I think we should look at the HTML5 distributed extensibility issue
> http://www.w3.org/html/wg/tracker/issues/41 

The issue has been through a poll per the Decision Process of the HTML WG. (Unfortunately, the chairs *still* haven't rendered a Decision.) I think it would be out of order to poke at the issue again.

Henri Sivonen
Received on Monday, 20 December 2010 14:54:19 GMT
