Re: What problem is this task force trying to solve and why? from Henri Sivonen on 2011-01-04 (public-html-xml@w3.org from January 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 4 Jan 2011 15:07:58 +0200
To: public-html-xml@w3.org
Message-Id: <B04544A5-4089-497D-BB2A-31B40A23EF64@iki.fi>
On Dec 20, 2010, at 17:50, David Carlisle wrote:

> On 20/12/2010 14:53, Henri Sivonen wrote:
>> On Dec 18, 2010, at 19:39, David Carlisle wrote:
>> 
>>> a well formed fragment such as:
>>> 
>>>   aaa<math><b>aaa</b><mtext>bbb</mtext></math>
>>> 
>>> parses as
>>> 
>>>   aaa<math></math><b>aaa</b><mtext>bbb</mtext>
>>> 
>>> 
>>> 
>>> with the math element being forced closed, and the tree completely re-arranged.
>>> 
>>> no previous version of html specified this, and no browser did this until very recently
>>> as gecko and webkit started following the html5 algorithm.
>> 
> 
>> I don't recall this being a common complaint, but I recall you
>> mentioning this before. The parsing algorithm is designed not to
>> break weird stuff that exists on the Web, such as the content
>> depicted in http://junkyard.damowmow.com/339 . The idea is to make
>> implementing foreign content as low-risk as possible in terms of
>> impact on the rendering of existing content. Hixie searched Google's
>> index for HTML content that already contained an <svg> tag or a <math>
>> tag and designed the algorithm not to significantly break the
>> rendering of those pages. So far this has been a success in the sense
>> that I haven't seen a single bug report about Firefox 4 breaking a
>> pre-existing site because of the introduction of the foreign content
>> feature.
> 
> It is not likely to be a common complaint yet as mathml-in-html isn't yet implemented in any full release browser, so the only people who would have been likely to complain are people who have an interest in mathml or svg and have read the html5 parsing spec in some detail.
> this cuts down the audience somewhat.

Sure, but I'd hope that audience to have been larger than one person (you).

> This behaviour really has no justification.

It sure has. Hixie ran an analysis over a substantial quantity of Web pages in Google's index and found existing text/html content that contained an <svg> tag or a <math> tag. The justification is making the algorithm not break a substantial quantity of pages like that.

> If someone was using a (previously undefined) <math>...</math> wrapper around html, they were presumably using it for a reason in particular to style the math using css, thus having the html be moved out of the math would not work for those cases either. So it neither preserves any existing behaviour nor
> produces a desirable behaviour going forward. I am sure you can find sites that wrapped html in <math> but still work if they are not wrapped but this isn't really any justification.

Web authors do all sorts of crazily bizarre things. It's really not useful to try to apply logic to try to reason what kind of existing content there should be.

If you want to contest Hixie's findings, I suggest running and publishing your own analysis of a substantial corpus of Web content. (By running the current HTML parsing algorithm instrumented to count how often <math> and <svg> are seen and how often the break-out behavior is triggered.) Unfortunately, people who don't have access to Google's or Bing's index are not in as good a position to run such analyses as Hixie, but running an analysis over dotbot data (http://www.dotnetdotcom.org/) goes a long way further than mere conjecture.

(To be clear, I haven't independently verified Hixie's findings, but I presume them to be true in the absence of evidence to the contrary.)
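
For anyone who wants to attempt such a count, here is a rough sketch in Python. It is only an approximation under stated assumptions: it uses the standard library's html.parser tokenizer rather than a real HTML5 tree builder, it recognizes integration points by element name alone, and the names ForeignContentCounter and count_foreign_content are made up for illustration. A serious analysis would instrument an actual implementation of the parsing algorithm.

```python
from html.parser import HTMLParser

# Start tags that force a break-out of foreign content in the HTML parsing
# algorithm; <font> only counts with a color, face or size attribute.
BREAKOUT_TAGS = frozenset(
    "b big blockquote body br center code dd div dl dt em embed h1 h2 h3 h4 "
    "h5 h6 head hr i img li listing menu meta nobr ol p pre ruby s small span "
    "strong strike sub sup table tt u ul var".split()
)
# Simplification: elements inside which HTML content is parsed normally,
# matched by name only. html.parser lowercases names ("foreignobject").
INTEGRATION_POINTS = frozenset(
    {"mi", "mo", "mn", "ms", "mtext", "foreignobject", "desc", "title"}
)
HTML_ENCODINGS = frozenset({"text/html", "application/xhtml+xml"})


class ForeignContentCounter(HTMLParser):
    """Counts <math>/<svg> roots and approximate break-out events."""

    def __init__(self):
        super().__init__()
        self.roots = {"math": 0, "svg": 0}
        self.breakouts = 0
        # Open elements since the outermost foreign root, as (tag, kind)
        # pairs where kind is "foreign", "integration" or "html".
        self._stack = []

    def _top_kind(self):
        return self._stack[-1][1] if self._stack else "html"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.roots:
            self.roots[tag] += 1
            self._stack.append((tag, "foreign"))
        elif self._top_kind() != "foreign":
            # Ordinary HTML context; track it only inside a foreign subtree.
            if self._stack:
                self._stack.append((tag, "html"))
        elif tag in BREAKOUT_TAGS or (
            tag == "font" and attrs.keys() & {"color", "face", "size"}
        ):
            # Break-out: the open foreign elements are closed.
            self.breakouts += 1
            while self._top_kind() == "foreign":
                self._stack.pop()
            if self._stack:
                self._stack.append((tag, "html"))
        else:
            encoding = (attrs.get("encoding") or "").lower()
            integration = tag in INTEGRATION_POINTS or (
                tag == "annotation-xml" and encoding in HTML_ENCODINGS
            )
            self._stack.append((tag, "integration" if integration else "foreign"))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                return


def count_foreign_content(markup):
    counter = ForeignContentCounter()
    counter.feed(markup)
    counter.close()
    return counter.roots, counter.breakouts
```

Run over a corpus, the ratio of break-outs to <math>/<svg> roots would give a first indication of how often existing content depends on the behavior.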

>>> but being incompatible with editors and validators
>>> using nsgmls or other parsers that did implement HTML4 as specified.
>> 
>> Compatibility with SGML parsers doesn't really matter. The only
>> notable SGML parser-based HTML consumer is the W3C Validator and it is
>> made obsolete by HTML5 due to other reasons anyway.
> 
> Editing tools also use nsgmls (perhaps just in the background) It isn't really true to say it is "just the w3c validator".

Which tools? Is the plural really justified or is this about one Emacs mode?

>>> To introduce new parsing rules for /> at this stage but to make it so incompatible with XML is very hard to understand.
>> 
> 
>> HTML5 doesn't introduce new parsing rules in this case (except for
>> foreign content). It documents how things have always been in
>> reality. (Previous HTML specs that pretended HTML was an application
>> of SGML were out of touch with reality. HTML has always been a
>> standalone SGML-inspired language but not an application of SGML for
>> practical purposes.)
>> 
> 
> anyone (or more to the point any tool) that is using /> is almost certainly generating an empty element (because the syntax was not used until xml introduced it for that). Because people have produced <p/> or whatever and found it didn't work as expected, I am sure you can find cases where the document is then "corrected", ending up with <p/>...</p>

The way legacy content comes into being is this:
 1) A person writes some markup.
 2) The person mashes the markup until it looks good enough in his/her browser.
 3) The person publishes the markup.
 4) The person stops maintaining the markup.

At step #1, the person might *think* (s)he wrote something that results in an element with no children. However, at step #2 (s)he might have concluded that the rendering that actually arises from subsequent markup causing the element to have children looks "right". When that happens, the markup published at step #3 could break if browsers started actually making the element have no children.
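
To make the difference concrete, here is a toy sketch in Python. It is not the HTML5 tree builder: SlashIgnoringBuilder and parse_html_ignoring_slash are hypothetical names, the void element list is abbreviated, and the only text/html rule modeled is that a trailing slash on a non-void HTML start tag is ignored. The XML side uses the standard library's ElementTree parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Abbreviated list of void elements (assumption: enough for the illustration).
VOID_ELEMENTS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                           "input", "link", "meta", "param", "wbr"})


class SlashIgnoringBuilder(HTMLParser):
    """Builds a small tree the way text/html treats "<p/>": as "<p>"."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self._open = [self.root]

    def handle_starttag(self, tag, attrs):
        element = ET.SubElement(self._open[-1], tag)
        if tag not in VOID_ELEMENTS:
            self._open.append(element)

    def handle_startendtag(self, tag, attrs):
        # The trailing slash makes no difference for HTML elements.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest matching element, implicitly closing descendants.
        for i in range(len(self._open) - 1, 0, -1):
            if self._open[i].tag == tag:
                del self._open[i:]
                return

    def handle_data(self, data):
        parent = self._open[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html_ignoring_slash(markup):
    builder = SlashIgnoringBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


markup = "<div><p/>text</div>"
html_p = parse_html_ignoring_slash(markup).find("div/p")
xml_p = ET.fromstring(markup).find("p")
print(repr(html_p.text))  # the text ends up inside the p element
print(repr(xml_p.text), repr(xml_p.tail))  # the p is empty; the text follows it
```

A page that was tweaked at step #2 until the first result looked right is exactly the kind of content that would change if the second interpretation were adopted.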

>> How was the process tortuous? I thought the interactions with the
>> Math WG went very nicely. As for the <annotation-xml> change in
>> particular, I think the pushback from Hixie and me was much milder than
>> one could have expected for a change of that kind to the parsing
>> algorithm.
> 
> I hope we/I have a reasonable working relationship with the html group, but that doesn't mean we don't think that you are wrong on some issues.
> (I know you think I'm wrong on lots of issues)
> 
> The fact that the parser got specified that way in the first place was fairly shocking, and the fact that it wasn't just immediately accepted as a bug was fairly shocking too. The fact that you describe this as "mild pushback" for a "change of that kind" I think is indicative of the different world views that are in play here. That is the kind of language one would use for a late request for enhancement, that was weighed up and allowed to go in at the last minute, not language you would use to describe fixing a critical bug.

I can confirm that I viewed the feature as a last-minute enhancement and not as a critical bug.

In fact, I opened the blog.whatwg.org admin interface to blog about HTML content now being supported in annotation-xml by Firefox 4 and Validator.nu, but then I couldn't think of a realistic example to use in the blog post. So far, I haven't seen a single example of XHTML in annotation-xml in existing real-world application/xhtml+xml content.

If I saw some, I might change my view of the feature in question.

> In the html5 world view that isn't a problem because xml shouldn't be let loose on the web except for the special cases of mathml and svg (and the issues with svg are a bit different).

More precisely, my (I'm hesitant to claim this as a general HTML5 world view) world view says that using vocabularies that the receiving software doesn't understand is a worse way of communicating than using vocabularies that the receiving software understands. (And if you ship a JavaScript or XSLT program to the recipient, the interface of communication to consider isn't the input to your program but its output. For example, if you serve FooML with <?xml-stylesheet?> that transforms it to HTML, you are effectively communicating with the recipient in HTML--not in FooML.)

On Dec 21, 2010, at 21:51, Michael Champion wrote:

> On 12/21/10 8:41 AM, "Henri Sivonen" <hsivonen@iki.fi> wrote:
> 
>> 
>> 
>> Wrapping up HTML5 mainly has Patent Policy and public relations effects.
>> Those are important things, sure, but wrapping up HTML5 doesn't change
>> the technical constraints.
> 
> 
> I strongly disagree that Recommendations are just for PP and PR. Concrete
> Recommendations put a stake in the ground for the *users*. Implementers
> always want to revise continuously, but mainstream website developers need
> to know what is really stable and what they can count on for years to
> come.

I think stability from the user point of view is largely a PR issue. Thinking that HTML5 is too unstable to use yet is mostly an illusion. OTOH, thinking that RECs are something that are stable to rely on is an illusion, too. The truth is somewhere in the middle. The platform keeps changing all the time but it changes in mostly safe ways when something has already been adopted by multiple implementations. And in any case, the change is something the user/author doesn't get to control (well, at least not in non-IE browsers where you can't opt to use a legacy snapshot of the browser engine). A person who thinks (s)he is writing "stable" HTML 4.01 gets his/her content processed by the same ever-evolving code that processes the content made by a person who thinks (s)he is using the HTML5 draft standard.

On Dec 22, 2010, at 02:58, James Clark wrote:

> The backwards compatibility constraint is that you can't break (any significant amount of) existing content on the Web.  I appreciate and agree with that constraint.
> 
> However, this constraint alone does not require the parsing incompatibilities between HTML5 and XML.  The parsing incompatibilities only become required when you add in the design goal to eliminate modes i.e. that standards mode will be made as close as possible to quirks mode.  Now I can certainly see the advantages of this design goal, but there are also significant costs, and I think reasonable people can disagree about the right tradeoff.
> 
> Let's take perhaps the most egregious example, that HTML5 requires that </br> be treated like <br>. As of only a year or so ago, both WebKit and Gecko had made the judgement that a different treatment of </br> was desirable in standards mode (i.e. ignore it).  This is something that informed people can have different opinions on.
> 
> I think presenting XML/HTML5 incompatibilities as a necessary consequence of backwards compatibility is deeply misleading.

The HTML5 effort has tried to minimize divergence and to actively seek convergence between the code paths for legacy text/html content and for newly-authored text/html content. The HTML5 effort has also sought to minimize divergence on the DOM level between the tree created by parsing text/html and the tree created by parsing application/xhtml+xml. The HTML5 effort has also reduced syntactic divergence by making valid the syntactic talismans popularized by the infamous Appendix C.

If you wanted to make the code path for processing HTML.next more like the code path for processing XML, you'd make it less like the code path(s) for processing legacy text/html. That is, you'd introduce convergence relative to one thing but divergence relative to another. To see only the convergence, you'd need to pretend to forget about the divergence that got introduced relative to the other point of reference. Indeed, this is the trick the W3C used for the past decade: The W3C pretended HTML had been end-of-lifed and didn't exist anymore, so there was only one glorious unified XML stack to observe. But this didn't change the reality that implementations still had to support HTML.

If you introduce HTML.next that's so unlike legacy HTML that a new mode is needed but not enough like XML to use the XML code path, you haven't created convergence or reduced the number of stacks. Instead, there'd be one more stack that's divergent from both legacy HTML *and* XML!

What about an HTML.next that's 100% convergent with XML and has a mode switch for opting in? It turns out that we already have that! It's called XHTML5 and the mode switch is the Content-Type: application/xhtml+xml HTTP header. Even better than some yet-to-be-defined HTML.next mode, it's already supported by the latest versions of the top browsers (if you count IE9 as the latest version of IE).
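
As a minimal sketch of that mode switch, assuming a consumer that supports only these two media types (the function name parser_for and the return values are made up for illustration), the choice of code path depends on the Content-Type header alone and not on anything inside the document:

```python
from email.message import Message


def parser_for(content_type_header):
    """Pick a parsing code path from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases.
    media_type = message.get_content_type()
    if media_type == "application/xhtml+xml":
        return "xml"   # XML parser, Draconian error handling
    if media_type == "text/html":
        return "html"  # HTML parsing algorithm
    return "unsupported"


print(parser_for("application/xhtml+xml; charset=utf-8"))
print(parser_for("text/html"))
```

The same bytes can therefore take either path depending on how the server labels them.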

On Dec 22, 2010, at 03:36, John Cowan wrote:

> There is a similar problem within HTML itself.  Since there is a standard
> way of handling unknown tags to which all HTML5 browsers must conform,
> then either:
> 
> 1) All future HTML tags must be processed in the same way as unknown
> tags, or
> 
> 2) Some HTML.next documents will have different DOMs in HTML5 and
> HTML.next browsers (thus discarding backward compatibility), or
> 
> 3) There will be no future HTML tags, ever.
> 
> It's not clear to me which arm of this trichotomy the WG will accept.

This is a bit of a dirty open secret of HTML5. We pretend in rhetoric that #1 is true, but in practice, if you consider elements introduced by HTML5 and how they behaved in pre-HTML5 browsers, #2 is true.

Note that this doesn't merely include the set of void elements. It also includes the set of tags that auto-close the p element. In practice, #3 can be relaxed to "There will be no future void elements ever nor elements that are meant to contain paragraphs but aren't meant to be contained in paragraphs."
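
A toy illustration of the p-closing case in Python follows. The two tag sets are small illustrative subsets rather than the lists any real browser used, and NestingTracer and element_paths are hypothetical names. The same markup nests differently depending on whether the parser knows that <section> closes an open p:

```python
from html.parser import HTMLParser

# Illustrative subsets only: tags that implicitly close an open <p>.
LEGACY_P_CLOSERS = frozenset({"p", "div", "ul", "ol", "table", "blockquote", "pre"})
HTML5_P_CLOSERS = LEGACY_P_CLOSERS | {"section", "article", "nav", "aside"}


class NestingTracer(HTMLParser):
    """Records the ancestor path of each element as it is opened."""

    def __init__(self, p_closers):
        super().__init__()
        self.p_closers = p_closers
        self._open = []
        self.paths = []

    def _close(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == tag:
                del self._open[i:]
                return

    def handle_starttag(self, tag, attrs):
        if tag in self.p_closers:
            self._close("p")
        self._open.append(tag)
        self.paths.append("/".join(self._open))

    def handle_endtag(self, tag):
        self._close(tag)


def element_paths(markup, p_closers):
    tracer = NestingTracer(p_closers)
    tracer.feed(markup)
    tracer.close()
    return tracer.paths


markup = "<p>intro<section>body</section>"
print(element_paths(markup, LEGACY_P_CLOSERS))  # section nested inside p
print(element_paths(markup, HTML5_P_CLOSERS))   # section is a sibling of p
```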

On Dec 22, 2010, at 13:07, David Carlisle wrote:

> In the minutes, Henri is quoted as:
> 
>>    Henri: The counter-intuitive behavior only arises if the document is an
>>    error. If you try to do sensible stuff, you don't see this behavior.
> 
> Technically this is a true statement, as the definition of "error" is essentially "input which causes the parser to behave like this" and one might say that it is sensible to try to avoid this behaviour.
> 
> However there are many reasonable cases where the defined behaviour causes a document to break (and essentially no reasonable documents where it does anything useful).
> 
> For those not well versed in MathML, it has an annotation-xml element that allows arbitrary well formed XML as structured annotations around essentially any term. In a browser case, if the annotation isn't html then probably the _only_ thing it has to do is not mess it up and leave it in the DOM for a script or other process to use later.

The design goal of HTML5 is to allow content in a vocabulary that the browser supports and can render and to provide for clipboard export of Semantic MathML when Presentation MathML is used by the browser. It's not a goal to support arbitrary processing by scripts.

> An example for Norm, with a variable annotated with docbook:
> 
> <math>
> <mfrac>
> <semantics>
> <mi>x</mi>
> <annotation-xml encoding="docbook">
> <para>some docbook with a <code>code</code> fragment</para>
> </annotation-xml>
> </semantics>
> <mi>y</mi>
> </mfrac>
> </math>
> 
> 
> that is valid (modulo namespaces which are not relevant here) according to any published schema for mathml.

Not according to the schema html5.validator.nu uses if that counts as published. :-)

More to the point, DocBook is not XHTML+MathML if we consider that to mean "XHTML and MathML and nothing more". If you aren't allowed to dump DocBook content as a child of an HTML element, it doesn't really make sense to enable dumping it inside annotation-xml.

> In order to support one or two sites that allegedly were using an undefined <math> element containing html but which still work if the math is closed and the html is moved outside the math, then every editing and document production system that is producing documents is supposed to somehow have special code to avoid this happening _forever_?

The "special" code shouldn't look for certain local names. Editing systems should refrain from serializing elements that aren't in the http://www.w3.org/1999/xhtml, http://www.w3.org/1998/Math/MathML or http://www.w3.org/2000/svg namespaces when outputting to text/html (.html).
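
A minimal sketch of such a check in Python, assuming the editing system holds a namespace-aware tree (here an ElementTree) before it serializes to text/html. The function name is hypothetical, and the DocBook namespace URI in the usage example is an assumption, since the example quoted above omits namespaces:

```python
import xml.etree.ElementTree as ET

# The three namespaces named above as safe to serialize into text/html.
TEXT_HTML_NAMESPACES = frozenset({
    "http://www.w3.org/1999/xhtml",
    "http://www.w3.org/1998/Math/MathML",
    "http://www.w3.org/2000/svg",
})


def foreign_vocabulary_elements(root):
    """Return the tags of elements that should not be written to text/html."""
    offenders = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        # ElementTree spells qualified names as "{namespace}local".
        if element.tag.startswith("{"):
            namespace = element.tag[1:].partition("}")[0]
        else:
            namespace = ""
        if namespace not in TEXT_HTML_NAMESPACES:
            offenders.append(element.tag)
    return offenders


doc = ET.fromstring(
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>x</mi>'
    '<annotation-xml encoding="docbook">'
    '<para xmlns="http://docbook.org/ns/docbook">some <code>code</code></para>'
    '</annotation-xml></semantics></math>'
)
print(foreign_vocabulary_elements(doc))
```

The check is on namespaces only, so it needs no list of local names and does not have to change when a vocabulary gains elements.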

> It affects SVG in a slightly different way. SVG has a very similar foreignObject element that again is supposed to be able to take arbitrary well formed XML. However to avoid the above problem foreignObject is defined by the html5 parser to take html content (like annotation-xml[@encoding='text/html'] or mtext) so the case of html in svg is OK, but any other XML in that position will be mis-parsed if it uses empty element syntax.

HTML5 isn't trying to support SVG's integration points in an open-ended way. HTML5 is trying to support the integration of HTML, MathML and SVG--not arbitrary other vocabularies. Therefore, the content that isn't SVG can only be HTML or MathML content. The HTML5 parsing algorithm makes both work in foreignObject. (Also if <svg> appears as a child of <foreignObject>, it, too, works.)

Likewise, since only HTML, MathML and SVG are being integrated--not arbitrary other vocabularies--only HTML, MathML and SVG children of annotation-xml are catered for.

On Dec 22, 2010, at 22:00, Michael Kay wrote:

> > * Being liberal in what you accept has arguably proven useful on the Web,
> 
> I've always felt there is a third way here. When content is wrong, don't punish the end user (it's not their fault), but don't give the impression that everything is hunkydory either. Repair the content and display it as best you can, but tell the user you've repaired it (perhaps even ask them whether they want you to repair it), warn them that it might be incomplete or incorrectly formatted, and invite them to contact the webmaster to get it fixed.

This would be bad, because introducing new features would become hard as Web authors would shy away from using new features that old browsers label as broken in a user-visible way. 

> Without this, we seem to be in a descending spiral of declining content quality. What happens when browsers start auto-repairing broken Javascript code or JSON data?

There are security incentives that work against starting to repair broken JavaScript where "broken" is what's broken per ES3. However, I wouldn't be at all surprised if we ended up in a situation where every vendor has an incentive not to enforce the ES5 Strict Mode in order to "work" with more Web content than a competing product that halts on Strict Mode violations and the ES5 Strict Mode effort collapsed.

On Dec 23, 2010, at 05:18, Kurt Cagle wrote:

> 1) RSS2.0 documents are notorious for being "unparseable" within XML.

It seems that when an XML vocabulary becomes popular among people who aren't XML experts, stuff becomes unparseable as true XML. Not a great sign of success for XML's Draconian policy and the theory behind it.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 4 January 2011 13:09:03 UTC