Re: What problem is this task force trying to solve and why? from Henri Sivonen on 2011-01-05 (public-html-xml@w3.org from January 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 5 Jan 2011 12:31:17 +0200
To: public-html-xml@w3.org
Message-Id: <B150778F-694C-43DA-AC3C-43B2B57FE8D0@iki.fi>
On Jan 4, 2011, at 16:34, David Carlisle wrote:

>>> This behaviour really has no justification.
>> 
>> It sure has. Hixie ran an analysis over a substantial quantity of Web
>> pages in Google's index and found existing text/html content that
>> contained an <svg> tag or a <math> tag. The justification is making
>> the algorithm not break a substantial quantity of pages like that.
> 
> "no justification" was perhaps overstating it but I think that this is very weak justification, especially as it isn't even preserving the old behaviour.

It's trying to preserve old readability properties: for example, making sure that paragraphs of text don't disappear from the rendering because they got swallowed into an <svg> subtree.

> If math-encoded-as-html was previously wrapped in a math tag there was presumably some reason for that wrapping, styling with css, or accessing it with javascript or something, neither of which would still work, although admittedly the fallback behaviour is improved by the current html5 model.

Or there was just random cargo-cult copying and pasting going on. How else would you explain the markup seen in http://junkyard.damowmow.com/339 ?

>> If you want to contest Hixie's findings, I suggest running and
>> publishing your own analysis of a substantial corpus of Web content.
>> (By running the current HTML parsing algorithm instrumented to count
>> how often <math> and <svg> are seen and how often the break-out
>> behavior is triggered.) Unfortunately, people who don't have access
>> to Google's or Bing's index are not in as good a position to run such
>> analyses as Hixie, but running an analysis over dotbot data
>> (http://www.dotnetdotcom.org/) goes a long way further than mere
>> conjecture.
> 
> I could do, but what's a lot, how many pages do I need to find before it is significant?
> 10, 100, 10000, 1000000?

There's no hard number. Too many pages break when the number of breaking pages makes the people who hold a veto over browser code base changes too nervous to allow the breaking change you are proposing. What makes people too nervous isn't entirely logical and depends on the people, their mood, their view of the expected benefit of the change, etc.

Sometimes people treat one report of one breaking page as something that needs to be fixed in the consuming code. Sometimes people are OK with breaking output from an authoring tool even if such output is sprinkled across many sites and can't be evangelized and modified as a single operation.

The foreign content feature of the HTML parsing algorithm probably errs on the side of being more prudent than strictly necessary. However, this has been a great success in the sense that the feature hasn't resulted in reports of broken pages, so no one with veto over what goes in Gecko has gotten scared and started arguing for killing the feature.

> Also as I say, even when the breakout behaviour is triggered, that is not at all evidence that the existing behaviour is being preserved unless you also check that no css or javascript is assuming the html is inside the math element.

Yeah, one would probably also need to manually examine at least some of the discovered pages that have <svg> or <math> in them.
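
A rough sketch of the counting step, in Python. It uses the standard library's html.parser, which is not the HTML5 parsing algorithm, so it only approximates "an <svg> or <math> start tag was seen" and "an HTML start tag showed up while such an element was still open". BREAKOUT_TAGS is an illustrative subset made up for this example, not the list from the spec, and the flagged pages would still need manual examination.

```python
from html.parser import HTMLParser

FOREIGN_ROOTS = {"svg", "math"}
# Assumption: illustrative subset only; the real algorithm has its own list.
BREAKOUT_TAGS = {"p", "div", "table", "ul", "h1", "h2", "img", "br"}


class ForeignContentCounter(HTMLParser):
    """Counts foreign roots and HTML start tags seen while one is open."""

    def __init__(self):
        super().__init__()
        self.open_foreign = 0
        self.foreign_seen = 0
        self.breakout_candidates = 0

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.foreign_seen += 1
            self.open_foreign += 1
        elif self.open_foreign and tag in BREAKOUT_TAGS:
            self.breakout_candidates += 1
            self.open_foreign = 0  # treat the page as having left foreign content

    def handle_endtag(self, tag):
        if tag in FOREIGN_ROOTS and self.open_foreign:
            self.open_foreign -= 1


def analyze(pages):
    """Tally pages that contain foreign roots and pages that look like break-outs."""
    totals = {"pages": 0, "with_foreign": 0, "with_breakout": 0}
    for markup in pages:
        counter = ForeignContentCounter()
        counter.feed(markup)
        counter.close()
        totals["pages"] += 1
        totals["with_foreign"] += counter.foreign_seen > 0
        totals["with_breakout"] += counter.breakout_candidates > 0
    return totals


sample_pages = [
    "<p>plain</p>",
    "<svg><circle r='1'/></svg><p>after</p>",
    "<math><p>legacy text</p>",
]
print(analyze(sample_pages))
```

A real analysis would run the actual HTML parsing algorithm instrumented in the same way; this only shows the shape of the bookkeeping.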

>> (To be clear, I haven't independently verified Hixie's findings, but
>> I presume them to be true in the absence of evidence to the
>> contrary.)
>> 
> 
>> Which tools? Is the plural really justified or is this about one
>> Emacs mode?
> 
> well personally I tend to use emacs, so I'm not aware of what parser other html tools use to drive context dependent support, but certainly I've seen tools with buttons to run locally installed (nsgmls based) validators on the generated documents.

Validators written for an old language snapshot are necessarily obsoleted when you want to start using a newer snapshot of the evolved language.

>> The way legacy content comes into being is this: 1) A person writes
>> some markup. 2) The person mashes the markup until it looks good
>> enough in his/her browser. 3) The person publishes the markup. 4) The
>> person stops maintaining the markup.
>> 
>> At step #1, the person might *think* (s)he wrote something that
>> results in an element with no children. However, at step #2 (s)he
>> might have concluded that the rendering that actually arises from
>> subsequent markup causing the element to have children looks "right".
>> When that happens, the markup published at step #3 could break if
>> browser started actually making the element have no children.
> 
> I agree that this is a problem, but as James Clark commented earlier
> such quirks could have been constrained to a parsing mode that was not used for <!doctype html> The fact that you're trying to make quirks mode just affect css rather than parsing as far as possible is not altogether unreasonable, but not something that could not clearly have been done differently.

Making all processing modes for text/html converge is a matter of reducing the number of code paths. To the extent people are really concerned about introducing more stacks, convergence in that department should be good. However, if people aren't *really* concerned about code path proliferation but are concerned about XML falling from grace at the W3C, the conclusions are different, of course.

I hope we can agree that the code paths for consuming legacy HTML and those for consuming legacy XML can't be unified, so the lower bound for the number of parsing code paths is 2 anyway.

>> More precisely, my (I'm hesitant to claim this as a general HTML5
>> world view) world view says that using vocabularies that the
>> receiving software doesn't understand is a worse way of communicating
>> than using vocabularies that the receiving software understands. (And
>> if you ship a JavaScript or XSLT program to the recipient, the
>> interface of communication to consider isn't the input to your
>> program but its output. For example, if you serve FooML
>> with <?xml-stylesheet?> that transforms it to HTML, you are
>> effectively communicating with the recipient in HTML--not in FooML.)
> 
> I'd agree with that, but here the input to javascript (or xslt or whatever) isn't the whole document served as application/xml it's a fragment of a text/html document inside annotation-xml or svg foreign content (or for that matter an html div) the current html5 rules make making such a fragment in a way that can be safely parsed very much harder than it could have been, in particular because it can't just be deferred to an html5 serialiser. Even if an html5 serialiser were added to xslt for example, it would be unlikely (I would guess) to be able to do anything to avoid these local name clashes in foreign content. So the generation itself will have to be programmed in each case to avoid this. So to give the example that I gave earlier, while annotating an expression with docbook should be trivial, annotating in a way that may be parsed by html5 is rather harder.

Yeah, to smuggle DocBook data in MathML inside text/html, you'd need to serialize the DocBook fragment as XML, put the result in a text node inside <annotation> (as opposed to <annotation-xml>) and then serialize that as text/html. This would be rather RSS-ish and, therefore, inelegant from an XML point of view.
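
A minimal sketch of that double serialization, in Python. The DocBook fragment, the surrounding MathML, and the encoding attribute value are made up for the example, and html.parser stands in for a real text/html parser on the receiving side.

```python
from html import escape
from html.parser import HTMLParser
import xml.dom.minidom

# Step 1: the DocBook fragment, already serialized as XML (hypothetical content).
docbook_xml = "<para>See <emphasis>Chapter 2</emphasis> &amp; onward.</para>"


def wrap_in_annotation(xml_fragment, encoding="application/docbook+xml"):
    """Steps 2 and 3: put the XML in a text node in <annotation>, serialized as text/html."""
    return (
        "<math><semantics><mi>x</mi>"
        f'<annotation encoding="{escape(encoding)}">'
        f"{escape(xml_fragment, quote=False)}"
        "</annotation></semantics></math>"
    )


class AnnotationText(HTMLParser):
    """Collects the text content of <annotation> on the receiving side."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "annotation":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "annotation":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


markup = wrap_in_annotation(docbook_xml)
reader = AnnotationText()
reader.feed(markup)
reader.close()
recovered = "".join(reader.chunks)

# The recipient has to run an XML parser over the text to get a tree back.
root = xml.dom.minidom.parseString(recovered).documentElement
print(root.tagName)
```

The recipient sees only text until it parses the recovered string a second time, which is the RSS-ish part.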

>> Likewise, since only HTML, MathML and SVG are being integrated--not
>> arbitrary other vocabularies--only HTML, MathML and SVG children of
>> annotation-xml are catered for.
> 
> This is of course true, and I suppose in a way all the comments that you have fielded from different people are just special cases of queries where this restriction is seen as unnecessarily restrictive.

I'm not viewing this from the point of view of whether it's unnecessarily restrictive. I'm viewing this from the point of view of the marginal benefit and marginal cost. The complexity cost of adding more XML-like subtree parsing for arbitrary XML in <annotation-xml> seems rather large compared to the benefit. I consider benefit on the Web scale here, so the "average" benefit is low if the feature wouldn't typically be used even if it were perceived to be super-important by a handful of specialists.

>> It turns out that we already have that! It's called XHTML5 and the
>> mode switch is the Content-Type: application/xhtml+xml HTTP header.
>> Even better than some yet-to-be-defined HTML.next mode, it's already
>> supported by the latest versions of the top browsers (if you count
>> IE9 as the latest version of IE).
> 
> 
> Traditionally of course using application/xhtml+xml has been problematic due to lack of support in IE (and that will still be the case in practice for some time as it takes a while for old IEs to die).

Yes, but a yet-to-be-defined HTML.next mode wouldn't be supported by IE9, IE < 9 or the current installed base of non-IE browsers. However, application/xhtml+xml already works in IE9 and the current installed base of non-IE browsers.
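
As a sketch of what "the mode switch is the Content-Type header" means for a consumer, here is the decision in Python. The function name and the three return values are invented for the example, and it deliberately ignores other XML media types.

```python
from email.message import Message


def parsing_mode(content_type_header):
    """Pick a parser from the Content-Type header alone, not from the markup."""
    message = Message()
    message["Content-Type"] = content_type_header
    media_type = message.get_content_type()  # lowercased, parameters stripped
    if media_type == "application/xhtml+xml":
        return "xml"
    if media_type == "text/html":
        return "html"
    return "other"


print(parsing_mode("application/xhtml+xml; charset=utf-8"))
print(parsing_mode("text/html"))
```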

> But even if we assume IE9, there are many other reasons why text/html is more convenient. I have a blog on blogger for example, as far as I know I can't control the mime type used for pages there. Many other content management systems are similarly (and not at all unreasonably) text/html based.

Yes, this is one of the arguments in favor of continuing to evolve text/html.

> There is (and will continue to be) a natural desire to use the xml toolset to generate content that is served as text/html.
> Currently this is rather fragile and error prone, unless you use a dedicated html serialiser at the end of the chain. It's probably not much worse in fact than html4, but it could (perhaps) have been made better if that had been part of the html5 design criterion (which I suspect it wasn't).

It doesn't seem plausible to change HTML so much that anything that an XML serializer could legitimately produce would parse right. If we don't get that far, a text/html-safe serializer is needed anyway and tweaking the details isn't much of a win.

> It seems to be very common to use xhtml syntax on pages served as text/html
> 
> http://www.w3.org/
> 
> for example or
> 
> http://www.drupal.org.uk/

Drupal has been believing the XHTML2 WG advocacy (RDFa) even after HTML5 was brought into the W3C. As has the W3C itself. I think it's not a useful use of effort to try to bail out authors who go out of their way to look away from HTML5 into the XHTML2 WG land where specs were reviewed for processing as XML but were silently condoned or even pushed for deployment in text/html nonetheless.

> or ...
> 
> Currently this is just an error waiting to happen (for example try mouse-ing over paragraphs in
> 
> http://www.w3.org/TR/2009/REC-MathML2-20090303/chapter1.html#intro.notation
> 
> )

In 2009 you should have already known better than to serve <a/> as text/html. :-( I don't deny that this is a problem, but it's a problem whose parser solution would cause other problems, so I'd rather continue with solving the problem with counter-propaganda than by changing HTML parsing.

On Jan 4, 2011, at 21:51, John Cowan wrote:

> Henri Sivonen scripsit:
> 
>> On Dec 20, 2010, at 17:50, David Carlisle wrote:
>> It sure has. Hixie ran an analysis over a substantial quantity of
>> Web pages in Google's index and found existing text/html content that
>> contained an <svg> tag or a <math> tag. The justification is making
>> the algorithm not break a substantial quantity of pages like that.
> 
> A number would be nice.  One person's "substantial" is another person's
> "trivial", unfortunately.

I don't have that number. The number might be buried somewhere in the public-html archives or the #whatwg IRC logs.

>>> Editing tools also use nsgmls (perhaps just in the background)
>>> It isn't really true to say it is "just the w3c validator".
>> 
>> Which tools? Is the plural really justified or is this about one
>> Emacs mode?
> 
> You are confusing nsgmls itself with the Emacs mode (which employs
> nsgmls).  Nsgmls is a stand-alone SGML validator that outputs an
> ESIS equivalent to the document being validated.  ESIS is a textual
> representation of SAX-style events, one line per event.  It's the core
> of any reasonably modern SGML system.

I'm aware that it's a reusable parser. But is any SGML system reasonably modern anymore? More seriously: How substantial is the population of Web authors whose authoring workflow depends on nsgmls and who'd use currently nsgmls-incompatible HTML5 features if they were compatible? And further: How does this population partition into Emacs users and others?

>> More precisely, my (I'm hesitant to claim this as a general HTML5
>> world view) world view says that using vocabularies that the receiving
>> software doesn't understand is a worse way of communicating than using
>> vocabularies that the receiving software understands. (And if you
>> ship a JavaScript or XSLT program to the recipient, the interface
>> of communication to consider isn't the input to your program but
>> its output. For example, if you serve FooML with <?xml-stylesheet?>
>> that transforms it to HTML, you are effectively communicating with
>> the recipient in HTML--not in FooML.)
> 
> This argument strikes me as a defense of putting arbitrary XML on the
> wire, since it is not (in your sense of the term) the interface of
> communication.

Only if you don't care about recipient-side performance (i.e. you are OK with making your recipients burn CPU cycles and RAM running your program) and you only care about supporting recipients that have the facilities for running your program.

Browsers have the facilities for running JavaScript or XSLT programs, but we are often reminded about non-browser consumers of Web content.

>> More to the point, DocBook is not XHTML+MathML if we consider that
>> to mean "XHTML and MathML and nothing more". If you aren't allowed to
>> dump DocBook content as a child of an HTML element, it doesn't really
>> make sense to enable dumping it inside annotation-xml.
> 
> However, if the day came in which DocBook was an equal-partner vocabulary
> (unlikely as that may seem, stranger things have already happened),
> we would have to add yet another hack to make it work inside MathML.

I think the probability of DocBook becoming an "equal-partner vocabulary" is so small that I'd put engineering for the situation into the "let's cross that bridge when we get there" department.

> It is one thing to say it's not valid HTML to incorporate a foreign
> vocabulary inside MathML-in-HTML annotations.  It's another thing to
> ensure that such vocabularies are already broken.

It doesn't make much sense to expend engineering effort to make yet-to-be-written invalid content non-broken. (It does make sense to expend engineering effort to make invalid legacy content non-broken if it was non-broken in browsers that dominated the market at the time of content creation.)

>> There are security incentives that work against starting to repair
>> broken JavaScript where "broken" is what's broken per ES3. However, I
>> wouldn't be at all surprised if we ended up in a situation where every
>> vendor has an incentive not to enforce the ES5 Strict Mode in order to
>> "work" with more Web content than a competing product that halts on
>> Strict Mode violations and the ES5 Strict Mode effort collapsed.
> 
> Strict mode is a programmer choice, not an implementer choice.  The code
> has to contain a "use strict" directive.

It's not really an *informed* choice if "use strict" ends up in a JS program by cargo cult. (And on the Web, stuff ends up in places by cargo cult.)

Here's the scenario:
 1) Developer writes a JavaScript program.
 2) The developer feeds the program to JSLint.
 3) JSLint tells the developer to use "use strict".
 4) The developer adds "use strict".
 5) The developer tests the code in a browser that doesn't implement Strict Mode. The code works.
 6) The developer deploys the code.
 7) The code breaks in a browser that implements Strict Mode.

A variation of the scenario is that at step #4 the resulting program would work in the Strict Mode, but "use strict" is added to the top level of the JS file and a later optimization process concatenates multiple .js files for minification and then "use strict" applies to everything that was concatenated into one file.
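
The concatenation variation can be sketched as a check that a build step might run. This is a hypothetical helper written in Python, not part of any real minifier, and its notion of a file-level directive is simplified to "the first line that is neither blank nor a // comment is the 'use strict' string".

```python
import re

# Simplified directive detection; real directive prologues allow more forms.
DIRECTIVE = re.compile(r"""^\s*(['"])use strict\1\s*;?""")


def has_file_level_use_strict(source):
    """True if the first non-blank, non-comment line is the strict directive."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        return bool(DIRECTIVE.match(stripped))
    return False


def concatenate(sources):
    """Naive concatenation, as an optimization step might do before minifying."""
    return "\n".join(sources)


def strictness_changes(sources):
    """Indexes of files whose mode differs once they are part of the bundle."""
    bundle_is_strict = has_file_level_use_strict(concatenate(sources))
    return [
        index
        for index, source in enumerate(sources)
        if has_file_level_use_strict(source) != bundle_is_strict
    ]


strict_lib = '"use strict";\nfunction f() { return 1; }'
legacy = "function g() { x = 1; return x; }"  # assigns to an undeclared variable

print(strictness_changes([strict_lib, legacy]))
print(strictness_changes([legacy, strict_lib]))
```

With the strict file first, the legacy file silently becomes strict code; with the order reversed, the directive in the second file is no longer at the top of the program and stops having any effect.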

See
https://bugzilla.mozilla.org/show_bug.cgi?id=593963
https://bugzilla.mozilla.org/show_bug.cgi?id=579119
https://bugzilla.mozilla.org/show_bug.cgi?id=587249
https://bugzilla.mozilla.org/show_bug.cgi?id=607188
https://bugzilla.mozilla.org/show_bug.cgi?id=614195
https://bugzilla.mozilla.org/show_bug.cgi?id=615659

One should expect a similar pattern to arise if we introduced an in-band switch for text/html pages to opt into Draconian XML parsing.

On Jan 4, 2011, at 23:40, John Cowan wrote:

> Kurt Cagle scripsit:
> 
>> One other possibility that comes to mind is simply to create a
>> <foreignContent> element in HTML5. SVG has a similar element (usually
>> for holding HTML, oddly enough). This would simply tell the processor
>> to not display the content in question, not to parse it, not to do
>> anything with it.
> 
> That's what "script" does, and I see no reason to duplicate it.

I agree.

> On the other hand, having an <xml> element which says that everything up
> to the matching </xml> is well-formed XML (without prologue or epilogue)
> and should be incorporated into the DOM seems a good idea to me.  If there
> are legacy concerns with <xml>, use <well-structured-extension> or for
> that matter <scritchifchisted> instead.

If the purpose of the data island is to be input to a JavaScript program on the page, why should the HTML parsing layer give XML some kind of most favored status as input to JavaScript programs? Why not just make a call to DOMParser be the first step in programs that want to use XML?
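
A sketch of that division of labor, written in Python rather than browser JavaScript, with xml.dom.minidom standing in for DOMParser. The page markup, the id, and the type value on the script element are made up for the example.

```python
from html.parser import HTMLParser
import xml.dom.minidom

PAGE = """<!doctype html>
<title>Data island demo</title>
<script type="application/xml" id="island">
<inventory><item sku="a1">Widget</item><item sku="b2">Gadget</item></inventory>
</script>
<p>Visible content.</p>
"""


class ScriptIslandExtractor(HTMLParser):
    """Grabs the raw text of one <script> element; the HTML layer never parses it."""

    def __init__(self, island_id):
        super().__init__()
        self.island_id = island_id
        self.capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.island_id:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def load_island(page, island_id):
    """First step of the program: hand the island's text to an XML parser."""
    extractor = ScriptIslandExtractor(island_id)
    extractor.feed(page)
    extractor.close()
    return xml.dom.minidom.parseString("".join(extractor.chunks).strip())


document = load_island(PAGE, "island")
items = document.getElementsByTagName("item")
print([item.getAttribute("sku") for item in items])
```

The same pattern would work for JSON or any other format the program prefers, which is the point: the HTML parser only has to deliver text.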

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 5 January 2011 10:32:24 UTC