
Re: Revised HTML/XML Task Force Report

From: Robin Berjon <robin@berjon.com>
Date: Thu, 14 Jul 2011 16:50:00 +0200
Cc: "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <FA1DBF7A-0724-4977-8F9B-488651017048@berjon.com>
To: Larry Masinter <masinter@adobe.com>
On Jul 14, 2011, at 16:14 , Larry Masinter wrote:
>> What would that accomplish that putting an HTML parser at the front of your XML processing pipeline won't achieve far more cheaply and without requiring tinkering with undeployed media types?
> 
> If you're going to make a case for choosing one of the options over another based on relative deployment cost ("far more cheaply"), I think you need to provide some analysis of deployment costs.

If we were in a situation in which both options were theoretical that would be the case, but we're not. On the one hand we have existing content (HTML), existing software (HTML parsers), existing standards (most notably the Infoset coercion rules for HTML), and an existing XML tool chain. When you get HTML content you run it through an HTML parser in the exact same way that you run XML content through an XML parser or EXI content through an EXI parser. At the other end, if you want to use an XML tool chain you can: XPath, XSLT, XQuery, you name it, it will all just work and will be able to process a predictable Infoset. This seems to me to be a perfectly reasonable answer to the use case that the report was considering, namely "How can an XML toolchain be used to consume HTML?"
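A minimal sketch of that pipeline, using only the Python standard library. The toy tree builder below is an illustrative stand-in for a real HTML5 parser (such as html5lib), which is what a production pipeline would use to get the spec-defined Infoset coercion:

```python
# Sketch: an HTML parser in front of an XML tool chain.
# Assumption: html.parser.HTMLParser (stdlib) stands in for a real
# HTML5 parser; the point is only the shape of the pipeline.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "link", "meta"}  # no end tag in HTML

class TreeBuilder(HTMLParser):
    """Turn HTML tag soup into an ElementTree the XML tools can query."""
    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, dict(attrs))
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        el = self.stack[-1]
        if len(el):  # text after the last child element
            el[-1].tail = (el[-1].tail or "") + data
        else:
            el.text = (el.text or "") + data

# Unquoted attributes and a void <br>: not well-formed XML, but an
# HTML parser recovers a predictable tree from it.
builder = TreeBuilder()
builder.feed("<p>Hello<br>world</p><p class=greet>Hi</p>")
tree = builder.root

# The XML side then just works, e.g. an XPath-style query:
classes = [p.get("class") for p in tree.findall(".//p")]
print(classes)  # [None, 'greet']
```

The same recovered tree can be handed to XSLT or XQuery engines; the HTML parser's only job is to make the Infoset predictable before the XML tools see it.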

Now if it so happens that you parse the HTML content with an XML parser, then unless it is polyglot and limits itself to a given subset of HTML, you won't get the same Infoset out of it. But then again, you're using the wrong tool. It might hypothetically be possible to craft a GIF such that it would decode in a PNG processor. There may even be cases in which that's useful. But trying to marry them doesn't seem useful.

Likewise polyglot can be useful in some cases. But it's not a general solution today, and if we're going to commit to the cost of making it possible for it to be a general solution then it had better bring value that makes that work worth it compared to the existing solution, which is to just grab a parser off GitHub.

I thought we'd settled this with binaryXML-30: we can live and prosper with multiple universal formats. Universal doesn't mean that there is only one. It means that it's everywhere.

> What is the cost of putting an HTML parser in front of every XML processing pipeline?

Why would you want to put an HTML parser in front of every XML processing pipeline? Most of those will never see HTML.

> Have people done this? What has been their experience? Does it really work? 

It's an age-old trick. People have been putting non-XML parsers in front of their XML tool chains ever since there have been standard (or standard enough) XML APIs. See for instance this article from ten years ago: http://www.xml.com/pub/a/2001/09/19/sax-non-xml-data.html.
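The trick that article describes, emitting SAX events from non-XML input so the rest of the XML tool chain never knows the difference, can be sketched in a few lines of stdlib Python (the CSV source here is purely an illustrative stand-in for any non-XML format):

```python
# Sketch of the age-old SAX trick: drive an XML event consumer from
# non-XML input (here CSV, purely illustrative).
import csv
import io
from xml.sax.saxutils import XMLGenerator

source = io.StringIO("a,1\nb,2\n")
out = io.StringIO()

# XMLGenerator is just one ContentHandler; any downstream XML
# consumer that speaks SAX would do.
gen = XMLGenerator(out, encoding="utf-8")
gen.startDocument()
gen.startElement("rows", {})
for row in csv.reader(source):
    gen.startElement("row", {})
    for field in row:
        gen.startElement("field", {})
        gen.characters(field)
        gen.endElement("field")
    gen.endElement("row")
gen.endElement("rows")
gen.endDocument()

print(out.getvalue())
```

Swap the CSV reader for an HTML parser that fires the same events and you have exactly the arrangement described above: the XML pipeline consumes a predictable stream regardless of what the bytes on the wire looked like.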

Or am I missing something in your questions? It hardly seems that something which was already a common trick for Perl hackers ten years ago should still be considered a problem.

The issue previously was that there were no rules for parsing HTML, which made it unpredictable. But that is no longer the case.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
Received on Thursday, 14 July 2011 14:50:38 GMT
