
Re: Revised HTML/XML Task Force Report

From: Robin Berjon <robin@berjon.com>
Date: Fri, 15 Jul 2011 17:38:54 +0200
Cc: Larry Masinter <masinter@adobe.com>, "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <64B989AE-68E7-44C3-9C54-40D211010600@berjon.com>
To: Eric J. Bowman <eric@bisonsystems.net>

Hi Eric,

On Jul 14, 2011, at 18:20 , Eric J. Bowman wrote:
> Robin Berjon wrote:
>> Likewise polyglot can be useful in some cases. But it's not a general
>> solution today, and if we're going to commit to the cost of making it
>> possible for it to be a general solution then it better bring value
>> that makes that work worth it compared to the existing solution which
>> is to just grab a parser off GitHub.
> Disagree.  HTML wrapped in Atom serialized as XML to allow Xpath access
> into the wrapped content, is quite a common design pattern and requires
> the markup to be polyglot.

The first thing I'd like to note is that this is not the use case we were discussing. We had been looking at "How can an XML toolchain be used to consume HTML?" whereas this is "How can islands of HTML be embedded in XML?" It's an interesting use case nonetheless, and I think the choice of solution boils down to how general you need it to be.

Speaking personally, I actually use XPath on Atom documents that contain polyglot HTML. But I don't need the polyglotism to be general. So long as I can get at an <h1>, the first <p> of a <section>, or a @lang somewhere, it works, and I'm happy to say that I don't recall bumping into a problem there. In other cases it would break down. For instance, I have a decent amount of HTML that contains fragments like the following:

    <script type='application/foo-template' id='foo'><foo><dahut/></foo></script>

I don't see an easy way of making that polyglot: an HTML parser treats the content of <script> as raw text, whereas an XML parser turns <foo> and <dahut/> into elements. But I also don't see many use cases in which one would use this specific pattern and also require it to be embedded in XML and to parse the same way as it does in HTML. So to reiterate my initial comment in more detail, we have multiple cases:

     - When limited polyglotism is enough. In this case, I think we already have pretty much most of what we need. It would be interesting to get into the details of what could perhaps be improved, but that is likely best formulated as comments on the Polyglot document.

     - When limited polyglotism is not enough, but we require mixing XML and HTML anyway. Here we have two solutions:

         - Design XML.NG so that it can process both existing XML and HTML content in such a way as to guarantee near-perfect polyglotism. I'm not opposed to this, but I think everyone agrees that this is not a minor undertaking. If people want to get to work on this then I wish them the best of luck, but I would be leery of committing major W3C resources to this project.

         - Find ways of processing HTML as HTML whenever it needs to be. This includes parsing HTML as such and using Infoset coercion[0] for the 2.1 use case. For the HTML-embedding use case it also includes building small, simpler, more manageable bridges. For instance, making document() parse text/html input as per HTML5, or including a parse-html() function so that you could embed HTML as a CDATA section and process it usefully anyway (in fact the former could be parse-html(unparsed-text($uri-sequence))).
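
As a rough illustration of the kind of small bridge described above, here is a minimal Python sketch of a parse-html()-style helper applied to HTML carried in a CDATA section of an Atom entry. Everything in it is an assumption made for illustration: the names parse_html and _TreeBuilder, the sample entry, and the use of Python's standard html.parser module, which is a lenient tokenizer and not an HTML5-conformant parser. A real bridge would use the HTML5 parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM = "{http://www.w3.org/2005/Atom}"
# HTML void elements never take an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    """Hypothetical helper: builds an ElementTree from HTML (simplified, not HTML5)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    """Stand-in for the proposed parse-html(): HTML string in, element tree out."""
    builder = _TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


# Hypothetical Atom entry whose content is HTML (not XML) inside CDATA.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <content type="html"><![CDATA[<section><h1>Title</h1><p lang=fr>Bonjour<br>monde</p></section><script type='application/foo-template' id='foo'><foo><dahut/></foo></script>]]></content>
</entry>"""

entry = ET.fromstring(ENTRY)                      # outer document: parsed as XML
html = parse_html(entry.findtext(ATOM + "content"))  # island: parsed as HTML
first_p = html.find("./section/p")
template = html.find("./script")
```

The outer entry goes through the XML parser and only the island goes through the HTML one, so the unquoted attribute, the unclosed <br> and the <script> template survive as HTML would see them, while path queries still work on the result.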

I personally think that a "Small Bridges" project would be a nice recommendation for the convergence TF to make where next steps are concerned.

> I've been advocating the polyglot approach for a long time, now (just
> not calling it that).  My advice to everyday Web developers is to use
> XHTML 1.1 and application/xhtml+xml to develop sites, and a few lines
> of XSLT to convert to HTML 4.01 or XHTML 1.0 for publication as text/
> html.

I think that it depends heavily on the type of site that one is developing. I used polyglot+XSLT on some sites that are primarily static and content oriented. I've found it to be less useful for application-oriented sites. YMMV.

> I don't see where an HTML parser needs to enter into it, except of
> course for the browser

But the use case you've described is very specific. I'm certainly not saying that people should be forced to use an HTML parser whenever they see something that vaguely smells of HTML. If polyglot's limits work for you, then I don't think there actually is a problem to solve. I'm simply saying that trying to lift the current limits on polyglotism is a major undertaking, and one in which I see very limited value.

> , but I do see a considerable cost if oXygen and
> every other XML-based toolchain used to maintain the installed base of
> XML is required to increase its complexity with another parser for the
> same markup.

No offense, but I'm having a hard time buying a complexity argument for tools that include XML Schema validation :) Also, it's not another parser for the same markup. It's another parser for different markup, different markup that can't be made the same.

>  Polyglot makes sense, as I'm hardly alone in using Atom as
> a wrapper for HTML content, serialized as XHTML so I don't lose Xpath
> access into that content.

Which is fine, but only works because processing the subset of HTML that you are using as XML doesn't break it.
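
To make that concrete, here is a small Python sketch using the standard xml.etree.ElementTree module (which supports only a subset of XPath). The sample entry and variable names are made up for illustration. It shows the polyglot subset behaving well under an XML parser, and the script-template fragment from earlier in this message being read differently from how an HTML parser would read it.

```python
import xml.etree.ElementTree as ET

NS = {
    "a": "http://www.w3.org/2005/Atom",
    "h": "http://www.w3.org/1999/xhtml",
}

# Hypothetical Atom entry wrapping XHTML content.
FEED = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml">
      <section>
        <h1>Dahuts</h1>
        <p lang="fr">Premier paragraphe.</p>
        <p>Second paragraph.</p>
      </section>
      <script type="application/foo-template" id="foo"><foo><dahut/></foo></script>
    </div>
  </content>
</entry>"""

entry = ET.fromstring(FEED)
div = entry.find("a:content/h:div", NS)

# The subset that works: an <h1>, the first <p> of a <section>, a @lang.
heading = div.findtext("h:section/h:h1", namespaces=NS)
first_p = div.find("h:section/h:p[1]", NS)
lang = first_p.get("lang")

# The part that does not: under XML rules the template is parsed into
# child elements, whereas an HTML parser would keep it as script text.
script = div.find("h:script", NS)
template_as_text = script.text      # None: there is no text node, only a child
template_children = [child.tag for child in script]
```

Here the <foo> element ends up as an element in the XHTML namespace rather than as the template string that a script running in a browser would expect to read back.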

> Or am I a bad Web developer because I didn't just use Javascript?  ;-)
> I'd prefer if the Web, moving forwards, didn't exclude (requiring two
> parses of every file I read would make my architecture untenable)
> perfectly legitimate application architectures as punishment to those
> of us who insist on using XML toolchains in defiance of browser
> vendors' opinions on the matter.

That's certainly not what I've been suggesting. But you're describing a setup that works today, so I'm having trouble figuring out what problem you're complaining about.

[0] http://www.w3.org/TR/html5/the-end.html#coercing-an-html-dom-into-an-infoset

Robin Berjon - http://berjon.com/ - @robinberjon
Received on Friday, 15 July 2011 15:39:21 UTC
