Re: XML payloads in feeds from Sam Ruby on 2011-01-07 (public-html-xml@w3.org from January 2011)

From: Sam Ruby <rubys@intertwingly.net>
Date: Fri, 07 Jan 2011 12:38:41 -0500
To: Henri Sivonen <hsivonen@iki.fi>
CC: public-html-xml@w3.org
Message-ID: <4D274FA1.9030302@intertwingly.net>
On 01/07/2011 11:54 AM, Henri Sivonen wrote:
> On Jan 7, 2011, at 14:24, Sam Ruby wrote:
>
>> You are confusing one of the problems with the RSS 2.0 title
>> element (which is allegedly backwards compatible with Netscape's
>> 0.91 which explicitly described title as plain text and UserLand's
>> 0.91 which exclusively used title for HTML) with RSS 2.0's
>> description element.
>
> No, I meant <description> as a child of <item>.
>
>> To help anchor this discussion, here is an actual bug report:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=602304
>
> It seems to me that there are two problems here. Since MathML isn't
> *arbitrary* XML but is a Web vocabulary (where (X)HTML, MathML, SVG
> and ARIA are Web vocabularies), this isn't an issue with RSS and
> arbitrary XML.
>
> 1) Having to hack a system that was designed for text/html to emit
> application/xhtml+xml in order to use MathML. Once HTML5 parsers are
> deployed on the consumer side, there's no need to switch from
> text/html to application/xhtml+xml in order to use MathML.

While that is a true statement, it generally is not the case that 
everything that can be done in xhtml can be done in html (or vice versa, 
for that matter).

> 2) Planet Venus not parsing HTML fragments using the HTML5 fragment
> parsing algorithm. Once an HTML5 parser is available for the
> programming language Planet Venus is written in, Planet Venus can
> start parsing HTML fragments using the HTML5 fragment parsing
> algorithm.

Planet Venus is written in Python, and there is an excellent HTML5 
parser available in the form of html5lib.  In fact, Venus can and does 
use html5lib, just not consistently.  It turns out that Venus depends on 
the feedparser library, which does its own parsing using sgmllib. 
Replacing sgmllib with html5lib turns out to be a non-trivial effort and 
actually changes the behavior of the feedparser.  You can see the 
results of my analysis here:

http://intertwingly.net/blog/2010/12/30/Dealing-with-HTML-in-Feeds

A quick summary: it is doable, but it is a fair amount of work, and will 
break an unknown number of things, many of which we will only find out 
about after a release is made and it gets into the hands of real users. 
Remember: the feedparser is used in many places, not just in Venus.
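
To make the sgmllib versus HTML5 parsing difference concrete, here is a 
minimal sketch using only the Python 3 standard library.  It is not 
feedparser or Venus code.  The item markup is hypothetical, and 
html.parser stands in for sgmllib (which was removed in Python 3) as an 
example of a tokenizer that reports bare tag names.  The last step adds 
the MathML namespace by hand, which is roughly what HTML5 tree 
construction would do for the math element on its own.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical RSS item: the description carries escaped HTML
# that contains inline MathML.
ITEM = (
    "<item><description>"
    "&lt;p&gt;Area: &lt;math&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;/math&gt;&lt;/p&gt;"
    "</description></item>"
)


class StartTags(HTMLParser):
    """Record start tag names exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


# Step 1: the XML parser unescapes the payload into a plain string.
payload = ET.fromstring(ITEM).findtext("description")

# Step 2: a tag-soup tokenizer sees bare names; nothing marks
# math or mi as belonging to the MathML vocabulary.
soup = StartTags()
soup.feed(payload)
soup.close()

# Step 3: to emit XHTML that a browser treats as MathML, the elements
# must carry the namespace. Here it is declared by hand for illustration.
xhtml = payload.replace("<math>", '<math xmlns="%s">' % MATHML_NS)
namespaced = [el.tag for el in ET.fromstring(xhtml).iter()]

print(soup.names)
print(namespaced)
```

The first list holds plain names, while the second has math and mi 
qualified with the MathML namespace.  Closing that gap inside a library 
built around the first model is the non-trivial part.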

Add to the mix the fact that Python itself is in a transition period 
from Python 2 to Python 3, and there are a number of logistical 
challenges that need to be solved simultaneously.

We can certainly continue this discussion at any level of depth you 
like, but the point remains that offering partial solutions (like your 
point 1) and all but ignoring the quite significant logistical 
challenges that many people face (your point 2) doesn't further the 
discussion.

In many ways, with Venus we have a best-case scenario: somebody who is 
well versed in both HTML5 and XML, one who not only has access to all of 
the source but direct commit rights to each piece (feedparser, html5lib, 
and venus), and yet I am finding this to be quite a challenge to pull off.

I submit that in many cases people are dealing with off-the-shelf 
software and a backlog of customer requirements.  I suggest that it will 
likely be quite a while before Google Reader and NetNewsWire become 
MathML friendly.  I'll go further and say that most of the long tail of 
other readers will never get upgraded.

And I haven't even begun to touch on some of the secondary 
characteristics of some of these components.  Producing the xhtml for my 
planet from the cached data (formatted as Atom entries) takes under 200 
milliseconds on my machine.  Simply converting the xhtml to html using 
html5lib takes an additional 1.2 seconds.  In my case, this is not a 
problem as it is all done off a cron job, but there may be some very 
real scenarios where this is simply not acceptable.

Part of the appeal of the feedparser to date is that it is a single 
source file that has virtually no hard dependencies.  This is something 
that would be radically changed by creating a hard dependency on html5lib.

Over time the parsers will undoubtedly improve and html5 parsing will 
become a part of the standard libraries for many programming languages, 
but the fact remains that at the present time very little is available 
to people who write scripts that compares to xsltproc and libxml2.

- Sam Ruby
Received on Friday, 7 January 2011 17:39:16 UTC