Re: HTML/XML TF Report glosses over Polyglot Markup

Robin Berjon wrote:
> 
> I think it's safe to say that you can't throw arbitrary HTML content 
> from off the Web at an XML parser and expect it to work.
> 

But is that *what* anyone expects to work?  Or is this why popular
XSLT libraries are configurable to read raw HTML?  I've thrown arbitrary
real-world HTML at XML toolchains using HTML Tidy, TagSoup, Resin httpd,
and libxslt to start the chain, and had them all work as expected
despite the fact that none of the HTML source documents parsed as XML.
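
For what it's worth, here's a minimal sketch of the kind of thing I
mean, using lxml (Python bindings to libxml2/libxslt) as the engine.
The markup and the class name are made up for illustration, not from
any particular page:

    # Lenient HTML parsing feeding an XML toolchain: the input below is
    # not well-formed XML (bare &, unquoted attribute, unclosed tags),
    # but XPath works on the recovered tree anyway.
    from lxml import etree, html

    raw = b"<html><body><p class=msg>unquoted & unclosed<br></body></html>"

    tree = html.fromstring(raw)          # HTML parser, not an XML parser
    for p in tree.xpath('//p[@class="msg"]'):
        print(p.text_content())

    # The recovered tree can even be re-serialized as well-formed XML:
    print(etree.tostring(tree, method="xml"))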

So I don't think parsing Web pages as XML gives an accurate impression
of how many HTML documents are effectively being used as polyglot by
screen-scrapers using XPath and other XML tools in libraries capable of
consuming invalid HTML.  I don't know whether that's something that can
be assessed by crawling the Web, but I do know the capability of some
XML tools to read HTML didn't come about without significant developer
demand.

>
> This is no reflection on the value of polyglot mind you. But it is
> the reality of the question that that report was responding to. If
> you want to process HTML using an XML toolchain, put an HTML parser
> in front of it.
> 

That advice still doesn't make sense to me.  I used to do it that way,
with Tidy and TagSoup, but I've found it's simpler to just use an XSLT
engine capable of reading raw HTML, since I'm using XSLT/Schematron/
RELAX NG to apply my own input validation rules where I'm accepting HTML
markup as application input.  Why add another tool to the chain?
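
Roughly what that looks like, again as a sketch with lxml standing in
for any libxslt-based engine that reads raw HTML (the stylesheet here
is a throwaway example, not my actual validation rules):

    # Run an XSLT transform directly over leniently parsed HTML, with
    # no separate Tidy/TagSoup step in front of the chain.
    from lxml import etree, html

    transform = etree.XSLT(etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <headings>
          <xsl:for-each select="//h1|//h2">
            <heading><xsl:value-of select="."/></heading>
          </xsl:for-each>
        </headings>
      </xsl:template>
    </xsl:stylesheet>
    """))

    page = html.fromstring(b"<html><h1>Intro<h2>Details</html>")
    print(str(transform(page)))          # invalid HTML in, XML out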

Maybe the polyglot document could mention that XML toolchains exist
which accept invalid HTML as input -- the proliferation of this feature
seems to confirm the demand to process HTML as XML, and reinforces the
need for polyglot, IMO.

-Eric

Received on Monday, 3 December 2012 23:37:15 UTC