- From: Eric J. Bowman <eric@bisonsystems.net>
- Date: Mon, 3 Dec 2012 16:36:25 -0700
- To: Robin Berjon <robin@w3.org>
- Cc: "Henry S. Thompson" <ht@inf.ed.ac.uk>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, www-tag@w3.org
Robin Berjon wrote:
>
> I think it's safe to say that you can't throw arbitrary HTML content
> from off the Web at an XML parser and expect it to work.
>

But is that *what* anyone expects to work? Or is this why popular XSLT libraries are configurable to read raw HTML? I've thrown arbitrary real-world HTML at XML toolchains using HTML Tidy, TagSoup, Resin httpd, and libxslt to start the chain, and had them all work as expected despite the fact that none of the HTML source documents parsed as XML.

So I don't think parsing Web pages as XML gives an accurate impression of how many HTML documents are effectively being used as polyglot by screen-scrapers that use XPath and other XML tools in libraries capable of consuming invalid HTML. I don't know that this is something that can be assessed by crawling the Web, but I do know that the capability of some XML tools to read HTML didn't come about without significant developer demand.

>
> This is no reflection on the value of polyglot mind you. But it is
> the reality of the question that that report was responding to. If
> you want to process HTML using an XML toolchain, put an HTML parser
> in front of it.
>

Advice which still doesn't make sense to me. I used to do it that way, with Tidy and TagSoup, but have found it's simpler to just use an XSLT engine capable of reading raw HTML, since I'm using XSLT/Schematron/RELAX NG to apply my own input validation rules where I'm accepting HTML markup as application input. Why add another tool to the chain?

Maybe the polyglot document could mention that XML toolchains exist which accept invalid HTML as input. Proliferation of this feature seems to confirm the demand to process HTML as XML, and reinforces the need for polyglot, IMO.

-Eric
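For illustration only, here is a minimal sketch of the "HTML parser in front of an XML toolchain" arrangement discussed above, written against Python's standard library. It is not how Tidy, TagSoup, or libxslt work internally; the class name `SoupToTree`, the helper `html_to_tree`, the synthetic `root` wrapper element, and the sample markup are all assumptions made up for this example, and the error recovery is far cruder than what a real HTML parser does.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never take an end tag, so they are never left "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: build an ElementTree from non-well-formed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")   # synthetic wrapper element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, leave it closed.
        attrib = {name: (value or "") for name, value in attrs}
        ET.SubElement(self.stack[-1], tag, attrib)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a closed child belongs in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_tree(markup):
    parser = SoupToTree()
    parser.feed(markup)
    parser.close()
    return parser.root


# Unquoted attribute, unclosed <b>, bare <br>, unclosed final <p>:
# none of this parses as XML.
soup = '<p class=intro>Hello <b>world<br><a href="/x">link</a></p><p>second'
tree = html_to_tree(soup)

# From here on, ordinary XML tooling applies (ElementTree's XPath subset).
hrefs = [a.get("href") for a in tree.findall(".//a[@href]")]
print(hrefs)  # ['/x']
```

The point of the sketch is only that the XPath query at the end never sees the tag soup; whether the recovery step lives in a separate tool or inside the XSLT engine itself is exactly the design question raised in the message.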
Received on Monday, 3 December 2012 23:37:15 UTC