- From: Robin Berjon <robin@w3.org>
- Date: Mon, 03 Dec 2012 13:48:40 +0100
- To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
- CC: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, www-tag@w3.org
On 03/12/2012 12:02 , Henry S. Thompson wrote:
> Robin Berjon writes:
>> Saying "polyglot" here just doesn't help: very little real-world
>> content uses it. Note that the section clearly looks at polyglot and
>> gives a clear reason for not using it in this case.
>
> That depends on where you look. I know of a number of companies whose
> products produced, by design, HTML-compatible XHTML, which we would
> now call polyglot, precisely because it gave them the ability to
> post-process with XML tools while at the same time serving to IE6
> clients confidently. The parallel requirements aren't going away, and
> polyglot HTML5 will serve them very well.

I know there is polyglot in the wild, I've used it in the past. But there's a big difference between "some people use it" and "it's used enough that one can build a useful strategy relying on it for arbitrary content".

Taking the Paciello Group Dataset[0] which has the index page from 8881 of the top 10k sites (this skews the data towards sites that are actively maintained rather than containing legacy content, and pages that are paid more attention to than deeper ones, which probably ought to help XML here more than not), I tried to parse them as XML. Note that I'm just looking for something that won't blow up when fed to an XML parser, not a document that's been carefully crafted to be proper polyglot and produce a sufficiently equivalent DOM.

I get the following result: Out of 8881, 569 parse as XML (6.41%) whereas 8312 blew up (93.59%).

I think it's safe to say that you can't throw arbitrary HTML content from off the Web at an XML parser and expect it to work.

This is no reflection on the value of polyglot mind you. But it is the reality of the question that that report was responding to. If you want to process HTML using an XML toolchain, put an HTML parser in front of it.
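
For anyone wanting to repeat this kind of check, here is a minimal sketch of one way to do it. It is not the script used for the numbers above; it assumes the pages are already available as byte strings, and the names `is_well_formed_xml` and `tally` are made up for illustration. It uses the non-validating parser in Python's standard `xml.etree.ElementTree`, so it only asks whether the parser accepts the document, not whether the result is proper polyglot.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if the XML parser accepts `data` without blowing up."""
    try:
        ET.fromstring(data)
    except (ET.ParseError, LookupError, ValueError):
        # ParseError covers well-formedness errors; the other two are
        # caught as a precaution for unusable encoding declarations.
        return False
    return True


def tally(pages):
    """Count how many pages parse as XML; returns (parsed, failed)."""
    parsed = sum(1 for page in pages if is_well_formed_xml(page))
    return parsed, len(pages) - parsed


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage rounded to 2 places."""
    return round(100 * part / total, 2)


if __name__ == "__main__":
    # Hypothetical stand-ins for downloaded index pages.
    samples = [
        b"<html><head><title>t</title></head><body><p>ok</p></body></html>",
        b"<html><body><p>one<br>two</body></html>",      # unclosed <br>
        b"<html><body><input disabled></body></html>",   # valueless attribute
    ]
    parsed, failed = tally(samples)
    total = len(samples)
    print(f"Out of {total}, {parsed} parse as XML ({percent(parsed, total)}%) "
          f"whereas {failed} blew up ({percent(failed, total)}%).")
```

The second and third samples are ordinary HTML that no browser would complain about, which is the sort of thing that trips an XML parser on real-world pages.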

[0] http://www.paciellogroup.com/blog/2012/04/html5-accessibility-chops-data-for-the-masses/

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
Received on Monday, 3 December 2012 12:49:00 UTC