Re: HTML/XML TF Report glosses over Polyglot Markup

On 03/12/2012 12:02 , Henry S. Thompson wrote:
> Robin Berjon writes:
>> Saying "polyglot" here just doesn't help: very little real-world
>> content uses it. Note that the section clearly looks at polyglot and
>> gives a clear reason for not using it in this case.
>
> That depends on where you look.  I know of a number of companies whose
> products produced, by design, HTML-compatible XHTML, which we would
> now call polyglot, precisely because it gave them the ability to
> post-process with XML tools while at the same time serving to IE6
> clients confidently.  The parallel requirements aren't going away, and
> polyglot HTML5 will serve them very well.

I know there is polyglot in the wild; I've used it myself in the past.
But there's a big difference between "some people use it" and "it's
used enough that one can build a useful strategy relying on it for
arbitrary content".

I took the Paciello Group dataset[0], which has the index page from 8881
of the top 10k sites, and tried to parse each page as XML. (This skews
the data towards sites that are actively maintained rather than
containing legacy content, and towards pages that get more attention
than deeper ones, which probably ought to help XML here more than not.)
Note that I'm just looking for something that won't blow up when fed to
an XML parser, not a document that's been carefully crafted to be proper
polyglot and produce a sufficiently equivalent DOM. I get the following
result:

Out of 8881, 569 parse as XML (6.41%) whereas 8312 blow up (93.59%).
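
For illustration only, a well-formedness check of this kind might look
like the sketch below. It is not the script that produced the numbers
above; the helper names (is_well_formed, tally, percent) and the two
sample pages are made up, and it only asks whether Python's stdlib XML
parser accepts the bytes, not whether the page is proper polyglot.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: bytes) -> bool:
    """Return True if the XML parser accepts the bytes without a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def tally(pages):
    """Count how many pages parse as XML and how many do not."""
    ok = sum(1 for page in pages if is_well_formed(page))
    return ok, len(pages) - ok


def percent(part: int, total: int) -> float:
    """Share of the total as a percentage, rounded to two decimals."""
    return round(100 * part / total, 2)


# Two made-up pages (not from the dataset): one XML-friendly, one tag soup
# with an unquoted attribute and unclosed elements.
samples = [
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    b'<body><p>hi<br/></p></body></html>',
    b'<html><head><meta charset=utf-8><title>t</title></head>'
    b'<body><p>hi<br></body></html>',
]

parsed, failed = tally(samples)
print(parsed, "parsed,", failed, "blew up")
```

Run over a directory of saved index pages, the same tally would give
the kind of split reported here.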

I think it's safe to say that you can't throw arbitrary HTML content 
from off the Web at an XML parser and expect it to work.

This is no reflection on the value of polyglot, mind you. But it is the
reality of the question that the report was responding to. If you want
to process HTML using an XML toolchain, put an HTML parser in front of it.
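
A minimal sketch of that arrangement, under loud assumptions: it uses
Python's stdlib html.parser as a stand-in for a real HTML parser, and
the class SoupToTree, the function html_to_etree and the wrapper
element "root" are hypothetical names. It does not implement the HTML5
tree-construction rules (no implied end tags, for instance); it only
shows the shape of the pipeline, with tolerant parsing first and XML
tools afterwards.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Build an ElementTree from tag soup (simplified, not spec-compliant)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]        # currently open elements
        self.last = None                # last closed or void element, if any

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") become empty strings.
        clean = {name: (value if value is not None else "")
                 for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, clean)
        if tag in VOID:
            self.last = elem            # following text becomes its tail
        else:
            self.stack.append(elem)
            self.last = None            # following text becomes its text

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, along with any
        # elements left open inside it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                self.last = self.stack[i]
                del self.stack[i:]
                return

    def handle_data(self, data):
        if self.last is not None:
            self.last.tail = (self.last.tail or "") + data
        else:
            top = self.stack[-1]
            top.text = (top.text or "") + data


def html_to_etree(markup: str) -> ET.Element:
    builder = SoupToTree()
    builder.feed(markup)
    builder.close()                     # flush any buffered text
    return builder.root


soup = "<div class=x>hi<br>there <b>bold</div>"
tree = html_to_etree(soup)
xml_text = ET.tostring(tree, encoding="unicode")
reparsed = ET.fromstring(xml_text)      # now acceptable to an XML parser
print(reparsed.find("div/b").text)
```

For real content a spec-compliant HTML parser would take the place of
SoupToTree; the point is only that the XML tools see the parser's
output rather than the raw page.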

[0] http://www.paciellogroup.com/blog/2012/04/html5-accessibility-chops-data-for-the-masses/

-- 
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Monday, 3 December 2012 12:49:01 UTC