
Re: HTML/XML TF Report glosses over Polyglot Markup (Was: Statement why the Polyglot doc should be informative)

From: Robin Berjon <robin@w3.org>
Date: Mon, 03 Dec 2012 11:56:05 +0100
Message-ID: <50BC8545.1040707@w3.org>
To: David Carlisle <davidc@nag.co.uk>
CC: public-html WG <public-html@w3.org>, www-tag@w3.org
On 03/12/2012 11:49, David Carlisle wrote:
> On 03/12/2012 10:35, Robin Berjon wrote:
>> Case in point: we have a few large HTML datasets at hand which we
>> can use to look at how HTML is used in the wild. But for the most
>> part we tend to be limited to grepping, to some simple indexing, or
>> to parsing them all and applying some ad hoc rules to extract data
>> (which is slow). It would be sweet to be able to just load those
>> into a DB and run XQuery (or something like it) on them. If that
>> were possible, you'd get very fast, high octane analysis with
>> off-the-shelf software (a lot of it open source).
>
> Not directly related to the document at hand, but for the record on the
> list archives, I think it should be noted that this is of course
> possible now. If you hook (say) the validator.nu SAX parser up to any
> Java-based XSLT/XQuery engine, say Saxon, then you can process HTML
> with XML tools today.
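
For concreteness, the hookup looks roughly like this (a sketch from
memory, assuming validator.nu's HtmlParser and Saxon's s9api are on the
classpath; untested, so treat class and method names as approximate):

    import java.io.File;
    import javax.xml.transform.sax.SAXSource;
    import org.xml.sax.InputSource;
    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.XQueryEvaluator;
    import net.sf.saxon.s9api.XdmNode;
    import nu.validator.htmlparser.common.XmlViolationPolicy;
    import nu.validator.htmlparser.sax.HtmlParser;

    public class HtmlXQuery {
        public static void main(String[] args) throws Exception {
            // HtmlParser implements org.xml.sax.XMLReader, so it can stand
            // in for an XML parser anywhere a SAXSource is accepted.
            HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);
            SAXSource source = new SAXSource(parser,
                    new InputSource(new File(args[0]).toURI().toString()));

            Processor proc = new Processor(false);
            XdmNode doc = proc.newDocumentBuilder().build(source);

            // HTML-parsed elements land in the XHTML namespace, hence the
            // default element namespace declaration.
            XQueryEvaluator query = proc.newXQueryCompiler().compile(
                    "declare default element namespace" +
                    " 'http://www.w3.org/1999/xhtml'; count(//a[@href])").load();
            query.setContextItem(doc);
            System.out.println(query.evaluate());
        }
    }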

It will certainly help, but there are still a few things you have to be
careful about in your XQuery. For instance, a <foo:bar> somewhere in your
content and a //foo:bar path expression might not match in any manner
that you'd expect, since the HTML parser treats the whole of "foo:bar" as
a local name in the HTML namespace rather than binding "foo" as a prefix;
I also don't think that you can compare with xs:QName("", "foo:bar").
But those are details for another group to figure out, and not what
occupies us here.
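
To make the first of those concrete, here is the kind of check I have in
mind (again a sketch from memory, untested; what the parser actually
reports for the colon is exactly the sort of detail that would need
pinning down):

    import java.io.StringReader;
    import javax.xml.transform.sax.SAXSource;
    import org.xml.sax.InputSource;
    import net.sf.saxon.s9api.*;
    import nu.validator.htmlparser.common.XmlViolationPolicy;
    import nu.validator.htmlparser.sax.HtmlParser;

    public class NameTestGotcha {
        public static void main(String[] args) throws Exception {
            // Per HTML parsing rules, <foo:bar> becomes an element whose tag
            // name is the literal string "foo:bar" in the XHTML namespace;
            // "foo" is not a namespace prefix.
            String html = "<!DOCTYPE html><p><foo:bar>hi</foo:bar></p>";
            InputSource in = new InputSource(new StringReader(html));
            in.setSystemId("http://example.org/doc.html");

            Processor proc = new Processor(false);
            XdmNode doc = proc.newDocumentBuilder().build(new SAXSource(
                    new HtmlParser(XmlViolationPolicy.ALTER_INFOSET), in));

            XQueryCompiler xq = proc.newXQueryCompiler();

            // "foo" here is an XQuery prefix bound to a namespace, which is
            // not what the HTML parser produced, so this is very likely 0.
            run(xq, doc, "declare namespace foo = 'http://example.org/x'; "
                    + "count(//foo:bar)");

            // Listing the local names shows what the parser actually did
            // with the colon (kept it, dropped it, or rewrote it).
            run(xq, doc, "string-join(for $e in //* return local-name($e), ' ')");
        }

        static void run(XQueryCompiler xq, XdmNode doc, String query)
                throws SaxonApiException {
            XQueryEvaluator ev = xq.compile(query).load();
            ev.setContextItem(doc);
            System.out.println(query + "  =>  " + ev.evaluate());
        }
    }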

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
