- From: Robin Berjon <robin@w3.org>
- Date: Mon, 03 Dec 2012 11:56:05 +0100
- To: David Carlisle <davidc@nag.co.uk>
- CC: public-html WG <public-html@w3.org>, www-tag@w3.org
On 03/12/2012 11:49, David Carlisle wrote:
> On 03/12/2012 10:35, Robin Berjon wrote:
>> Case in point: we have a few large HTML datasets at hand which we
>> can use to look at how HTML is used in the wild. But for the most
>> part we tend to be limited to grepping, to some simple indexing, or
>> to parsing them all and applying some ad hoc rules to extract data
>> (which is slow). It would be sweet to be able to just load those
>> into a DB and run XQuery (or something like it) on them. If that
>> were possible, you'd get very fast, high octane analysis with
>> off-the-shelf software (a lot of it open source).
>
> Not directly related to the document at hand, but for the record on
> the list archives, I think it should be noted that this is of course
> possible now. If you hook (say) the validator.nu SAX parser up to
> any Java-based XSLT/XQuery engine, say Saxon, then you can process
> HTML with XML tools.

That will certainly help, but there are still a few things you have to
be careful about in your XQuery. For instance, <foo:bar> somewhere in
your content and //foo:bar in your query might not match in any manner
that you'd expect; I also don't think that you can compare with
fn:QName("", "foo:bar"). But those are details for another group to
figure out, and not what occupies us here.
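For concreteness, the pipeline looks roughly like this with
validator.nu and Saxon's s9api (a quick, untested sketch, so take the
exact class names and the sample query with a grain of salt):

import java.io.File;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;

import net.sf.saxon.s9api.*;
import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.sax.HtmlParser;

public class HtmlXQuery {
    public static void main(String[] args) throws Exception {
        // The HTML5 parser presents itself as a SAX XMLReader;
        // ALTER_INFOSET coerces names that aren't legal in XML into
        // something the XML toolchain will accept.
        HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);

        Processor saxon = new Processor(false); // Saxon-HE
        DocumentBuilder builder = saxon.newDocumentBuilder();
        XdmNode doc = builder.build(new SAXSource(parser,
                new InputSource(new File(args[0]).toURI().toString())));

        // Parsed elements land in the XHTML namespace, so bind a
        // prefix for it before querying.
        XQueryCompiler compiler = saxon.newXQueryCompiler();
        compiler.declareNamespace("h", "http://www.w3.org/1999/xhtml");
        XQueryEvaluator query = compiler
                .compile("count(//h:a[@href])").load();
        query.setContextItem(doc);
        for (XdmItem item : query) {
            System.out.println(item.getStringValue());
        }
    }
}

Swap in whatever query your analysis needs; once the parse tree is in
XDM, everything downstream is stock XML tooling.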
--
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Monday, 3 December 2012 10:56:20 UTC