- From: David Carlisle <davidc@nag.co.uk>
- Date: Mon, 03 Dec 2012 10:49:36 +0000
- To: Robin Berjon <robin@w3.org>
- CC: public-html WG <public-html@w3.org>, www-tag@w3.org
On 03/12/2012 10:35, Robin Berjon wrote:

> Case in point: we have a few large HTML datasets at hand which we
> can use to look at how HTML is used in the wild. But for the most
> part we tend to be limited to grepping, to some simple indexing, or
> to parsing them all and applying some ad hoc rules to extract data
> (which is slow). It would be sweet to be able to just load those
> into a DB and run XQuery (or something like it) on them. If that
> were possible, you'd get very fast, high octane analysis with
> off-the-shelf software (a lot of it open source).

Not directly related to the document at hand, but for the record on the list archives, it should be noted that this is of course possible now. If you hook (say) the validator.nu SAX parser up to any Java-based XSLT/XQuery engine, say Saxon, then you can process HTML today with XML tools (a minimal sketch follows below). There are not HTML parsers for every possible XML API, so this isn't a complete solution, but it works well when it works.

David
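[Editorial sketch of the pipeline David describes, assuming the validator.nu htmlparser and Saxon jars are on the classpath; the file names page.html and extract.xsl are placeholders, and the Saxon factory class name follows the Saxon 9.x packaging.]

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.xml.sax.InputSource;

import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.sax.HtmlParser;

public class HtmlWithXmlTools {
    public static void main(String[] args) throws Exception {
        // The validator.nu parser is an HTML5 parser that implements the
        // standard SAX XMLReader interface, so any JAXP consumer accepts it.
        // ALTER_INFOSET coerces constructs legal in HTML but not in XML
        // into an XML-safe form as the events are emitted.
        HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);

        // Drive a SAXSource with the HTML parser instead of an XML parser.
        // "page.html" is a placeholder input document.
        SAXSource html = new SAXSource(parser, new InputSource("page.html"));

        // Saxon's JAXP factory; any other JAXP-compliant XSLT engine
        // could be substituted here.
        TransformerFactory tf = new net.sf.saxon.TransformerFactoryImpl();
        Transformer extract = tf.newTransformer(new StreamSource("extract.xsl"));

        // Run the stylesheet over the parsed HTML and write to stdout.
        extract.transform(html, new StreamResult(System.out));
    }
}
```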
Received on Monday, 3 December 2012 10:50:27 UTC