Re: HTML/XML TF Report glosses over Polyglot Markup (Was: Statement why the Polyglot doc should be informative)

On 03/12/2012 10:35, Robin Berjon wrote:


> Case in point: we have a few large HTML datasets at hand which we
> can use to look at how HTML is used in the wild. But for the most
> part we tend to be limited to grepping, to some simple indexing, or
> to parsing them all and applying some ad hoc rules to extract data
> (which is slow). It would be sweet to be able to just load those
> into a DB and run XQuery (or something like it) on them. If that
> were possible, you'd get very fast, high octane analysis with
> off-the-shelf software (a lot of it open source).


Not directly related to the document at hand, but for the record in the
list archives, it should be noted that this is already possible. If you
hook (say) the validator.nu SAX parser up to any Java-based XSLT/XQuery
engine, such as Saxon, then you can process HTML with XML tools today.
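
To make that concrete, here is a minimal sketch of the wiring, assuming
the validator.nu htmlparser jar and Saxon-HE (s9api) are on the
classpath; the file name "page.html" and the query are just
illustrative. The key point is that the validator.nu HtmlParser
implements org.xml.sax.XMLReader, so Saxon can consume it like any
other SAX source:

    import javax.xml.transform.sax.SAXSource;
    import nu.validator.htmlparser.common.XmlViolationPolicy;
    import nu.validator.htmlparser.sax.HtmlParser;
    import net.sf.saxon.s9api.*;
    import org.xml.sax.InputSource;

    public class HtmlXQueryDemo {
        public static void main(String[] args) throws Exception {
            // The validator.nu parser is an org.xml.sax.XMLReader that
            // applies the HTML5 parsing algorithm to tag soup.
            HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);

            // Build a Saxon document directly from the SAX event stream.
            Processor proc = new Processor(false);
            DocumentBuilder builder = proc.newDocumentBuilder();
            XdmNode doc = builder.build(
                new SAXSource(parser, new InputSource("page.html")));

            // Run an XQuery over the parsed HTML. The parser puts
            // elements in the XHTML namespace, hence the h: prefix.
            XQueryCompiler comp = proc.newXQueryCompiler();
            comp.declareNamespace("h", "http://www.w3.org/1999/xhtml");
            XQueryExecutable exec = comp.compile("//h:a/@href/string()");
            XQueryEvaluator eval = exec.load();
            eval.setContextItem(doc);
            for (XdmItem item : eval.evaluate()) {
                System.out.println(item.getStringValue());
            }
        }
    }

The same SAXSource can equally be fed to an XSLT transformation; the
only HTML-specific part is the parser on the front end.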

There are not HTML parsers for every possible XML API, so this isn't a
complete solution, but it works well when it works.

David

Received on Monday, 3 December 2012 10:50:26 UTC