Re: HTML/XML TF Report glosses over Polyglot Markup (Was: Statement why the Polyglot doc should be informative)

On 03/12/2012 10:35, Robin Berjon wrote:


> Case in point: we have a few large HTML datasets at hand which we
> can use to look at how HTML is used in the wild. But for the most
> part we tend to be limited to grepping, to some simple indexing, or
> to parsing them all and applying some ad hoc rules to extract data
> (which is slow). It would be sweet to be able to just load those
> into a DB and run XQuery (or something like it) on them. If that
> were possible, you'd get very fast, high octane analysis with
> off-the-shelf software (a lot of it open source).


Not directly related to the document at hand, but for the record in the
list archives, it should be noted that this is already possible. If you
hook (say) the validator.nu SAX parser up to any Java-based XSLT/XQuery
engine, such as Saxon, then you can process HTML with XML tools today.
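
To make that concrete, here is a minimal sketch of the wiring, assuming
the validator.nu htmlparser jar and Saxon-HE (s9api) are on the
classpath; the file name "page.html" and the query are just
illustrative. The key point is that the validator.nu HtmlParser
implements org.xml.sax.XMLReader, so Saxon can consume it like any
other SAX source:

    import javax.xml.transform.sax.SAXSource;
    import nu.validator.htmlparser.common.XmlViolationPolicy;
    import nu.validator.htmlparser.sax.HtmlParser;
    import net.sf.saxon.s9api.*;
    import org.xml.sax.InputSource;

    public class HtmlXQueryDemo {
        public static void main(String[] args) throws Exception {
            // The validator.nu parser is an org.xml.sax.XMLReader that
            // applies the HTML5 parsing algorithm to tag soup.
            HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);

            // Build a Saxon document directly from the SAX event stream.
            Processor proc = new Processor(false);
            DocumentBuilder builder = proc.newDocumentBuilder();
            XdmNode doc = builder.build(
                new SAXSource(parser, new InputSource("page.html")));

            // Run an XQuery over the parsed HTML. The parser puts
            // elements in the XHTML namespace, hence the h: prefix.
            XQueryCompiler comp = proc.newXQueryCompiler();
            comp.declareNamespace("h", "http://www.w3.org/1999/xhtml");
            XQueryExecutable exec = comp.compile("//h:a/@href/string()");
            XQueryEvaluator eval = exec.load();
            eval.setContextItem(doc);
            for (XdmItem item : eval.evaluate()) {
                System.out.println(item.getStringValue());
            }
        }
    }

The same SAXSource can equally be fed to an XSLT transformation; the
only HTML-specific part is the parser on the front end.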

There are not HTML parsers for every possible XML API, so this isn't a
complete solution, but it works well when it works.

David

Received on Monday, 3 December 2012 10:50:26 UTC