student project idea: RDF/RDFa parser QA via automatic test-suite generation from Dan Brickley on 2008-11-18 (semantic-web@w3.org from November 2008)

From: Dan Brickley <danbri@danbri.org>
Date: Tue, 18 Nov 2008 10:08:17 +0100
To: Semantic Web <semantic-web@w3.org>
Cc: RDFa <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <49228601.3040904@danbri.org>

Hi all (but especially students and academic staff),

Yesterday I found a bug in Redland's librdfa-based RDFa parsing 
facilities. A fairly obscure markup pattern caused the librdfa library 
to fail to generate an RDF triple. Redland/raptor deals with this by 
throwing a fatal error, bringing my RDFa-parsing ambitions to a grinding 
halt. This was on input data I'd generated myself (the curious can see 
details at http://bugs.librdf.org/mantis/view.php?id=289 ).

If RDF (and especially RDFa) parsers are going to robustly handle all 
the scary messy markup that's out there, then I don't think we can wait 
for humans like me to stumble upon the awkward corner cases that trip 
them up. So I've a proposal (based on some old work by Janne Saarela):

I'd like to see an auto-generated repository of RDFa samples, most (but 
not all) of which are decent well-formed XHTML with RDFa, but also with a 
good number of poorly marked-up files. Note that poor, confusing or 
downright weird markup may or may not trip up XML's well-formedness rules.
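
To make that concrete, here is a minimal sketch in Python of the kind of 
generator I have in mind. Everything in it is an assumption made up for 
illustration: the tiny vocabulary, the function names (element, 
make_sample, make_corpus, is_well_formed) and the single way it breaks 
markup (dropping one closing tag). A real generator would want a far 
richer grammar of RDFa attributes and many more kinds of damage.

```python
import random
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical vocabulary, invented for this sketch only.
SUBJECTS = ["#me", "http://example.org/doc", ""]
# "bogus:thing" uses a prefix that is never declared: odd, yet well-formed XML.
PROPERTIES = ["foaf:name", "dc:title", "bogus:thing"]
TEXTS = ["Alice", "A title", "caf\u00e9 & co"]

HEADER = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:foaf="http://xmlns.com/foaf/0.1/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<head><title>sample</title></head><body>"
)
FOOTER = "</body></html>"


def element(rng, depth=0):
    """Return one <div> with RDFa attributes, sometimes with a nested child."""
    attrs = " property=" + quoteattr(rng.choice(PROPERTIES))
    if rng.random() < 0.5:
        attrs += " about=" + quoteattr(rng.choice(SUBJECTS))
    child = element(rng, depth + 1) if depth < 2 and rng.random() < 0.4 else ""
    return "<div" + attrs + ">" + escape(rng.choice(TEXTS)) + child + "</div>"


def make_sample(rng, broken=False):
    """Build one document; if broken, drop a closing tag so it is ill-formed."""
    body = "".join(element(rng) for _ in range(rng.randint(1, 3)))
    doc = HEADER + body + FOOTER
    if broken:
        doc = doc.replace("</div>", "", 1)
    return doc


def make_corpus(n, seed=0, broken_ratio=0.2):
    """Return n (document, broken) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n):
        broken = rng.random() < broken_ratio
        corpus.append((make_sample(rng, broken), broken))
    return corpus


def is_well_formed(doc):
    """True if the document passes an XML well-formedness check."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False
```

Calling make_corpus(1000, seed=42) would give a reproducible batch of 
1000 documents, each tagged with whether it was deliberately damaged, so 
a failing case can be regenerated later from the seed alone.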

Here is an old set of RDF/XML test files autogenerated with Prolog:
http://www.w3.org/RDF/Test/Janne/

Related tools include the Dada Engine, http://dev.null.org/dadaengine/ 
(the tool behind http://www.elsewhere.org/pomo/ ) and Rmutt, 
http://www.schneertz.com/rmutt/ ... either of which could be used to 
make the output more entertaining.

Generating such a test set and then wiring it up to a set of RDFa 
parsers (via http://rdfa.digitalbazaar.com/rdfa-test-harness/ or 
something like it) shouldn't be a huge job, but it would be a very 
useful one. I'd like to see perhaps 1000 'nonsense' RDFa documents that 
experiment with every conceivable or inconceivable syntactic variant 
that parsers might encounter in the wild. And then find out (a) whether 
any parsers completely fail on that input, (b) how many triples are 
generated and what they contain, and (c) whether the spec gurus agree on 
what ought to be generated.
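
Checks (a) and (b) are easy to automate. The sketch below is a 
hypothetical stand-in in Python, not a description of how the 
rdfa-test-harness works: it assumes each parser can be wrapped as a 
function that takes a document string and returns an iterable of 
(subject, predicate, object) tuples, and the names run_parser and compare 
are invented here. Question (c) still needs humans, but documents on 
which the parsers disagree are the obvious ones to put in front of them.

```python
def run_parser(parser, doc):
    """Return ('ok', frozenset of triples) or ('fail', error name); never raise."""
    try:
        return ("ok", frozenset(parser(doc)))
    except Exception as exc:  # a crash is a result to record, not a reason to stop
        return ("fail", type(exc).__name__)


def compare(docs, parsers):
    """Run every parser on every document and summarise the outcome.

    docs maps a document name to its text; parsers maps a parser name to a
    callable (both are assumptions of this sketch).
    """
    report = []
    for doc_name, doc in docs.items():
        results = {name: run_parser(p, doc) for name, p in parsers.items()}
        failed = sorted(n for n, (status, _) in results.items() if status == "fail")
        counts = {n: len(out) for n, (status, out) in results.items() if status == "ok"}
        outputs = {out for status, out in results.values() if status == "ok"}
        report.append({
            "doc": doc_name,
            "failed": failed,                 # (a) parsers that gave up entirely
            "triple_counts": counts,          # (b) how many triples each produced
            "agree": not failed and len(outputs) == 1,  # identical triple sets?
        })
    return report
```

Any entry with a non-empty "failed" list or with "agree" set to False is 
a candidate for a bug report or a question to the spec authors.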

Does this sound worthwhile? Anyone willing to work on it or to help 
explore it as a student project? Students would gain an understanding of 
XML, RDFa grammars and the state of the art (and lack thereof ;) in 
automatic tool support for assuring compliance with the standards.

cheers,

Dan

--
http://danbri.org/

Received on Tuesday, 18 November 2008 09:08:55 UTC