R2RML, v 1.65 2011/06/15: implementation experience. from Ivan Mikhailov on 2011-06-15 (public-rdb2rdf-comments@w3.org from June 2011)

From: Ivan Mikhailov <imikhailov@openlinksw.com>
Date: Thu, 16 Jun 2011 04:05:40 +0700
To: public-rdb2rdf-comments@w3.org
Message-ID: <1308171940.21233.322.camel@octo.iv.dev.null>
Hi all,

I've made a translator from R2RML to declarations of OpenLink Virtuoso's
RDF Views. It was interesting and in some cases fun, because it was a
nice sandbox for playing with SPARQL and for pushing as much processing
as possible into SPARQL queries over R2RML resources. If I were only the
initial developer and not also the maintainer, I'd rather have written
an XSLT with SPARQL injections, just to make things even more
interesting.

The result is not bad. It took only 900 lines of code. So the source
representation is proven to be convenient, again.

The examples are accurate and can be used "as is" for first tests,
except for a trivial missing semicolon after

rr:usePredicateObjectMap 
    [ 
      rr:usePredicateMap [ rr:predicate emp:job ]; 
      rr:useObjectMap    [ rr:column "job" ]
    ]

in A 2.2.1, A 2.2.2 and A 2.2.3.

Another minor problem is that figures 1b through 9 are obsolete. I'd be
too lazy to patch figures frequently, so I'd label them "deprecated" for
a while.

The generated text of RDF Views is not perfect. It's rather a draft for
review and for assigning meaningful names to individual mapping rules,
for the readability of future error diagnostics etc. I'll probably
extend source R2RMLs with rdfs:labels, comments etc. in order to make
the output more readable.

Further works: inverse expressions

I ignore rr:inverseExpression-s, because most rr:template-s are
compiled into format strings for Virtuoso's sprintf() function, and
Virtuoso has a sprintf_inverse() string-parsing function that is smart
enough to eliminate the need for 95% of handwritten URI parsing. Maybe I
should detect the remaining 5% and process rr:inverseExpression-s for
them, but the priority of this improvement is low.
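
As a rough illustration of the idea (not Virtuoso's actual sprintf() or
sprintf_inverse(), and assuming a template syntax with column names in
curly braces), a template can be turned into a format string, and the
format string can then be used backwards to recover column values from
a URI. The function names below are hypothetical.

```python
import re

def template_to_format(template):
    """Compile a '{column}' template into a %s format string plus column list."""
    parts, columns, pos = [], [], 0
    for m in re.finditer(r"\{([^{}]+)\}", template):
        # Literal text between placeholders; escape '%' for the format string.
        parts.append(template[pos:m.start()].replace("%", "%%"))
        parts.append("%s")
        columns.append(m.group(1))
        pos = m.end()
    parts.append(template[pos:].replace("%", "%%"))
    return "".join(parts), columns

def format_inverse(fmt, value):
    """Recover the strings substituted for each %s, or None if there is no match."""
    pattern = ""
    for piece in re.split(r"(%s|%%)", fmt):
        if piece == "%s":
            pattern += "(.+?)"          # one captured column value
        elif piece == "%%":
            pattern += "%"              # escaped literal percent sign
        else:
            pattern += re.escape(piece)  # literal text must match exactly
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

fmt, cols = template_to_format("http://example.com/emp/{empno}/dept/{deptno}")
uri = fmt % ("7369", "20")
values = format_inverse(fmt, uri)
```

Templates whose placeholders are not separated by literal text (for
example "{a}{b}") cannot be inverted unambiguously this way; those are
the kind of case where a handwritten inverse expression would still be
needed.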

Further works: validation

The open issue for me is the validation of input. The examples use only
rr:TriplesMap as an explicitly declared type; the types of the other
(blank) nodes are defined implicitly as ranges of the predicates in use.
No doubt, that's how people will write their own R2RML resources,
especially if they write Turtle. However, I'm not sure what the best
policy for validation is. E.g., one may decide to create a (supposedly
rr:SubjectMap) node and use it as the value of both rr:useSubjectMap and
rr:useObjectMap predicates in different places. Should I warn about
rr:graph in rr:useObjectMap after that?
If types are not declared explicitly, should I first infer them and then
warn about multiple types assigned to the same node? Which classes are
supposed to be disjoint?
Right now I've sabotaged the coding of the validator, eliminating the
problem, but that's not a universal solution.
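
One possible policy, "infer types from predicate ranges, then warn when
a node gets two classes that are declared disjoint", could be sketched
as follows. The range table and the disjointness set are assumptions
made for illustration only; deciding what they should contain is exactly
the open question above. Triples are plain (subject, predicate, object)
tuples of strings.

```python
from collections import defaultdict

RR = "http://www.w3.org/ns/r2rml#"

# Assumed predicate ranges (illustrative, not taken from the spec text).
RANGES = {
    RR + "useSubjectMap": RR + "SubjectMap",
    RR + "useObjectMap": RR + "ObjectMap",
    RR + "usePredicateMap": RR + "PredicateMap",
}

# Assumed disjoint class pairs (illustrative only).
DISJOINT = {frozenset({RR + "SubjectMap", RR + "ObjectMap"})}

def infer_types(triples):
    """Map each object node to the set of classes implied by predicate ranges."""
    types = defaultdict(set)
    for _subject, predicate, obj in triples:
        if predicate in RANGES:
            types[obj].add(RANGES[predicate])
    return types

def disjointness_warnings(triples):
    """Return (node, class_a, class_b) for every node with two disjoint classes."""
    warnings = []
    for node, classes in sorted(infer_types(triples).items()):
        ordered = sorted(classes)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if frozenset({a, b}) in DISJOINT:
                    warnings.append((node, a, b))
    return warnings

triples = [
    ("_:tm1", RR + "useSubjectMap", "_:m"),   # _:m used as a subject map ...
    ("_:pom", RR + "useObjectMap", "_:m"),    # ... and also as an object map
    ("_:tm2", RR + "useSubjectMap", "_:s"),
]
warnings = disjointness_warnings(triples)
```

With these assumed tables, only the shared node _:m is reported; a more
permissive policy would simply leave the disjointness set empty.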

Further works: tests and tutorial examples

OpenLink Software participates in the Linking Open Data - 2 project (FP7
LOD2), and we will provide an RDF "remake" of the TPC-H benchmark,
codenamed RDF-H. I intend to write an R2RML file that maps the canonical
TPC-H tables to the RDF-H graph and to report whether the mapping is
adequate. That will be both a test for my R2RML translator and a
half-real-life use case for R2RML itself. The same could be done for the
BSBM benchmark.

Further works: release

The R2RML translator described above will appear in a Virtuoso Open
Source release, so it will be one of the "independent implementations"
of the spec for its Candidate Recommendation phase.


Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
Received on Wednesday, 15 June 2011 21:06:08 UTC