A SPARQL extension for transforming XML to RDF (and more)

Dear RAX group,


I have not taken part in this group's discussions so far and have only 
been following the mailing list silently. Still, I have something that 
could be of interest to you.

My colleague Maxime (in CC) and I have worked on making a language for 
expressing a transformation from documents in any format to RDF, with 
the idea that the language should be as close as possible to SPARQL 
(thus can be implemented easily by extending a SPARQL engine).

The result is SPARQL-Generate [1,2,3]. It works for generating RDF from 
any source format, so it's applicable to XML in particular (we do not 
deal with RDF-to-XML generation, though). Here is roughly how it works:

  - You use XPath to specify the pieces of the source document you want 
to extract, and bind the result to a SPARQL variable. The XPath selection 
is provided as a custom SPARQL function, so it relies only on the standard 
extension mechanism and the standard binding mechanism of SPARQL.
  - Since this is bound to a variable, the result of the XPath selection 
can be used within a standard SPARQL expression.
  - We allow extraction from multiple files at the same time, so 
information from several sources can be crossed, combined, processed, 
etc. and the result is bound to a variable using a BIND clause.
  - The variables are then injected into a graph pattern in a similar 
way as for CONSTRUCT queries. However, we do not reuse the CONSTRUCT 
clause because we allow nested graph pattern generation.
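
To make the steps above more concrete, here is a minimal Python sketch of 
the same select / bind / inject idea. It is only an analogy and not 
SPARQL-Generate syntax: the sample XML, the http://example.org/ namespace 
and the function names are all invented for illustration, and the Python 
standard library supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document, invented for illustration.
XML_SOURCE = """
<catalog>
  <book id="b1"><title>Dune</title><year>1965</year></book>
  <book id="b2"><title>Neuromancer</title><year>1984</year></book>
</catalog>
"""

BASE = "http://example.org/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def select_bindings(xml_text):
    """Select pieces of the source and bind them to named variables.

    Each <book> element yields one solution (a dict of variable -> value).
    """
    root = ET.fromstring(xml_text)
    solutions = []
    # ElementTree implements only a limited XPath subset.
    for book in root.findall("./book"):
        solutions.append({
            "id": book.get("id"),
            "title": book.findtext("title"),
            "year": int(book.findtext("year")),
        })
    return solutions


def generate_triples(solutions):
    """Inject each solution into a fixed triple template (N-Triples-like)."""
    triples = []
    for s in solutions:
        subject = f"<{BASE}book/{s['id']}>"
        triples.append(f'{subject} <{BASE}title> "{s["title"]}" .')
        # A value computed from a bound variable, as an expression would be.
        decade = s["year"] // 10 * 10
        triples.append(
            f'{subject} <{BASE}decade> "{decade}"^^<{XSD_INTEGER}> .'
        )
    return triples


if __name__ == "__main__":
    for line in generate_triples(select_bindings(XML_SOURCE)):
        print(line)
```

In SPARQL-Generate itself, the selection and the template are both part of 
a single query, which is what lets them be mixed with ordinary SPARQL 
patterns.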

Since the language extends SPARQL and the implementation extends a 
SPARQL engine, it is possible to include the XML extraction inside a 
normal SPARQL query pattern over a triple store (or over multiple triple 
stores with the SERVICE clause).

The tool is not limited to XML-to-RDF generation. Any combination of 
formats can be used as source files, thanks to a number of custom 
functions: JSONPath for JSON or CBOR, CSS selectors for HTML/XML, regex 
selectors for arbitrary text files, date and time conversion functions, 
and more.
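
As a rough illustration of combining sources in different formats, the 
following Python sketch crosses a JSON document with a plain-text file 
selected by a regular expression. Again, this is an analogy under stated 
assumptions and not SPARQL-Generate syntax: the inputs, the 
http://example.org/ namespace and the function names are hypothetical, and 
plain dictionary access stands in for JSONPath.

```python
import json
import re

# Hypothetical inputs, invented for illustration.
JSON_SOURCE = (
    '{"sensor": "s42", '
    '"readings": [{"t": "2017-07-18T15:00:00Z", "v": 21.5}]}'
)
LOG_SOURCE = "sensor=s42 location=room-101\nsensor=s43 location=room-102\n"

BASE = "http://example.org/"  # hypothetical namespace


def bindings_from_json(text):
    """One solution per reading; dict access stands in for JSONPath."""
    data = json.loads(text)
    return [
        {"sensor": data["sensor"], "time": r["t"], "value": r["v"]}
        for r in data["readings"]
    ]


def locations_from_log(text):
    """Regex selection over arbitrary text: sensor id -> location."""
    return dict(re.findall(r"sensor=(\S+) location=(\S+)", text))


def generate(json_text, log_text):
    """Cross both sources and inject the bindings into a triple template."""
    locations = locations_from_log(log_text)
    triples = []
    for b in bindings_from_json(json_text):
        subject = f"<{BASE}sensor/{b['sensor']}>"
        triples.append(f'{subject} <{BASE}value> "{b["value"]}" .')
        location = locations.get(b["sensor"])
        if location is not None:
            triples.append(
                f"{subject} <{BASE}location> <{BASE}place/{location}> ."
            )
    return triples


if __name__ == "__main__":
    for line in generate(JSON_SOURCE, LOG_SOURCE):
        print(line)
```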

The web site [1] provides an online interface for testing, many examples 
and test cases of various levels of complexity, a command line tool in 
the form of an executable jar, the source code of our implementation 
(extending Jena), and some documentation with a tutorial (to be improved).

We are working on improvements: syntactic sugar to make queries much 
easier to write, and support for data streams.

If you are interested in further information, please contact us. If you 
are using it, please let us know! We are of course eager to know who our 
users are.


Regards,
--AZ

[1] SPARQL-Generate official web site: http://ci.emse.fr/sparql-generate/
[2] Maxime Lefrançois, Antoine Zimmermann, Noorani Bakerally. Flexible 
RDF generation from RDF and heterogeneous data sources with 
SPARQL-Generate. In Proc. of the 20th International Conference on 
Knowledge Engineering and Knowledge Management, EKAW, Nov 2016, Bologna, 
Italy (demo track). 
http://www.maxime-lefrancois.info/docs/LefrancoisZimmermannBakerally-EKAW2016-Flexible.pdf
[3] Maxime Lefrançois, Antoine Zimmermann, Noorani Bakerally. A SPARQL 
extension for generating RDF from heterogeneous formats. In Proc. 
Extended Semantic Web Conference, ESWC, May 2017, Portoroz, Slovenia. 
http://www.maxime-lefrancois.info/docs/LefrancoisZimmermannBakerally-ESWC2017-Generate.pdf
-- 
Antoine Zimmermann
Institut Henri Fayol
École des Mines de Saint-Étienne
158 cours Fauriel
CS 62362
42023 Saint-Étienne Cedex 2
France
Tél:+33(0)4 77 42 66 03
Fax:+33(0)4 77 42 66 66
http://www.emse.fr/~zimmermann/
Member of team Connected Intelligence, Laboratoire Hubert Curien

Received on Tuesday, 18 July 2017 15:52:29 UTC