Re: Best way for exposing Linked Open Data. Wrapper vs scrape


I am the creator of Sparqlify[1], a SPARQL-to-SQL rewriter,
which we are developing and using for publishing the relational
OpenStreetMap[2] database as RDF in the course of the LinkedGeoData
(LGD) project[3], and which thus currently serves 20 billion virtual
triples.
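
For readers unfamiliar with the approach: a SPARQL-to-SQL rewriter
answers SPARQL queries directly against the relational tables, so no
triples need to be materialized. As a rough illustration (the table
and column names below are invented for the example and are not the
actual LGD schema; prefix declarations omitted), a pattern like

    SELECT ?s {
        ?s dcterms:contributor lgd:user666 .
    }

might be rewritten into SQL along the lines of

    SELECT node_id FROM nodes WHERE user_id = 666;

with the binding for ?s then constructed from the result column
according to the mapping.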

So far we have also applied the tool successfully to other databases
(Wortschatz, PanLex) and to numerous CSV files on CKAN (see [4]).

Currently, the latest snapshot of Sparqlify is automatically packaged
as a Debian package at [5] on every successful build (which includes
testing against the R2RML test suite).

This Deb contains the scripts 'sparqlify' and 'sparqlify-csv': the
former is for databases (tested with Postgres and H2, but not MySQL
yet), the latter for CSV files.
Another script / war file that bundles a Linked Data interface and the
HTML SPARQL interface will follow shortly.

Anecdotal evidence from myself and my students suggests that the
mapping language used, SML (Sparqlification Mapping Language), is
pretty straightforward to use and, in terms of expressivity,
essentially equivalent to R2RML (except for a current lack of support
for inverse expressions).
I recommend looking at the mappings of LinkedGeoData [6] and judging
for yourselves.

R2RML <-> SML conversion support is underway, but it will probably
take a couple of months before release.
I started writing down the documentation of SML at [7].
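
To give a flavour of SML (see [7] for the authoritative syntax; the
example below is a from-memory sketch with made-up names, not taken
from the LGD mappings): a view maps a table to triples by pairing a
Construct template with term constructors over the table's columns:

    Prefix ex: <http://example.org/>

    Create View person As
      Construct {
        ?s a ex:Person .
        ?s ex:name ?n .
      }
      With
        ?s = uri(concat("http://example.org/person/", ?id))
        ?n = plainLiteral(?name)
      From
        person

Here ?id and ?name refer to columns of the relational table 'person',
and the With clause states how the RDF terms are built from them.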

An official Debian package will become part of the LOD2 stack[8] this month.

So if anyone is interested in trying out Sparqlify: feedback and
suggestions for improvement are most welcome (please use the GitHub
issue tracker for any bugs) ;)


Here are two quick examples for LGD:

- Number of triples contributed by user 666:
(the query link has been truncated here; the remaining URL-encoded
fragment decodes to "*) As ?c) { ?s dcterms:contributor lgd:user666 .
?s ?p ?o . }" -- evidently a COUNT query)

- A nice feature is the EXPLAIN keyword: it helps one review the
generated SQL and spot performance bottlenecks.
(this link is truncated as well; its fragment decodes to
"* { ?s dcterms:contributor lgd:user666 . ?s ?p ?o . }")
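
Since the links above are truncated, here is a plausible
reconstruction of the count query (the SELECT clause and the dcterms
prefix declaration are filled in by hand from the decoded fragment;
the lgd: prefix declaration is omitted):

    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT (COUNT(*) As ?c) {
        ?s dcterms:contributor lgd:user666 .
        ?s ?p ?o .
    }

The second example presumably runs the same basic pattern with
SELECT * and the EXPLAIN keyword prepended, so that the generated SQL
is shown instead of the results.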


On 05/28/2013 10:18 AM, Luca Matteis wrote:
> Here's my scenario: I have several different datasets. Most in MySQL 
> databases. Some in PostgreSQL. Others in MS Access. Many in CSV. Each 
> one of these datasets is maintained by its own group of people.
> Now, my end goal is to have all these datasets published as 5-star 
> Linked Open Data. But I am in doubt between these two solutions:
> 1) Give a generic wrapper tool to each of these groups of people, that 
> would basically convert their datasets to RDF, and allow them to 
> publish this data as LOD automatically. This tool would allow them to 
> publish LOD on their own, using their own server (does such a generic 
> tool even exist? Can it even be built?).
> 2) Scrape these datasets, which are at times simply published on the 
> Web as HTML paginated tables, or published as dumps on their server, 
> for example a .CSV dump of their entire database. Then I would 
> aggregate all these various datasets myself, and publish them as 
> Linked Data.
> Pros and cons for each of these methods? Any other ideas?
> Thanks!

Dipl. Inf. Claus Stadler
Department of Computer Science, University of Leipzig
Research Group:
Workpage & WebID:
Phone: +49 341 97-32260

Received on Friday, 7 June 2013 10:06:29 UTC