RE: Intro and Thoughts on CSV2RDF from Tandy, Jeremy on 2014-02-14 (public-csv-wg@w3.org from February 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Fri, 14 Feb 2014 10:31:08 +0000
To: Juan Sequeda <juanfederico@gmail.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE2B3258D@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>
Hi Juan - thanks for your thoughts. I note that currently we don't have any use cases on the wiki<https://www.w3.org/2013/csvw/wiki/Use_Cases> that discuss publication of (semantically enabled) CSV from relational database tables. We need use cases, expressed as a user-driven / outcome-driven narrative (e.g. tell a story about what someone is trying to achieve rather than use abstract functional requirements) with real data examples, in order to establish our requirements for the eventual spec.

Can you add something to the wiki please so that I can incorporate it into the "Use cases and requirements" documentation?

Many thanks, Jeremy

From: Juan Sequeda [mailto:juanfederico@gmail.com]
Sent: 12 February 2014 16:58
To: public-csv-wg@w3.org
Subject: Intro and Thoughts on CSV2RDF

All,

Quick intro (even though I did make an intro on the first call): I'm finishing my PhD in CS at UT Austin. My research focuses on the integration of relational databases with the semantic web. A result of my research is Ultrawrap [1], a Relational Database to RDF (RDB2RDF) system capable of running SPARQL queries as fast as SQL queries. Ultrawrap has been productized and is compliant with the W3C Direct Mapping and R2RML standards for RDB2RDF. Ultrawrap is currently being commercialized by my startup, Capsenta [2]. Ultrawrap is being used to generate the RDF dumps of Musicbrainz. Additionally, the data behind Constitute Project [3] comes from hundreds of CSVs that were converted to RDF using Ultrawrap. I've been involved in the RDB2RDF space since the first workshop in 2007, through the XG and the WG, as editor of the Direct Mapping spec and as implementor of both standards.

So... I believe I can bring some thoughts to the table wrt CSV to RDF. Part of these thoughts come from conversations that I have had previously with danbri.

I saw in today's minutes that the RDB2RDF topic came up. I agree with Axel that "CSV2RDF should be just a "dialect/small modification" of the existing RDB2RDF spec". I would actually encourage having both a Direct Mapping (a completely automated mapping) and a modification of R2RML.

The following issues arise:
- How do you know whether the first row is a header or not?
- How do you know whether there is an id attribute/field that acts as a unique identifier for the tuple (i.e. a primary key)?

Therefore, there needs to be a way to state this in a standard way. I'm assuming this is going to go somewhere. Given this information, the Direct Mapping standard should apply transparently (or so I believe at this moment).
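
To make the point above concrete, here is a minimal sketch of a Direct Mapping style conversion once the header and key information is supplied. It is only an illustration, not how Ultrawrap does it: the function name `direct_map`, its parameters, and the example base IRI are assumptions, and the IRI shapes only loosely follow the Direct Mapping pattern (no percent-encoding, no datatypes).

```python
import csv
import io

def direct_map(csv_text, base, table, has_header=True, key_column=None):
    """Yield (subject, predicate, object) triples for every row of a CSV.

    has_header and key_column stand in for the out-of-band information
    discussed above; both are hypothetical parameters for this sketch.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # No header row: invent positional column names (an assumption).
        header = [f"col{i + 1}" for i in range(len(rows[0]))]
        data = rows
    for n, row in enumerate(data, start=1):
        record = dict(zip(header, row))
        if key_column is not None:
            # A key gives the row a stable IRI.
            subject = f"<{base}{table}/{key_column}={record[key_column]}>"
        else:
            # Without a key, fall back to a blank node per row.
            subject = f"_:row{n}"
        yield (subject, "rdf:type", f"<{base}{table}>")
        for name, value in record.items():
            yield (subject, f"<{base}{table}#{name}>", f'"{value}"')

triples = list(direct_map("id,name\n1,Alice\n", "http://example.com/",
                          "artist", key_column="id"))
```

Without `key_column` the same rows still convert, but each row becomes a blank node, which is why a standard way to declare the key matters.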

Now with R2RML, I believe some changes need to be made. R2RML was made to take advantage of SQL as much as possible; that is why you can define a mapping on a table or on a SQL query. Take for example the following R2RML mappings for Musicbrainz [4]. You can see that the tuples from "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure how to do this without a SQL engine. Therefore, should SQL engines be involved in the CSV to RDF transformation?
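
One possible answer, sketched here purely for illustration, is to load the CSV into an embedded SQL engine and run the mapping's queries unchanged. The sketch uses Python's standard sqlite3 module; the sample rows, the `load_csv` helper, and the `CLASS_BY_QUERY` dictionary are made up for this example and are not taken from the Musicbrainz mappings.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_text):
    """Create a table from a CSV whose first row is assumed to be a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # NUMERIC affinity makes SQLite store "1" as the integer 1, so that
    # "artist.type = 1" matches. Caveat: a value such as "007" would
    # also be stored as 7, which a real converter would need to handle.
    columns = ", ".join(f'"{name}" NUMERIC' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

# Invented sample data standing in for an artist.csv file.
ARTIST_CSV = "id,name,type\n1,Solo Artist A,1\n2,Group B,2\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "artist", ARTIST_CSV)

# The two queries quoted above, each paired with its target class.
CLASS_BY_QUERY = {
    "SELECT * FROM artist WHERE artist.type = 1": "mo:SoloMusicArtist",
    "SELECT * FROM artist WHERE artist.type = 2": "mo:MusicGroup",
}

typed = []
for query, rdf_class in CLASS_BY_QUERY.items():
    for row in conn.execute(query):
        typed.append((row[0], rdf_class))  # (artist id, class)
```

This keeps existing R2RML mappings usable as-is, at the cost of requiring every CSV2RDF processor to ship or depend on a SQL engine.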

Another instance where R2RML relies heavily on SQL is when you want to translate database codes into IRIs [5]. For example, you might have a code value "eng" that should be mapped to a URI such as http://example.com/engineering, which is part of a well-defined thesaurus/vocabulary.
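
Outside of SQL, the same translation can be expressed as a plain lookup table. The sketch below is only an illustration: `CODE_TO_IRI` and `translate_code` are hypothetical names, and the single entry is the example from the paragraph above.

```python
# Hypothetical translation table: code value -> IRI in the vocabulary.
CODE_TO_IRI = {
    "eng": "http://example.com/engineering",
}

def translate_code(code, table=CODE_TO_IRI):
    """Return the IRI for a code value, or None if the code is unknown."""
    return table.get(code.strip().lower())

iri = translate_code("eng")
```

Returning None for an unknown code leaves it to the caller to decide whether to skip the triple or fall back to a literal.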

We have implemented CSV2RDF in Ultrawrap, which uses the Direct Mapping and R2RML standards as-is. The only assumption we make at the moment is that the first row is a header and the first attribute acts as a primary key.

Another topic I've discussed with danbri is the case where you have a set of CSVs that are basically the dumps of all the relational tables of a database. In that case there are implicit foreign keys, so there should be a way to describe the relationships (foreign keys) between different CSVs.
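
As a rough sketch of what such a description could look like, the foreign keys can be listed as data alongside the CSVs and then used to emit links between rows. Everything here is an assumption made for illustration: the `PRIMARY_KEYS` and `FOREIGN_KEYS` structures, the sample album/artist data, and the `ref-` predicate naming, which only loosely follows the Direct Mapping pattern.

```python
import csv
import io

# Hypothetical metadata describing keys across a set of CSVs.
PRIMARY_KEYS = {"artist": "id", "album": "id"}
FOREIGN_KEYS = [
    {"table": "album", "column": "artist_id",
     "ref_table": "artist", "ref_column": "id"},
]

def read_table(csv_text):
    """Parse a CSV with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def link_triples(tables, base):
    """Yield one triple per foreign-key value that resolves to a row."""
    for fk in FOREIGN_KEYS:
        src, col = fk["table"], fk["column"]
        dst, ref_col = fk["ref_table"], fk["ref_column"]
        known = {row[ref_col] for row in tables[dst]}
        pk = PRIMARY_KEYS[src]
        for row in tables[src]:
            if row[col] not in known:
                continue  # dangling reference: skipped in this sketch
            subject = f"<{base}{src}/{pk}={row[pk]}>"
            predicate = f"<{base}{src}#ref-{col}>"
            obj = f"<{base}{dst}/{ref_col}={row[col]}>"
            yield (subject, predicate, obj)

tables = {
    "artist": read_table("id,name\n1,Artist A\n"),
    "album": read_table("id,title,artist_id\n10,First,1\n11,Orphan,9\n"),
}
links = list(link_triples(tables, "http://example.com/"))
```

Rows whose foreign-key value has no match are skipped silently here; whether that should be an error is exactly the kind of question a standard description would need to settle.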

These are my initial thoughts. Looking forward to hearing what others have to say.

[1] http://www.sciencedirect.com/science/article/pii/S1570826813000383
[2] http://www.capsenta.com/
[3] https://www.constituteproject.org/
[4] https://github.com/LinkedBrainz/MusicBrainz-R2RML/blob/master/mappings/artist.ttl
[5] http://www.w3.org/TR/r2rml/#example-translationtable

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com<http://www.juansequeda.com>
Received on Friday, 14 February 2014 10:31:38 UTC