Intro and Thoughts on CSV2RDF from Juan Sequeda on 2014-02-12 (public-csv-wg@w3.org from February 2014)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Wed, 12 Feb 2014 10:57:30 -0600
To: public-csv-wg@w3.org
Message-ID: <CAMVTWDxXJVErY8V8g1eR-0=kE6ocSkKCeQij8RwQkuaNURD6iA@mail.gmail.com>
All,

Quick intro (even though I did make an intro on the first call): I'm
finishing my PhD in CS at UT Austin. My research focuses on the integration
of relational databases with the semantic web. A result of my research is
Ultrawrap [1], a Relational Database to RDF (RDB2RDF) system capable of
running SPARQL queries as fast as SQL queries. Ultrawrap has been
productized and is compliant with the W3C Direct Mapping and R2RML
standards for RDB2RDF. Ultrawrap is currently being commercialized by my
startup, Capsenta [2]. Ultrawrap is being used to generate the RDF dumps of
Musicbrainz. Additionally, the data behind the Constitute Project [3] comes
from hundreds of CSVs that were converted to RDF using Ultrawrap. I've been
involved in the RDB2RDF space since the first workshop in 2007, through the
XG and the WG, as editor of the Direct Mapping spec and as implementor of
both standards.

So... I believe I can bring some thoughts to the table wrt CSV to RDF. Part
of these thoughts come from conversations that I have had previously with
danbri.

I saw in today's minutes that the RDB2RDF topic came up. I agree with Axel
that "CSV2RDF should be just a "dialect/small modification" of the existing
RDB2RDF spec". I would actually encourage having both a Direct Mapping
(completely automated mapping) and a modification of R2RML.

The following issues arise:
- How do you know whether the first row is a header or not?
- How do you know whether there is an id attribute/field which acts as a
unique identifier for the tuple (i.e., a primary key)?

Therefore, there needs to be a standard way to state this. I'm assuming
this information is going to be recorded somewhere. Given this information,
the Direct Mapping standard should apply transparently (or so I believe at
this moment).
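
To make the two pieces of information above concrete, here is a minimal
Python sketch of a Direct-Mapping-style conversion of one CSV. It is an
illustration under stated assumptions, not how Ultrawrap or any standard
does it: the function name direct_map, the has_header and key_column
parameters, the positional column names used when there is no header, and
the example.com base IRI are all made up for this example. The IRI shapes
(Table/column=value for rows, Table#column for properties, blank nodes when
there is no key) follow the pattern of the Direct Mapping spec.

```python
import csv
import io
from urllib.parse import quote

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def literal(value):
    """Quote a cell value as a plain string literal.

    Only backslashes and double quotes are escaped; this is a sketch, not a
    complete N-Triples serializer.
    """
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def direct_map(csv_text, table, base, has_header=True, key_column=None):
    """Return (subject, predicate, object) triples for the rows of one CSV.

    has_header and key_column are the two facts the CSV itself cannot tell
    us, so the caller has to state them (hypothetical parameters).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if has_header:
        columns, data = rows[0], rows[1:]
    else:
        # Assumption: without a header, name the columns by position.
        columns = [f"col{i + 1}" for i in range(len(rows[0]))]
        data = rows
    table_iri = base + quote(table, safe="")
    triples = []
    for number, row in enumerate(data, start=1):
        record = dict(zip(columns, row))
        if key_column is None:
            # No primary key: the row becomes a blank node.
            subject = f"_:row{number}"
        else:
            key = quote(key_column, safe="")
            subject = f"<{table_iri}/{key}={quote(record[key_column], safe='')}>"
        triples.append((subject, RDF_TYPE, f"<{table_iri}>"))
        for column, value in record.items():
            if value != "":  # treat empty cells as NULL and skip them
                predicate = f"<{table_iri}#{quote(column, safe='')}>"
                triples.append((subject, predicate, literal(value)))
    return triples


if __name__ == "__main__":
    sample = "id,name\n7,Bob\n"
    for triple in direct_map(sample, "People", "http://example.com/",
                             key_column="id"):
        print(" ".join(triple), ".")
```

Everything after the header/key decision is mechanical, which is the point:
once those two facts are stated, no further input is needed.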

Now with R2RML, I believe some changes need to be made. R2RML was made to
take advantage of SQL as much as possible; that is why you can define a
mapping on a table or on a SQL query. Take for example the following R2RML
mappings for Musicbrainz [4]. You can see that the tuples from "SELECT *
FROM artist WHERE artist.type = 1" are mapped to instances of
mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure how
to do this without a SQL engine. Therefore, should SQL engines be involved
in the CSV to RDF transformation?
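
One way to picture "involving a SQL engine" is to load the CSV into an
embedded engine and run the two queries quoted above unchanged. The Python
sketch below does that with the standard sqlite3 module. It is only an
illustration: the function name classify_artists, the id and name columns,
and the sample rows are assumptions; only the two queries and the two class
names come from the mapping discussed above.

```python
import csv
import io
import sqlite3

# The two SQL queries quoted above, keyed by the class their rows map to.
CLASS_QUERIES = {
    "mo:SoloMusicArtist": "SELECT * FROM artist WHERE artist.type = 1",
    "mo:MusicGroup": "SELECT * FROM artist WHERE artist.type = 2",
}


def classify_artists(csv_text):
    """Load an artist CSV into in-memory SQLite and run each class query.

    Assumes the first row is a header containing at least "id" and "type",
    and that header names contain no double quotes.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    # NUMERIC affinity makes SQLite store "1" as the integer 1, so that
    # "artist.type = 1" matches; CSV cells are otherwise plain text.
    columns = ", ".join(f'"{name}" NUMERIC' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE artist ({columns})")
    conn.executemany(f"INSERT INTO artist VALUES ({marks})", data)
    result = {
        cls: sorted(row["id"] for row in conn.execute(query))
        for cls, query in CLASS_QUERIES.items()
    }
    conn.close()
    return result


if __name__ == "__main__":
    # Made-up sample data, not taken from Musicbrainz.
    sample = "id,name,type\n1,Solo A,1\n2,Band B,2\n3,Solo C,1\n"
    print(classify_artists(sample))
```

Note that the sketch already has to decide how CSV text becomes typed SQL
values, which is exactly the kind of detail a CSV2RDF spec would need to
pin down if SQL were involved.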

Another instance where R2RML relies heavily on SQL is when you want to
translate database codes into IRIs [5]. For example, you may have a code
value "eng" which should be mapped to some URI,
http://example.com/engineering, which is part of a well-defined
thesaurus/vocabulary.
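
Without SQL, the same translation can be expressed as a plain lookup table.
A minimal Python sketch follows; the names CODE_TO_IRI and translate_codes
and the "dept" column are hypothetical, and only the "eng" code and its URI
come from the example above.

```python
import csv
import io

# Translation table from code values to IRIs. Only the "eng" entry comes
# from the example above; a real table would come from the vocabulary.
CODE_TO_IRI = {
    "eng": "http://example.com/engineering",
}


def translate_codes(csv_text, column, table=CODE_TO_IRI):
    """Return (record, iri) pairs for each data row of a CSV with a header.

    iri is None when the code in `column` is not in the translation table,
    so the caller can decide whether to skip the row or report it.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(dict(record), table.get(record[column])) for record in reader]


if __name__ == "__main__":
    sample = "id,dept\n1,eng\n2,xyz\n"
    for record, iri in translate_codes(sample, "dept"):
        print(record["id"], iri)
```

A CSV2RDF mapping language would still need a standard way to carry such a
table; the sketch only shows that the lookup itself does not need SQL.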

We have implemented CSV2RDF in Ultrawrap, which uses the Direct Mapping
and R2RML standards as-is. The only assumptions we make at the moment are
that the first row is a header and that the first attribute acts as a
primary key.

Another topic I've discussed with danbri is the case where you have a set
of CSVs which are basically the CSV dumps of all the relational tables of a
database. In that case there are implicit foreign keys, so there should be
a way to describe the relationships (foreign keys) between different CSVs.
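
As a rough picture of what such a description could look like, the Python
sketch below states one foreign key as a small record and uses it to link
rows of one CSV to rows of another. Everything here is an assumption for
illustration: the ForeignKey class, the function names, the People and
Addresses tables, and the example.com base IRI. The "#ref-" property shape
follows the pattern the Direct Mapping spec uses for references.

```python
import csv
import io
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class ForeignKey:
    """Hypothetical description of one single-column foreign key."""
    table: str        # CSV (table) holding the reference
    column: str       # referencing column in that CSV
    ref_table: str    # CSV (table) being referenced
    ref_column: str   # key column of the referenced CSV


def row_iri(base, table, key_column, value):
    """Build a row IRI of the form base + Table/key=value."""
    return (f"{base}{quote(table, safe='')}/"
            f"{quote(key_column, safe='')}={quote(value, safe='')}")


def link_triples(csv_text, key_column, fk, base):
    """Return (subject, predicate, object) IRI triples for one foreign key.

    Assumes the first row is a header. The sketch does not check that the
    referenced row actually exists in the other CSV.
    """
    predicate = (f"{base}{quote(fk.table, safe='')}"
                 f"#ref-{quote(fk.column, safe='')}")
    triples = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        target = record[fk.column]
        if target == "":  # empty cell: no reference for this row
            continue
        triples.append((
            row_iri(base, fk.table, key_column, record[key_column]),
            predicate,
            row_iri(base, fk.ref_table, fk.ref_column, target),
        ))
    return triples


if __name__ == "__main__":
    fk = ForeignKey("People", "addr", "Addresses", "ID")
    people = "ID,fname,addr\n7,Bob,18\n8,Sue,\n"
    for triple in link_triples(people, "ID", fk, "http://example.com/"):
        print(triple)
```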

These are my initial thoughts. Looking forward to hearing what others have
to say.

[1] http://www.sciencedirect.com/science/article/pii/S1570826813000383
[2] http://www.capsenta.com/
[3] https://www.constituteproject.org/
[4] https://github.com/LinkedBrainz/MusicBrainz-R2RML/blob/master/mappings/artist.ttl
[5] http://www.w3.org/TR/r2rml/#example-translationtable

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com
Received on Wednesday, 12 February 2014 16:58:18 UTC