CSV2RDF and R2RML from Andy Seaborne on 2014-02-18 (public-csv-wg@w3.org from February 2014)

From: Andy Seaborne <andy@apache.org>
Date: Tue, 18 Feb 2014 11:25:26 +0000
To: public-csv-wg@w3.org
Message-ID: <53034326.9000908@apache.org>
On 12/02/14 16:57, Juan Sequeda wrote:
> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
> Part of these thoughts come from conversations that I have had
> previously with danbri.
>
> I saw in today's minutes that the RDB2RDF topic came up. I agree with
> Axel that "CSV2RDF should be just a "dialect/small modification" of the
> existing RDB2RDF spec". I actually encourage that there exists both a
> Direct Mapping (completely automated mapping) and a modification of R2RML.
>
> The following issues arise:
> - How do you know if the first column is a header or not.
> - How do you know if there exists an id attribute/field which acts as a
> unique identifier for the tuple (i.e primary key).
>
> Therefore, there needs to be a way to state this in a standard way. I'm
> assuming this is going to go somewhere. Given this information, the
> Direct Mapping standard should apply transparently (or so I believe at
> this moment).
>
> Now with R2RML, I believe some changes need to be made. R2RML was made
> to take advantage of SQL as much as possible; that is why you can define
> a mapping on table or on a sql query. Take for example the following
> R2RML mappings for Musicbrainz [4]. You can see that the tuples from
> "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
> mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
> how to do this without a SQL engine. Therefore, should SQL engines be
> involved in the CSV to RDF transformation?
>
> Another instance where R2RML relies heavily on SQL is when you want to
> translate database codes into IRIs [5]. For example, if you have a code
> value "eng" which should be mapped to some URI
> http://example.com/engineering, which is part of a well defined
> thesaurus/vocabulary.

I agree that R2RML is a possible starting point and also that it does 
not apply automatically.

There often isn't an explicit primary key nor proper foreign keys.  If 
the CSV conversion process can influence the CSV format, then there is a 
lot that can be done but if the CSV format is fixed, it may not be ideal.

A single table may be a denormalized view and somehow the data 
structuring needs to be put back into the output.

It might be useful if we have a very simple concrete synthetic example 
to talk about in discussing conversion options.

Here's a contribution:

----------------------------
"Sales Region"," Quarter"," Sales"
"North","Q1",10
"North","Q2",15
"North","Q3",7
"North","Q4",25
"South","Q1",9
"South","Q2",15
"South","Q3",16
"South","Q4",31
----------------------------

There are two sales regions, each with 4 sales results.

This needs some kind of term resolution to turn e.g. "North" into a URI 
for the northern sales region.  It could be by an external lookup or by 
URI template as in R2RML. External lookup gives better linking.
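
As a rough sketch of those two options (not anything from R2RML itself):
the lookup table, the example.org URIs and the function names below are
all made up for illustration. An external lookup is tried first, and a
URI template is used as the fallback.

```python
from urllib.parse import quote

# Hypothetical external lookup: region label -> an existing, well-linked URI.
REGION_LOOKUP = {
    "North": "http://example.org/id/region/north",
}

# Hypothetical template in the spirit of an R2RML rr:template.
REGION_TEMPLATE = "http://example.org/region/{Sales Region}"


def expand_template(template, row):
    """Fill {column} placeholders with percent-encoded cell values."""
    uri = template
    for column, value in row.items():
        uri = uri.replace("{" + column + "}", quote(str(value), safe=""))
    return uri


def resolve_region(row):
    """Prefer the external lookup; fall back to the URI template."""
    label = row["Sales Region"]
    if label in REGION_LOOKUP:
        return REGION_LOOKUP[label]
    return expand_template(REGION_TEMPLATE, row)
```

Only the lookup branch can reuse URIs that other datasets already point
at; the template branch always mints something local.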

Defining "views" may help replace the SQL with something else.
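
One possible reading of a "view", purely as an assumption for discussion:
a named set of column/value conditions that selects rows, playing the
role of a SQL WHERE clause without a SQL engine. The view names and the
select() helper are hypothetical.

```python
# Rows as they might come out of a CSV parser (values kept as strings).
ROWS = [
    {"Sales Region": "North", "Quarter": "Q1", "Sales": "10"},
    {"Sales Region": "North", "Quarter": "Q2", "Sales": "15"},
    {"Sales Region": "South", "Quarter": "Q1", "Sales": "9"},
]

# Hypothetical view definitions: view name -> {column: required value}.
VIEWS = {
    "north_sales": {"Sales Region": "North"},
    "q1_sales": {"Quarter": "Q1"},
}


def select(rows, conditions):
    """Return the rows whose cells match every column/value condition."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in conditions.items())
    ]
```

Each view could then be given its own mapping, much as the two SQL
queries in the Musicbrainz example are.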

Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
lift out the data in a more useful form.  I have doubts about two-stage
processes like this because the real outcome is going to be the first step
alone.  Rows-encoded-in-RDF then pushes the burden onto the data
consumer; it's a barrier to reuse. Of course, it's easier to add
mechanically.
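
To make the two stages concrete, here is a minimal sketch using plain
tuples as triples rather than a real RDF library. The example.org
namespace, the row/column URI scheme and both function names are
assumptions, and this is not the W3C Direct Mapping algorithm, only
something of the same shape.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace

SAMPLE = '''"Sales Region"," Quarter"," Sales"
"North","Q1",10
"North","Q2",15
"South","Q1",9
'''


def column_uri(name):
    """Mint a predicate URI for a column name (assumed scheme)."""
    return BASE + "col/" + quote(name, safe="")


def direct_mapping(text):
    """Stage 1: one subject per row, one triple per cell."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip() for name in rows[0]]  # drop stray header spaces
    triples = []
    for number, row in enumerate(rows[1:], start=1):
        subject = f"{BASE}row/{number}"
        for name, value in zip(header, row):
            triples.append((subject, column_uri(name), value))
    return triples


def lift(triples):
    """Stage 2: regroup the row-shaped triples as region -> quarter -> sales."""
    cells_by_row = {}
    for subject, predicate, value in triples:
        cells_by_row.setdefault(subject, {})[predicate] = value
    result = {}
    for cells in cells_by_row.values():
        region = cells[column_uri("Sales Region")]
        quarter = cells[column_uri("Quarter")]
        result.setdefault(region, {})[quarter] = int(cells[column_uri("Sales")])
    return result
```

If only direct_mapping() is ever run, every consumer has to write their
own lift().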

The whole area of times and dates is messy but important.

Calculations might be done in a way that utilizes JavaScript, which
given the likely audience makes it not all new technology, and is a
route to custom conversion.

 Andy
Received on Tuesday, 18 February 2014 11:25:59 UTC