Re: CSV2RDF and R2RML

Andy Seaborne wrote:
> On 12/02/14 16:57, Juan Sequeda wrote:
>> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
>> Part of these thoughts come from conversations that I have had
>> previously with danbri.
>>
>> I saw in today's minutes that the RDB2RDF topic came up. I agree with
>> Axel that "CSV2RDF should be just a "dialect/small modification" of the
>> existing RDB2RDF spec". I actually encourage that there exists both a
>> Direct Mapping (completely automated mapping) and a modification of R2RML.
>>
>> The following issues arise:
>> - How do you know if the first row is a header or not?
>> - How do you know if there exists an id attribute/field which acts as a
>> unique identifier for the tuple (i.e., a primary key)?
>>
>> Therefore, there needs to be a way to state this in a standard way. I'm
>> assuming this is going to go somewhere. Given this information, the
>> Direct Mapping standard should apply transparently (or so I believe at
>> this moment).
>>
>> Now with R2RML, I believe some changes need to be made. R2RML was made
>> to take advantage of SQL as much as possible; that is why you can define
>> a mapping on a table or on a SQL query. Take for example the following
>> R2RML mappings for Musicbrainz [4]. You can see that the tuples from
>> "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
>> mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
>> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
>> how to do this without a SQL engine. Therefore, should SQL engines be
>> involved in the CSV to RDF transformation?
>>
>> Another instance where R2RML relies heavily on SQL is when you want to
>> translate database codes into IRIs [5]. For example, you may have a code
>> value "eng" which should be mapped to some URI
>> http://example.com/engineering that is part of a well defined
>> thesaurus/vocabulary.
>
> I agree that R2RML is a possible starting point and also that it does not
> apply automatically.
>
> There often isn't an explicit primary key nor proper foreign keys.  If the CSV
> conversion process can influence the CSV format, then there is a lot that can
> be done but if the CSV format is fixed, it may not be ideal.
I am not sure what you mean here when you say 'influence'. I presume we should
take the CSV format as a black box, meaning that this group will not (and should
not) define what CSV is and we should be able to take anything that is out there....

The attached metadata may be used to define a 'primary key', i.e., a column that
serves as such, but I do not think we should even talk about foreign keys in
this context...
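
To make the metadata idea concrete, here is a minimal sketch in Python, under
my own assumptions: the `header` and `primary_key` keys, the
http://example.org/ base IRI and the `direct_map` function are hypothetical
names for illustration, not anything this group has defined. Each row becomes
a set of triples; the subject is built from the primary key column when one is
declared, and from the row number otherwise.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/sales/"  # hypothetical base IRI


def direct_map(text, metadata):
    """Yield (subject, predicate, object) triples for each CSV row.

    `metadata` is an assumed, non-standard dict:
      header      -- True if the first row holds column names
      primary_key -- name of the identifying column, or None
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return
    if metadata.get("header", True):
        names, body = rows[0], rows[1:]
    else:
        # No header row: invent positional column names.
        names = ["col%d" % (i + 1) for i in range(len(rows[0]))]
        body = rows
    key = metadata.get("primary_key")
    for n, row in enumerate(body, start=1):
        record = dict(zip(names, row))
        if key is not None:
            subject = BASE + "row/" + quote(record[key], safe="")
        else:
            subject = BASE + "row/" + str(n)
        for name, value in record.items():
            yield (subject, BASE + "col/" + quote(name, safe=""), value)


# Made-up sample; the "id" column is an assumption.
sample = "id,region,sales\nr1,North,10\nr2,South,9\n"
triples = list(direct_map(sample, {"header": True, "primary_key": "id"}))
```

Nothing here needs foreign keys or a SQL engine; the two metadata flags are
the only input beyond the file itself.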

The problem with R2RML, as Juan's example above shows, is that it goes very quickly
to the usage of SQL as a tool for all kinds of clever tricks. Which is great for
R2RML, but I do not think we should go down that route for CSV. I am curious
what our use cases will reveal in this sense, but my gut feeling tells me that
most of the CSV data out there are structurally simple (albeit possibly huge...).

Ivan
>
> A single table may be a denormalized view and somehow the data structuring
> needs to be put back into the output.
>
> It might be useful if we have a very simple concrete synthetic example to talk
> about in discussing conversion options.
>
> Here's a contribution:
>
> ----------------------------
> "Sales Region"," Quarter"," Sales"
> "North","Q1",10
> "North","Q2",15
> "North","Q3",7
> "North","Q4",25
> "South","Q1",9
> "South","Q2",15
> "South","Q3",16
> "South","Q4",31
> ----------------------------
>
> There are two sales regions, each with 4 sales results.
>
> This needs some kind of term resolution to turn e.g. "North" into a URI for
> the northern sales region.  It could be by an external lookup or by URI
> template as in R2RML. External lookup gives better linking.
>
> Defining "views" may help to replace the SQL with something else.
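One possible reading of "views", sketched in Python under stated assumptions:
a view is just a named predicate over a row, paired with the class to assign.
The type codes and class names come from the Musicbrainz queries Juan quoted;
the `VIEWS` structure, the `classify` function, the column names and the
sample data are made up for illustration.

```python
import csv
import io

# Hypothetical "views": a name, a row predicate, and the class to assign.
VIEWS = [
    ("solo", lambda row: row["type"] == "1", "mo:SoloMusicArtist"),
    ("group", lambda row: row["type"] == "2", "mo:MusicGroup"),
]


def classify(text, views=VIEWS):
    """Return (artist name, class) pairs for rows matched by some view."""
    results = []
    for row in csv.DictReader(io.StringIO(text)):
        for _name, predicate, rdf_class in views:
            if predicate(row):
                results.append((row["name"], rdf_class))
    return results


# Made-up sample data; the column names are assumptions.
artists = "name,type\nArtist A,1\nBand B,2\nUnknown C,3\n"
typed = classify(artists)
```

This covers the simple "WHERE column = value" case without a SQL engine; joins
and aggregates would need more than a per-row predicate.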
>
> Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to lift
> out the data in a more useful form.  I have doubts about two-stage processes
> like this because the real outcome is going to be the first step alone.
> Rows-encoded-in-RDF then pushes the burden onto the data consumer; it's a
> barrier to reuse. Of course, it's easier to add mechanically.
>
> The whole area of times and dates is messy but important.
> Calculations might be done in a way that utilizes JavaScript, which given the
> likely audience makes it not all new technology, and is a route to custom
> conversion.
>
>     Andy
>
>

Received on Tuesday, 18 February 2014 11:50:28 UTC