Re: CSV2RDF and R2RML

Andy,


On Tue, Feb 18, 2014 at 5:25 AM, Andy Seaborne <andy@apache.org> wrote:

> On 12/02/14 16:57, Juan Sequeda wrote:
>
>> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
>> Part of these thoughts comes from conversations that I have had
>> previously with danbri.
>>
>> I saw in today's minutes that the RDB2RDF topic came up. I agree with
>> Axel that "CSV2RDF should be just a 'dialect/small modification' of the
>> existing RDB2RDF spec". I would actually encourage having both a Direct
>> Mapping (a completely automated mapping) and a modification of R2RML.
>>
>> The following issues arise:
>> - How do you know whether the first row is a header or not?
>> - How do you know whether there exists an id attribute/field which acts
>> as a unique identifier for the tuple (i.e., a primary key)?
>>
>> Therefore, there needs to be a standard way to state this. I'm assuming
>> such a declaration is going to live somewhere. Given this information,
>> the Direct Mapping standard should apply transparently (or so I believe
>> at this moment).
>>
>> Now with R2RML, I believe some changes need to be made. R2RML was made
>> to take advantage of SQL as much as possible; that is why you can define
>> a mapping on a table or on a SQL query. Take, for example, the following
>> R2RML mappings for Musicbrainz [4]. You can see that the tuples from
>> "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
>> mo:SoloMusicArtist, while tuples from "SELECT * FROM artist WHERE
>> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
>> how to do this without a SQL engine. Therefore, should SQL engines be
>> involved in the CSV to RDF transformation?
>>
>> Another instance where R2RML relies heavily on SQL is when you want to
>> translate database codes into IRIs [5]. For example, a code value "eng"
>> may need to be mapped to a URI such as http://example.com/engineering
>> that is part of a well-defined thesaurus/vocabulary.
>>
>
> I agree that R2RML is a possible starting point and also that it does not
> apply automatically.
>
> There often isn't an explicit primary key, nor are there proper foreign keys.


That is why I suggest that there be a standard way of defining which
attribute/column can "act" as a primary key.

The same goes for foreign keys. But that may be putting the cart before the
horse at the moment.
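
To make the primary-key part concrete: in R2RML the choice of key effectively
surfaces in the subject template. A rough sketch of what the analogous
declaration could look like over a CSV file (using the file name as the
logical table is my own loose adaptation, not standard R2RML; the column name
and IRIs are placeholders):

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#ArtistMap>
    # Hypothetical adaptation: the CSV file stands in for the logical table.
    rr:logicalTable [ rr:tableName "artist.csv" ] ;
    # Declaring that the "id" column acts as the primary key,
    # by using it to mint the subject IRI.
    rr:subjectMap [ rr:template "http://example.com/artist/{id}" ] .
----------------------------

A foreign-key declaration would presumably map onto something like R2RML's
rr:joinCondition between two such maps, but as I said, that is probably
getting ahead of ourselves.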


>  If the CSV conversion process can influence the CSV format, then there is
> a lot that can be done, but if the CSV format is fixed, it may not be ideal.
>

What do you mean by "CSV conversion process"? Is it some preprocessing step
(maybe done as a SQL query) before the data gets generated as a CSV?


> A single table may be a denormalized view and somehow the data structuring
> needs to be put back into the output.
>
> It might be useful if we have a very simple concrete synthetic example to
> talk about in discussing conversion options.
>
> Here's a contribution:
>
> ----------------------------
> "Sales Region"," Quarter"," Sales"
> "North","Q1",10
> "North","Q2",15
> "North","Q3",7
> "North","Q4",25
> "South","Q1",9
> "South","Q2",15
> "South","Q3",16
> "South","Q4",31
> ----------------------------
>
> There are two sales regions, each with 4 sales results.
>
> This needs some kind of term resolution to turn e.g. "North" into a URI
> for the northern sales region.  It could be by an external lookup or by URI
> template as in R2RML. External lookup gives better linking.
>
> Defining "views" may help replace the SQL with something.
>

In this example, what would be the subject?
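
For instance, is the intent that each row becomes its own resource, identified
by the (region, quarter) pair? Something like the following, where the IRIs
and property names are pure placeholders:

----------------------------
@prefix ex: <http://example.com/def#> .

<http://example.com/sales/North/Q1>
    ex:salesRegion <http://example.com/region/North> ;
    ex:quarter     "Q1" ;
    ex:sales       10 .
----------------------------

Or is the subject the sales region itself, with the per-quarter figures
hanging off it? That choice seems to be exactly what a mapping language would
have to let you express.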


>
> Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
> lift out the data in a more useful form.  I have doubts about two-stage
> processes like this because the real outcome is likely to be the first step
> alone.  Rows-encoded-in-RDF then pushes the burden onto the data consumer;
> it's a barrier to reuse. Of course, it's easier to add mechanically.
>
>
The Direct Mapping is useful for getting quick/dirty RDF. If your schema is
normalized, or in this case, if the CSV is normalized, then the RDF that
comes out is fairly "good".
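
For Andy's sales example, my mental model of the directly mapped output is
roughly the following (blank-node subjects because no key has been declared,
column names percent-encoded into predicate IRIs, and plain literals since a
CSV carries no datatypes; this is a sketch of the idea, not the spec output):

----------------------------
@base <http://example.com/> .

# One blank node per row; only the first row is shown.
[] a <sales> ;
   <sales#Sales%20Region> "North" ;
   <sales#Quarter>        "Q1" ;
   <sales#Sales>          "10" .
# ...and so on for the remaining seven rows.
----------------------------

That is the "rows encoded in RDF" shape Andy describes; whether it is good
enough really depends on how normalized the CSV already is.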

Additionally, there may be users who would want to do an RDF->RDF
transformation on top of that output. This is where the Direct Mapping helps.

If users want to go from CSV to the target RDF in a single step, then this is
where a mapping language comes in.
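
For the sales example, I imagine that single-step mapping looking roughly like
R2RML with the CSV file standing in for the logical table. Again, this is not
standard R2RML, just a sketch of what the "small modification" could be (the
ex: properties are placeholders, and I am glossing over the space in the
"Sales Region" column name):

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/def#> .

<#SalesMap>
    # Hypothetical adaptation: a CSV file instead of a SQL table.
    rr:logicalTable [ rr:tableName "sales.csv" ] ;
    rr:subjectMap [
        rr:template "http://example.com/sales/{Sales Region}/{Quarter}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:salesRegion ;
        # Term resolution by URI template, as Andy mentions;
        # an external lookup would need something beyond this.
        rr:objectMap [ rr:template "http://example.com/region/{Sales Region}" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:sales ;
        rr:objectMap [ rr:column "Sales" ]
    ] .
----------------------------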

Nevertheless, I'm a huge advocate for automation, hence the Direct Mapping.
Actually, we have been observing that Ultrawrap's users usually first run
the Direct Mapping. The resulting mapping is represented as R2RML. Then
they go in to edit the R2RML mapping.
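
To make the earlier Musicbrainz point concrete, that kind of edited mapping
looks roughly like this (from memory and heavily abbreviated; I am assuming
the usual Music Ontology namespace for mo:, and the subject-template column
is a guess):

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix mo: <http://purl.org/ontology/mo/> .

<#SoloArtistMap>
    # This is exactly where the SQL dependence shows up:
    # the logical table is a query, not a table or a file.
    rr:logicalTable [
        rr:sqlQuery "SELECT * FROM artist WHERE artist.type = 1"
    ] ;
    rr:subjectMap [
        rr:template "http://example.com/artist/{id}" ;   # column is a guess
        rr:class mo:SoloMusicArtist
    ] .

# <#GroupMap> would be the same shape with artist.type = 2 and mo:MusicGroup.
----------------------------

Replicating that WHERE clause without a SQL engine is the open question I
raised above.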


> The whole area of times and dates is messy but important.
>
> Calculations might be done in a way that utilizes JavaScript, which, given
> the likely audience, makes it not entirely new technology, and is a route to
> custom conversion.
>
>         Andy
>
>
>

Received on Tuesday, 18 February 2014 15:04:43 UTC