Re: CSV2RDF and R2RML

Juan,

On 18/02/14 16:34, Ivan Herman wrote:
>
>
> Juan Sequeda wrote:
>> Andy,
>>
>>
>> On Tue, Feb 18, 2014 at 5:25 AM, Andy Seaborne <andy@apache.org
>> <mailto:andy@apache.org>> wrote:
>>
>>      On 12/02/14 16:57, Juan Sequeda wrote:
>>
>>          So... I believe I can bring some thoughts to the table wrt CSV to RDF.
>>          Part of these thoughts come from conversations that I have had
>>          previously with danbri.
>>
>>          I saw in today's minutes that the RDB2RDF topic came up. I agree with
>>          Axel that "CSV2RDF should be just a "dialect/small modification" of the
>>          existing RDB2RDF spec". I actually encourage that there exists both a
>>          Direct Mapping (completely automated mapping) and a modification of R2RML.
>>
>>          The following issues arise:
>>          - How do you know if the first column is a header or not.
>>          - How do you know if there exists an id attribute/field which acts as a
>>          unique identifier for the tuple (i.e primary key).
>>
>>          Therefore, there needs to be a way to state this in a standard way. I'm
>>          assuming this is going to go somewhere. Given this information, the
>>          Direct Mapping standard should apply transparently (or so I believe at
>>          this moment).
>>
>>          Now with R2RML, I believe some changes need to be made. R2RML was made
>>          to take advantage of SQL as much as possible; that is why you can define
>>          a mapping on table or on a sql query. Take for example the following
>>          R2RML mappings for Musicbrainz [4]. You can see that the tuples from
>>          "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
>>          mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
>>          artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
>>          how to do this without a SQL engine. Therefore, should SQL engines be
>>          involved in the CSV to RDF transformation?
>>
>>          Another instance where R2RML relies heavily on SQL is when you want to
>>          translate database codes into IRIs [5]. For example, if you have a code
>>          value "eng" which should be mapped to some URI
>>          http://example.com/engineering, which is part of a well defined
>>          thesaurus/vocabulary.
>>
>>
>>      I agree that R2RML is a possible starting point and also that it does not
>>      apply automatically.
>>
>>      There often isn't an explicit primary key nor proper foreign keys.
>>
>>
>> That is why I suggest that there is a standard way of defining which
>> attribute/column can "act" as a primary key.
>
> Right. This should be part of the metadata that we will define for CSV anyway
>>
>> Same for foreign key. But that may be putting the cart ahead of the horse at
>> the moment.
>
> +1 for putting the cart...
>
> Ivan
>
>>
>>
>>       If the CSV conversion process can influence the CSV format, then there is
>>      a lot that can be done but if the CSV format is fixed, it may not be ideal.
>>
>>
>> What do you mean by "CSV conversion process"? Is it some pre processing step
>> (maybe done as a SQL query) before the data gets generated as a CSV?
>>
>>
>>      A single table may be a denormalized view and somehow the data structuring
>>      needs to be put back into the output.
>>
>>      It might be useful if we have a very simple concrete synthetic example to
>>      talk about in discussing conversion options.
>>
>>      Here's a contribution:
>>
>>      ----------------------------
>>      "Sales Region"," Quarter"," Sales"
>>      "North","Q1",10
>>      "North","Q2",15
>>      "North","Q3",7
>>      "North","Q4",25
>>      "South","Q1",9
>>      "South","Q2",15
>>      "South","Q3",16
>>      "South","Q4",31
>>      ----------------------------
>>
>>      There are two sales regions, each with 4 sales results.
>>
>>      This needs some kind of term resolution to turn e.g. "North" into a URI
>>      for the northern sales region.  It could be by an external lookup or by
>>      URI template as in R2RML. External lookup gives better linking.
>>
>>      Defining "views" may help replacing the SQL with something.
>>
>>
>> In this example, what would be the subject?

While we could use the row number as the basis of the primary key, I 
think that *may* lead to low-value data.

Just because you can convert a data table to some RDF does not mean the 
result is worth much: if the URIs are all locally generated, I'm not 
sure there is strong value in a standard here.
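
To make that concrete, here is a rough sketch of what row-number-keyed 
output from the example CSV above might look like (the base URIs and 
property names are invented here for illustration):

prefix row: <http://example/local/row/>    # hypothetical, locally minted
prefix :    <http://example/local/def#>    # hypothetical, locally minted

row:1 :SalesRegion "North" ; :Quarter "Q1" ; :Sales 10 .
row:2 :SalesRegion "North" ; :Quarter "Q2" ; :Sales 15 .

Every URI is minted by the conversion itself, so nothing in the output 
links to anything outside the file.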

In this example we would ideally use "North" to resolve to a URI in the 
corporate data dictionary, because the "Sales Region" column is known to 
be a key (inverse functional property).

"North" need not appear in the output.

Given:

prefix corp: <http://mycorp/globalDataDictionary/>

corp:region1 :Name "North" .
corp:region2 :Name "South" .

We might get from row one:

corp:region1 :sales [ :period "Q1" ; :value 10 ] .

(including a blank node - a separate discussion! - let's use generated 
ids for now:)

corp:region1 :sales gen:57 .
gen:57 :period "Q1" ;
       :value 10 .


or a different style:

<http://corp/file/row1>
        :region corp:region1 ;
        :period "Q1 ;
        :sales 10 .



In my limited exposure to R2RML usage, the majority has been direct 
mapping, with the app (SPARQL queries) directly and crudely pulling 
values out of the data.  There is no RDF to RDF uplifting.  That seems 
to be caused by the need for upfront investment, and by mixing the 
responsibilities of access and modelling.

The better, full mapping language of R2RML does not get the investment 
(quality of tools seems to be an issue - too much expectation of free 
open source, maybe?).

Being "devils advocate" here ...
I do wonder if the WG really does need to produce a *standardised* CSV 
to RDF mapping or whether the most important part is to add the best 
metadata to the CSV file and let different approaches flourish.

This is based on looking at the roles and responsibilities in the 
publishing chain: the publisher provides CSV files and the metadata - do 
they provide the RDF processing algorithm as well?  Or does that 
involve consideration by the data consumer of how they intend to use the 
tabular data?
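
As a purely illustrative sketch of the sort of metadata I mean (the 
vocabulary below is invented for this message, not a proposal), the 
publisher might describe the file with something like:

prefix meta: <http://example/csv-meta#>    # hypothetical vocabulary

<sales.csv>  meta:headerRow true ;
             meta:column [ meta:name "Sales Region" ;
                           meta:key  true ;
                           meta:lookup <http://mycorp/globalDataDictionary/> ] ;
             meta:column [ meta:name "Quarter" ] ;
             meta:column [ meta:name "Sales" ] .

Different consumers could then drive a direct mapping, an R2RML-style 
mapping, or their own conversion off the same description.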

	Andy

>>
>>
>>
>>      Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
>>      lift out the data in a more useful form.  I have doubts about two-stage
>>      processes like this because the real outcome is going to be the first step
>>      alone.  Rows-encoded-in-RDF then pushes the burden onto the data consumer;
>>      it's a barrier to reuse. Of course, it's easier to add mechanically.
>>
>>
>> The Direct mapping is useful to have quick/dirty RDF. If your schema is
>> normalized, or in this case, if the CSV is normalized, then the RDF that comes
>> out is fairly "good".
>>
>> Additionally, there may be users who would want to do a RDF->RDF
>> transformation. This is where Direct Mapping helps
>>
>> If users want to a CSV -> RDF transformation, then this is where a mapping
>> language comes in.
>>
>> Nevertheless, I'm a huge advocate for automation, hence the Direct Mapping.
>> Actually, we have been observing that Ultrawrap's users usually first run the
>> Direct Mapping. The resulting mapping is represented as R2RML. Then they go in
>> to edit the R2RML mapping.
>>
>>
>>      The whole area of times and dates is messy but important.
>>
>>      Calculations might be done in a way that utilizes javascript, which given
>>      the likely audience makes it not all new technology, and is a route to
>>      custom conversion.
>>
>>              Andy
>>
>>
>>
>
