Re: CSV2RDF and R2RML

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Thu, Feb 20, 2014 at 8:50 AM, Andy Seaborne <andy@apache.org> wrote:

> Juan,
>
>
> On 18/02/14 16:34, Ivan Herman wrote:
>
>>
>>
>> Juan Sequeda wrote:
>>
>>> Andy,
>>>
>>>
>>> On Tue, Feb 18, 2014 at 5:25 AM, Andy Seaborne <andy@apache.org
>>> <mailto:andy@apache.org>> wrote:
>>>
>>>      On 12/02/14 16:57, Juan Sequeda wrote:
>>>
>>>          So... I believe I can bring some thoughts to the table wrt CSV
>>> to RDF.
>>>          Part of these thoughts come from conversations that I have had
>>>          previously with danbri.
>>>
>>>          I saw in today's minutes that the RDB2RDF topic came up. I agree
>>>          with Axel that "CSV2RDF should be just a 'dialect/small
>>>          modification' of the existing RDB2RDF spec". I would actually
>>>          encourage having both a Direct Mapping (a completely automated
>>>          mapping) and a modification of R2RML.
>>>
>>>          The following issues arise:
>>>          - How do you know whether the first row is a header?
>>>          - How do you know whether there is an id attribute/field that
>>>            acts as a unique identifier for the tuple (i.e. a primary key)?
>>>
>>>          Therefore, there needs to be a standard way to state this. I'm
>>>          assuming this will go somewhere. Given this information, the
>>>          Direct Mapping standard should apply transparently (or so I
>>>          believe at this moment).
>>>
>>>          Now with R2RML, I believe some changes need to be made. R2RML
>>>          was made to take advantage of SQL as much as possible; that is
>>>          why you can define a mapping on a table or on a SQL query. Take,
>>>          for example, the following R2RML mappings for Musicbrainz [4].
>>>          You can see that the tuples from "SELECT * FROM artist WHERE
>>>          artist.type = 1" are mapped to instances of mo:SoloMusicArtist,
>>>          while tuples from "SELECT * FROM artist WHERE artist.type = 2"
>>>          are mapped to instances of mo:MusicGroup. I'm not sure how to do
>>>          this without a SQL engine. Therefore, should SQL engines be
>>>          involved in the CSV to RDF transformation?
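>>>
>>>          To make this concrete, here is a rough sketch of such a
>>>          query-based mapping in R2RML (not the actual Musicbrainz
>>>          mapping; the {id} column and the example.com URIs are just
>>>          placeholders). A second, analogous triples map would handle
>>>          artist.type = 2 and mo:MusicGroup:
>>>
>>>          @prefix rr: <http://www.w3.org/ns/r2rml#> .
>>>          @prefix mo: <http://purl.org/ontology/mo/> .
>>>
>>>          <#SoloArtistMap>
>>>            rr:logicalTable [
>>>              rr:sqlQuery "SELECT * FROM artist WHERE artist.type = 1"
>>>            ] ;
>>>            rr:subjectMap [
>>>              rr:template "http://example.com/artist/{id}" ;
>>>              rr:class mo:SoloMusicArtist
>>>            ] .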
>>>
>>>          Another instance where R2RML relies heavily on SQL is when you
>>>          want to translate database codes into IRIs [5]. For example, a
>>>          code value "eng" may need to be mapped to the URI
>>>          http://example.com/engineering, which is part of a well-defined
>>>          thesaurus/vocabulary.
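>>>
>>>          With a SQL engine available, one way to handle that translation
>>>          is to push it into the query and expose the result as a column
>>>          whose values are emitted as IRIs. A minimal sketch (the employee
>>>          table, the dept_code column, the ex: namespace and the
>>>          example.com URIs are made up for illustration):
>>>
>>>          @prefix rr: <http://www.w3.org/ns/r2rml#> .
>>>          @prefix ex: <http://example.com/ns#> .
>>>
>>>          <#DeptMap>
>>>            rr:logicalTable [ rr:sqlQuery """
>>>              SELECT id,
>>>                     CASE dept_code
>>>                       WHEN 'eng' THEN 'http://example.com/engineering'
>>>                     END AS dept_iri
>>>              FROM employee""" ] ;
>>>            rr:subjectMap [ rr:template "http://example.com/employee/{id}" ] ;
>>>            rr:predicateObjectMap [
>>>              rr:predicate ex:department ;
>>>              rr:objectMap [ rr:column "dept_iri" ; rr:termType rr:IRI ]
>>>            ] .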
>>>
>>>
>>>      I agree that R2RML is a possible starting point and also that it
>>>      does not apply automatically.
>>>
>>>      There is often neither an explicit primary key nor proper foreign keys.
>>>
>>>
>>> That is why I suggest that there be a standard way of defining which
>>> attribute/column can "act" as a primary key.
>>>
>>
>> Right. This should be part of the metadata that we will define for CSV
>> anyway.
>>
>>>
>>> Same for foreign keys. But that may be putting the cart before the
>>> horse at the moment.
>>>
>>
>> +1 for putting the cart...
>>
>> Ivan
>>
>>
>>>
>>>      If the CSV conversion process can influence the CSV format, then
>>>      there is a lot that can be done, but if the CSV format is fixed, it
>>>      may not be ideal.
>>>
>>>
>>> What do you mean by "CSV conversion process"? Is it some preprocessing
>>> step (maybe done as a SQL query) before the data gets generated as a CSV?
>>>
>>>
>>>      A single table may be a denormalized view and somehow the data
>>> structuring
>>>      needs to be put back into the output.
>>>
>>>      It might be useful to have a very simple, concrete, synthetic
>>>      example to talk about when discussing conversion options.
>>>
>>>      Here's a contribution:
>>>
>>>      ----------------------------
>>>      "Sales Region","Quarter","Sales"
>>>      "North","Q1",10
>>>      "North","Q2",15
>>>      "North","Q3",7
>>>      "North","Q4",25
>>>      "South","Q1",9
>>>      "South","Q2",15
>>>      "South","Q3",16
>>>      "South","Q4",31
>>>      ----------------------------
>>>
>>>      There are two sales regions, each with 4 sales results.
>>>
>>>      This needs some kind of term resolution to turn e.g. "North" into a
>>>      URI for the northern sales region.  It could be done by an external
>>>      lookup or by a URI template as in R2RML. An external lookup gives
>>>      better linking.
>>>
>>>      Defining "views" may help replace the SQL with something else.
>>>
>>>
>>> In this example, what would be the subject?
>>>
>>
> While we could use the row number as the basis of the primary key, I think
> that *may* lead to low-value data.
>
> Just because you can convert a data table to some RDF does not mean much:
> if the URIs are all locally generated, I'm not sure there is strong value
> in a standard here.
>
> In this example we would ideally resolve "North" to a URI in the corporate
> data dictionary, because the "Sales Region" column is known to be a key
> (an inverse functional property).
>
> "North" need not appear in the output.
>
> Given:
>
> prefix corp: <http://mycorp/globalDataDictionary/>
>
> corp:region1 :Name "North" .
> corp:region2 :Name "South" .
>
> We might get from row one:
>
> corp:region1 :sales [ :period "Q1" ; :value 10 ] .
>
> (including a blank node - a separate discussion! - let's use generated ids
> for now:)
>
> corp:region1 :sales gen:57 .
> gen:57 :period "Q1" ;
>        :value 10 .
>
>
> or a different style:
>
> <http://corp/file/row1>
>        :region corp:region1 ;
>        :period "Q1" ;
>        :sales 10 .
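>
> (A purely hypothetical sketch, nothing standardized: if an R2RML-style
> mapping were adapted to CSV, with a made-up {_ROW} token for the row
> number, the last style above might come from something like
>
> <#SalesMap>
>   rr:subjectMap [ rr:template "http://corp/file/row{_ROW}" ] ;
>   rr:predicateObjectMap [
>     rr:predicate :period ;
>     rr:objectMap [ rr:column "Quarter" ]
>   ] ;
>   rr:predicateObjectMap [
>     rr:predicate :sales ;
>     rr:objectMap [ rr:column "Sales" ]
>   ] .
>
> with the "Sales Region" -> corp:region1 resolution done by an external
> lookup, which is exactly the piece R2RML does not give us today.)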
>
>
>
> In my limited exposure to R2RML usage, the majority has been direct
> mapping, with the app (SPARQL queries) directly and crudely pulling values
> out of the data.  There is no RDF to RDF uplifting.  It seems to be
> caused by the need for upfront investment and by the mixing of the
> responsibilities of access and modelling.
>

The most common scenario I see is the following:

First, a user runs the direct mapping. They may want to see a dump of the
data, just to know what it looks like. The direct mapping is represented as
R2RML. Then the user manually edits the R2RML. The initial edits change the
automatically generated URIs into something more user-friendly, e.g.
EMP/ID=1 => employee/1 or EMP#NAME => empName. After that, they start
incorporating other mappings in order to model the data into the desired
RDF output. This includes mapping to existing vocabularies (e.g. empName =>
foaf:name) and concatenating attributes (e.g. concatenating firstname and
lastname and mapping the result to foaf:name). Once the user wants
something more complex, they realize that they should just write a SQL
query, either in the R2RML mapping or as a view in the database.
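
To make that concrete, here is a minimal sketch of where such hand edits
typically end up. The EMP table, the ID/FIRSTNAME/LASTNAME columns and the
example.com URIs are placeholders, not the output of any particular tool:

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#EmpMap>
  rr:logicalTable [ rr:sqlQuery """
    SELECT ID, FIRSTNAME || ' ' || LASTNAME AS FULLNAME FROM EMP""" ] ;
  rr:subjectMap [ rr:template "http://example.com/employee/{ID}" ] ;
  rr:predicateObjectMap [
    rr:predicate foaf:name ;
    rr:objectMap [ rr:column "FULLNAME" ]
  ] .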

Bottom line, it all started with the Direct Mapping. Actually, I haven't
seen anybody start writing an R2RML mapping from scratch; they always start
with the direct mapping.

Note that in this scenario, the user tends to be someone more comfortable
with the database side of things.

Another scenario, which I don't see much, is the RDF to RDF uplifting, as
you called it. This actually only works if you are ETLing the RDB to RDF.
Nevertheless, the direct mapping enables this scenario. I think the reason
we don't see this is that there aren't many tools for it. (Side note: not a
lot of interest in RDF to RDF uplifting, then?)



>
> The fuller, better mapping language of R2RML does not get the investment
> (the quality of tools seems to be an issue; too much expectation of free
> open source, maybe?).
>

At the moment, there is a learning curve (as expected), given the lack of
tools. We are in the process of addressing this issue :)


>
> Being "devil's advocate" here ...
> I do wonder if the WG really does need to produce a *standardised* CSV to
> RDF mapping or whether the most important part is to add the best metadata
> to the CSV file and let different approaches flourish.
>
> This is based on looking at the roles and responsibilities in the
> publishing chain: the publisher provides CSV files and the metadata - do
> they provide the RDF processing algorithm as well?  Or does that involve
> consideration by the data consumer of how they intend to use the tabular
> data?


Good question. I'm only bringing this up because it's in the charter. I look
forward to hearing what danbri and Ivan have to say :)


>
>
>         Andy
>
>
>
>>>
>>>
>>>      Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF
>>>      to lift the data out into a more useful form.  I have doubts about
>>>      two-stage processes like this because the real outcome is often that
>>>      only the first step gets done.  Rows-encoded-in-RDF then pushes the
>>>      burden onto the data consumer; it's a barrier to reuse. Of course,
>>>      it's easier to add mechanically.
>>>
>>>
>>> The Direct Mapping is useful for getting quick/dirty RDF. If your schema
>>> is normalized, or in this case, if the CSV is normalized, then the RDF
>>> that comes out is fairly "good".
>>>
>>> Additionally, there may be users who want to do an RDF->RDF
>>> transformation. This is where the Direct Mapping helps.
>>>
>>> If users want to go directly from CSV to the desired RDF, that is where a
>>> mapping language comes in.
>>>
>>> Nevertheless, I'm a huge advocate for automation, hence the Direct
>>> Mapping.
>>> Actually, we have been observing that Ultrawrap's users usually first
>>> run the
>>> Direct Mapping. The resulting mapping is represented as R2RML. Then they
>>> go in
>>> to edit the R2RML mapping.
>>>
>>>
>>>      The whole area of times and dates is messy but important.
>>>
>>>      Calculations might be done in a way that utilizes JavaScript, which,
>>>      given the likely audience, means it is not all new technology, and
>>>      is a route to custom conversion.
>>>
>>>              Andy
>>>
>>>
>>>
>>>
>>
>
>

Received on Thursday, 20 February 2014 15:28:41 UTC