Re: CSV2RDF and R2RML from Juan Sequeda on 2014-02-18 (public-csv-wg@w3.org from February 2014)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Tue, 18 Feb 2014 09:05:00 -0600
To: Ivan Herman <ivan@w3.org>
Cc: Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAMVTWDzHaiu1BhqC1bqUtCaw_7d58TPenmDKL_YY+2hk3Ttbhg@mail.gmail.com>
On Tue, Feb 18, 2014 at 5:50 AM, Ivan Herman <ivan@w3.org> wrote:

>
>
> Andy Seaborne wrote:
> > On 12/02/14 16:57, Juan Sequeda wrote:
> >> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
> >> Part of these thoughts come from conversations that I have had
> >> previously with danbri.
> >>
> >> I saw in today's minutes that the RDB2RDF topic came up. I agree with
> >> Axel that "CSV2RDF should be just a "dialect/small modification" of the
> >> existing RDB2RDF spec". I actually encourage that there exists both a
> >> Direct Mapping (completely automated mapping) and a modification of
> >> R2RML.
> >>
> >> The following issues arise:
> >> - How do you know if the first row is a header or not?
> >> - How do you know if there exists an id attribute/field which acts as a
> >> unique identifier for the tuple (i.e., primary key)?
> >>
> >> Therefore, there needs to be a way to state this in a standard way. I'm
> >> assuming this is going to go somewhere. Given this information, the
> >> Direct Mapping standard should apply transparently (or so I believe at
> >> this moment).
> >>
> >> Now with R2RML, I believe some changes need to be made. R2RML was made
> >> to take advantage of SQL as much as possible; that is why you can define
> >> a mapping on table or on a sql query. Take for example the following
> >> R2RML mappings for Musicbrainz [4]. You can see that the tuples from
> >> "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
> >> mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
> >> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
> >> how to do this without a SQL engine. Therefore, should SQL engines be
> >> involved in the CSV to RDF transformation?
> >>
> >> Another instance where R2RML relies heavily on SQL is when you want to
> >> translate database codes into IRIs [5]. For example, if you have a code
> >> value "eng" which should be mapped to some URI
> >> http://example.com/engineering, which is part of a well defined
> >> thesaurus/vocabulary.
> >
> > I agree that R2RML is a possible starting point and also that it does not
> > apply automatically.
> >
> > There often isn't an explicit primary key nor proper foreign keys.  If the
> > CSV conversion process can influence the CSV format, then there is a lot
> > that can be done but if the CSV format is fixed, it may not be ideal.
> I am not sure what you mean here when you say 'influence'. I presume we
> should take the CSV format as a black box, meaning that this group will not
> (and should not) define what CSV is and we should be able to take anything
> that is out there....
>

+1


>
> The attached metadata may be used to define a 'primary key', i.e., a column
> that serves as such, but I do not think we should even talk about foreign
> keys in this context...
>

Fair enough.
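
To make that concrete, here is a rough sketch of how a Direct Mapping style conversion might use such metadata. The `metadata` dictionary and its `header` and `primary_key` fields are made up for illustration (not a proposed format), and http://example.com/ is a placeholder base IRI. Rows get an IRI built from the declared key column, or a blank node when no key is declared.

```python
import csv
import io
from urllib.parse import quote

def direct_map(csv_text, metadata, base="http://example.com/"):
    """Yield (subject, predicate, object) triples, one per cell.

    `metadata` is a hypothetical dict:
    {"header": bool, "primary_key": column name or None}.
    Literals are not escaped; this is only a sketch.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if metadata.get("header"):
        names, rows = rows[0], rows[1:]
    else:
        # No header row declared: fall back to positional column names.
        names = [f"col{i + 1}" for i in range(len(rows[0]))] if rows else []
    key = metadata.get("primary_key")
    for n, row in enumerate(rows, start=1):
        record = dict(zip(names, row))
        if key is not None:
            # Row IRI minted from the declared key column.
            subject = f"<{base}row/{quote(key)}={quote(record[key])}>"
        else:
            # No key declared: the row becomes a blank node.
            subject = f"_:row{n}"
        for name in names:
            yield (subject, f"<{base}col#{quote(name)}>", f'"{record[name]}"')
```

Calling `list(direct_map("id,name\n7,Ann\n", {"header": True, "primary_key": "id"}))` gives two triples whose subject is built from the `id` column.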


>
> The problem with R2RML, as Juan's example above shows, is that it goes very
> quickly to the usage of SQL as a tool for all kinds of clever tricks. Which
> is great for R2RML, but I do not think we should go down that route for CSV.
> I am curious what our use cases will reveal in this sense but my gut feeling
> tells me that most of the CSV data out there are structurally simple (albeit
> possibly huge...)
>
> Ivan
> >
> > A single table may be a denormalized view and somehow the data structuring
> > needs to be put back into the output.
> >
> > It might be useful if we have a very simple concrete synthetic example to
> > talk about in discussing conversion options.
> >
> > Here's a contribution:
> >
> > ----------------------------
> > "Sales Region"," Quarter"," Sales"
> > "North","Q1",10
> > "North","Q2",15
> > "North","Q3",7
> > "North","Q4",25
> > "South","Q1",9
> > "South","Q2",15
> > "South","Q3",16
> > "South","Q4",31
> > ----------------------------
> >
> > There are two sales regions, each with 4 sales results.
> >
> > This needs some kind of term resolution to turn e.g. "North" into a URI for
> > the northern sales region.  It could be by an external lookup or by URI
> > template as in R2RML. External lookup gives better linking.
> >
> > Defining "views" may help replace the SQL with something.
> >
> > Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to lift
> > out the data in a more useful form.  I have doubts about two-stage
> > processes like this because the real outcome is doing the first step alone.
> > Rows-encoded-in-RDF then pushes the burden onto the data consumer; it's a
> > barrier to reuse. Of course, it's easier to add mechanically. The whole
> > area of times and dates is messy but important.
> > Calculations might be done in a way that utilizes JavaScript, which given
> > the likely audience makes it not all new technology, and is a route to
> > custom conversion.
> >
> >     Andy
> >
> >
>
>
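
For discussion, Andy's sales example with the two term-resolution options he mentions (external lookup, falling back to a URI template in the R2RML style) could be sketched roughly as below. Everything named here is an assumption for illustration: the `LOOKUP` table, the `TEMPLATE` string, the `ex:` predicates and the http://example.com/ IRIs. The sample data is cut down to two of Andy's rows.

```python
import csv
import io
from urllib.parse import quote

SALES_CSV = '''"Sales Region"," Quarter"," Sales"
"North","Q1",10
"South","Q4",31
'''

# Hypothetical external lookup; values not listed fall back to the template.
LOOKUP = {"North": "http://example.com/region/northern"}
# Hypothetical URI template with a {column name} placeholder.
TEMPLATE = "http://example.com/region/{Sales Region}"

def expand(template, record):
    """Fill {column} placeholders with percent-encoded cell values."""
    iri = template
    for name, value in record.items():
        iri = iri.replace("{" + name + "}", quote(value, safe=""))
    return iri

def resolve_region(record):
    """Prefer the external lookup, otherwise expand the template."""
    return LOOKUP.get(record["Sales Region"]) or expand(TEMPLATE, record)

def convert(csv_text):
    """Return one blank-node observation (three triples) per data row."""
    reader = csv.reader(io.StringIO(csv_text))
    # The header cells carry leading spaces in the sample, so strip them.
    names = [h.strip() for h in next(reader)]
    triples = []
    for n, row in enumerate(reader, start=1):
        record = dict(zip(names, row))
        obs = f"_:obs{n}"
        triples.append((obs, "ex:region", f"<{resolve_region(record)}>"))
        triples.append((obs, "ex:quarter", f'"{record["Quarter"]}"'))
        triples.append((obs, "ex:sales", int(record["Sales"])))
    return triples
```

Here "North" resolves through the lookup and "South" through the template, which shows the two options side by side without any SQL involved.
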
Received on Tuesday, 18 February 2014 15:05:51 UTC