Re: CSV2RDF and R2RML from Juan Sequeda on 2014-02-20 (public-csv-wg@w3.org from February 2014)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Thu, 20 Feb 2014 12:30:11 -0600
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: Ivan Herman <ivan@w3.org>, Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAMVTWDwUgEvfFsW6JYm_x1bOtBLCDtrotS5x1us0=QewBBP_xg@mail.gmail.com>
Quick reaction: why wasn't JSON-LD around when we defined R2RML? :)

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Thu, Feb 20, 2014 at 12:18 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Feb 20, 2014, at 8:17 AM, Ivan Herman <ivan@w3.org> wrote:
>
> >
> >
> > Andy Seaborne wrote:
> > <snip/>
> >>>>
> >>>>     Here's a contribution:
> >>>>
> >>>>     ----------------------------
> >>>>     "Sales Region"," Quarter"," Sales"
> >>>>     "North","Q1",10
> >>>>     "North","Q2",15
> >>>>     "North","Q3",7
> >>>>     "North","Q4",25
> >>>>     "South","Q1",9
> >>>>     "South","Q2",15
> >>>>     "South","Q3",16
> >>>>     "South","Q4",31
> >>>>     ----------------------------
> >>>>
> >>>>     There are two sales regions, each with 4 sales results.
> >>>>
> >>>>     This needs some kind of term resolution to turn e.g. "North" into a URI
> >>>>     for the northern sales region.  It could be by an external lookup or by
> >>>>     URI template as in R2RML. External lookup gives better linking.
> >>>>
> >>>>     Defining "views" may help replacing the SQL with something.
> >>>>
> >>>>
> >>>> In this example, what would be the subject?
> >>
> >> While we could use the row number as the basis of the primary key I think that
> >> *may* lead to low-value data.
> >
> > Yes, I obviously agree
> >
> >>
> >> Just because you can convert a data table to some RDF, if the URIs are all
> >> locally generated, I'm not sure there is strong value in a standard here.
> >>
> >
> > Well, yes, that is true. *If* I was using the R2RML approach, that is (also)
> > where it would come in and assign some URI.
> >
> > But I am afraid R2RML would be too complicated for many of our expected audience
> > here. That is why I followed Gregg's footsteps and pushed JSON-LD into the
> > forefront (which is not necessarily the "kosher" thing to do in terms of RDF, ie,
> > mixing syntax with model...). And the reason is that, in fact, JSON-LD has a
> > mini-R2RML built into the system, which is @context. (That is what makes it unique
> > among serializations. I wish we had kept the idea for RDFa, but that is water
> > under the bridge now.)
> >
> > Ie, if the data publisher can also provide a @context in some metadata format,
> > then the two together may map the local names to global URI-s easily.
>
>
> That's exactly the point, although some amount of metadata beyond what
> JSON-LD provides is likely necessary to handle more real-world use cases
> (such as composite primary keys).
>
> >> This example would ideally use "North" to resolve to a URI in the corporate
> >> data dictionary because the "Sales Region" column is known to be a key (inverse
> >> functional property).
> >>
> >> "North" need not appear in the output.
> >>
> >> Given:
> >>
> >> prefix corp: <http://mycorp/globalDataDictionary/>
> >>
> >> corp:region1 :Name "North" .
> >> corp:region2 :Name "South" .
> >>
> >> We might get from row one:
> >>
> >> corp:region1 :sales [ :period "Q1" ; :value 10 ] .
> >>
> >> (including a blank node - a separate discussion! - let's use generated ids for
> >> now:)
> >>
> >> corp:region1 :sales gen:57 .
> >> gen:57 :period "Q1" ;
> >>       :value 10  .
> >>
> >>
> >> or a different style:
> >>
> >> <http://corp/file/row1>
> >>       :region corp:region1 ;
> >>       :period "Q1" ;
> >>       :sales 10 .
> >>
> >
> > I think that if I follow a simple JSON mapping plus declaring the "Sales Region"
> > as, sort of, primary, I can get to something in JSON-LD like
> >
> > {
> >       "North" :
> >               [ {
> >                       "quarter" : "Q1",
> >                       "sales" : 10
> >               },
> >               {
> >                       "quarter" : "Q2",
> >                       "sales" : 15
> >               } ],
> >       "South" :
> >               [ .... ]
> > }
> >
> > (I am just making up a simple 'CSV direct mapping') which, with a suitable
> > @context, could then be transformed into RDF like:
> >
> > [
> >       <http://corp/region/north>
> >               [ <http://corp/quarter> : "Q1", <http://corp/sales> : 10 ],
> >               [ <http://corp/quarter> : "Q2", <http://corp/sales> : 15 ].
> >       <http://corp/region/south>
> >               ...
> > ]
> >
> > yep, bunch of blank nodes, let us put that aside for a moment. (I hope I got it
> > right, Gregg can correct me if I am wrong)
> >
> > It is probably not exactly the Direct Mapping but, well, so be it. We have to
> > do things that the community can really use easily (I think the direct mapping
> > would mean to have separate objects identified by row numbers, right Juan?)
>
> If the "Sales Region" is used to create an identifier, then you could get
> something like that. In this case, though, you might want to map Sales
> Region to something like dc:title and assert that it is unique, in some
> way, so that a BNode is allocated for it. This might be done implicitly
> given a chained representation such as the following:
>
> {
>   "@context": {
>     "dc": "http://purl.org/dc/terms/",
>     "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
>     "ex": "http://example/",
>     "Sales Region": "dc:title",
>     "Quarter": "dc:title",
>     "Sales": "ex:revenue"
>   },
>   "@type": "ex:SalesRegion",
>   "Sales Region": null,
>   "ex:period": {
>     "@type": "ex:SalesPeriod",
>     "Quarter": null,
>     "Sales": null
>   }
> }
>
> This would result in something like the following:
>
> [] a ex:SalesRegion;
>    dc:title "North";
>    ex:period
>      [ a ex:SalesPeriod; dc:title "Q1"; ex:revenue 10 ],
>      [ a ex:SalesPeriod; dc:title "Q2"; ex:revenue 15 ],
>      [ a ex:SalesPeriod; dc:title "Q3"; ex:revenue 7 ],
>      [ a ex:SalesPeriod; dc:title "Q4"; ex:revenue 25 ] .
>
> [] a ex:SalesRegion;
>    dc:title "South";
>    ex:period
>      [ a ex:SalesPeriod; dc:title "Q1"; ex:revenue 9 ],
>      [ a ex:SalesPeriod; dc:title "Q2"; ex:revenue 15 ],
>      [ a ex:SalesPeriod; dc:title "Q3"; ex:revenue 16 ],
>      [ a ex:SalesPeriod; dc:title "Q4"; ex:revenue 31 ] .
>
> It may be that in some cases, we want to map one column to two properties,
> for example to create both a relative IRI subject and title based on Sales
> Region and Quarter.
>
> Gregg
>
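
The grouping Gregg describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rows_to_jsonld, the lower-cased relative "@id", and the use of "@graph" are made up for the example and are not anything the WG has defined. The context terms are the ones from the message above.

```python
import csv
import io
import json

# A shortened copy of the sample table from the thread.
SAMPLE = '''"Sales Region"," Quarter"," Sales"
"North","Q1",10
"North","Q2",15
"South","Q1",9
'''

# Context terms taken from the message above.
CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "ex": "http://example/",
    "Sales Region": "dc:title",
    "Quarter": "dc:title",
    "Sales": "ex:revenue",
}


def rows_to_jsonld(text):
    """Group CSV rows by "Sales Region" into one nested node per region."""
    reader = csv.reader(io.StringIO(text))
    # The sample header has stray leading spaces inside the quotes.
    header = [name.strip() for name in next(reader)]
    regions = {}
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        row = dict(zip(header, raw))
        name = row["Sales Region"]
        # One column feeds two properties: a relative IRI (assumed
        # pattern: the lower-cased name) and the dc:title.
        region = regions.setdefault(name, {
            "@id": name.lower(),
            "@type": "ex:SalesRegion",
            "Sales Region": name,
            "ex:period": [],
        })
        region["ex:period"].append({
            "@type": "ex:SalesPeriod",
            "Quarter": row["Quarter"],
            "Sales": int(row["Sales"]),
        })
    return {"@context": CONTEXT, "@graph": list(regions.values())}


if __name__ == "__main__":
    print(json.dumps(rows_to_jsonld(SAMPLE), indent=2))
```

Dropping the "@id" line would leave each region as a blank node, as in the Turtle above.
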
> >> In my limited exposure to R2RML usage, the majority has been direct mapping,
> >> with the app (SPARQL queries) directly and crudely pulling values out of the
> >> data.  There is no RDF to RDF uplifting.  It seems to be caused by the need
> >> for upfront investment and mixing responsibilities of access and modelling.
> >>
> >> The better full mapping language of R2RML does not get the investment (quality
> >> of tools seems to be an issue - too much expectation of free open source maybe?).
> >>
> >
> > Yes, I think the scenario described by Juan is realistic, and I actually visited
> > a company called Antidot (in France) a while ago who did that big time. They
> > used Direct Mapping to get a clear image of the RDB structure...
> >
> > The 'uplifting' issue is a real thorn in my side. The Direct Mapping really
> > works if, really, one can rely on a good RDF rule engine. And we do not have
> > that, which is a real shame...
> >
> >> Being "devil's advocate" here ...
> >> I do wonder if the WG really does need to produce a *standardised* CSV to RDF
> >> mapping or whether the most important part is to add the best metadata to the
> >> CSV file and let different approaches flourish.
> >
> > There is no doubt in my mind that the most important part of the job of this WG
> > is to define the right metadata (and a way to find that metadata). I think we
> > can define a simple mapping to JSON/RDF/XML and yes, you are right, it will not
> > be a universal solution that will make everybody happy. Ie, in some cases,
> > people will have to do different things using that metadata. But I think it is
> > possible to cover, hm, at least a 60/40 if not 80/20 range...
> >
> > Ivan
> >
> >
> >>
> >> This is based on looking at the role and responsibilities in the publishing
> >> chain: the publisher provides CSV files and the metadata - do they provide the
> >> RDF processing algorithm as well?  Or does that involve consideration by the
> >> data consumer on how they intend to use the tabular data?
> >>
> >>    Andy
> >>
> > <snip/>
> >
>
>
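
The two term-resolution options Andy mentions (external lookup versus URI template) could be combined roughly as follows. This is a hedged sketch, not a proposal: resolve_term, DATA_DICTIONARY and REGION_TEMPLATE are hypothetical names, the dictionary stands in for whatever external lookup service a publisher actually has, and the URIs are the example ones used in the thread.

```python
from urllib.parse import quote

# Stand-in for an external lookup (e.g. a corporate data dictionary).
# The entries mirror the corp:region1 / corp:region2 example in the thread.
DATA_DICTIONARY = {
    "North": "http://mycorp/globalDataDictionary/region1",
    "South": "http://mycorp/globalDataDictionary/region2",
}

# Fallback URI template in the spirit of R2RML templates; the pattern
# itself is an assumption based on Ivan's http://corp/region/north.
REGION_TEMPLATE = "http://corp/region/{value}"


def resolve_term(value, dictionary=DATA_DICTIONARY, template=REGION_TEMPLATE):
    """Return a URI for a cell value: lookup first, template second."""
    if value in dictionary:
        # External lookup gives better linking, so prefer it.
        return dictionary[value]
    # Otherwise mint a local URI, percent-encoding the cell value.
    return template.format(value=quote(value.strip().lower(), safe=""))


if __name__ == "__main__":
    for cell in ("North", "East Coast"):
        print(cell, "->", resolve_term(cell))
```

Values found in the dictionary link to shared identifiers. Anything else gets a locally generated URI, which is the lower-value case discussed above.
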
Received on Thursday, 20 February 2014 18:30:59 UTC