Re: CSV2RDF and R2RML from Gregg Kellogg on 2014-02-20 (public-csv-wg@w3.org from February 2014)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Thu, 20 Feb 2014 10:18:44 -0800
To: Ivan Herman <ivan@w3.org>
Cc: Andy Seaborne <andy@apache.org>, Juan Sequeda <juanfederico@gmail.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-Id: <32D1FC39-6DCA-4266-A470-8064EAE7CA85@greggkellogg.net>
On Feb 20, 2014, at 8:17 AM, Ivan Herman <ivan@w3.org> wrote:

> 
> 
> Andy Seaborne wrote:
> <snip/>
>>>> 
>>>>     Here's a contribution:
>>>> 
>>>>     ----------------------------
>>>>     "Sales Region"," Quarter"," Sales"
>>>>     "North","Q1",10
>>>>     "North","Q2",15
>>>>     "North","Q3",7
>>>>     "North","Q4",25
>>>>     "South","Q1",9
>>>>     "South","Q2",15
>>>>     "South","Q3",16
>>>>     "South","Q4",31
>>>>     ----------------------------
>>>> 
>>>>     There are two sales regions, each with 4 sales results.
>>>> 
>>>>     This needs some kind of term resolution to turn e.g. "North" into a URI
>>>>     for the northern sales region.  It could be by an external lookup or by
>>>>     URI template as in R2RML. External lookup gives better linking.
>>>> 
>>>>     Defining "views" may help replacing the SQL with something.
>>>> 
>>>> 
>>>> In this example, what would be the subject?
>> 
>> While we could use the row number as the basis of the primary key I think that
>> *may* lead to low-value data.
> 
> Yes, I obviously agree
> 
>> 
>> Just because you can convert a data table to some RDF, if the URIs are all
>> locally generated, I'm not sure there is strong value in a standard here.
>> 
> 
> Well, yes, that is true. *If* I was using the R2RML approach, that is (also)
> where it would come in and assign some URI.
> 
> But I am afraid R2RML would be too complicated for many of our expected audience
> here. That is why I followed Gregg's footsteps and pushed JSON-LD into the
> forefront (which is not necessarily the "kosher" thing to do in terms of RDF, ie,
> mixing syntax with model...). And the reason is because, in fact, JSON-LD has a
> mini-R2RML built into the system, which is @context. (That is what makes it unique
> among serializations. I wish we had kept the idea for RDFa, but that is water
> under the bridge now.)
> 
> Ie, if the data publisher can also provide a @context in some metadata format,
> then the two together may map the local names to global URI-s easily.


That's exactly the point, although some amount of metadata beyond what JSON-LD provides is likely necessary to handle more real-world use cases (such as composite primary keys).

>> In this example we would ideally use "North" to resolve to a URI in the corporate
>> data dictionary because the "Sales Region" column is known to be a key (inverse
>> functional property).
>> 
>> "North" need not appear in the output.
>> 
>> Given:
>> 
>> prefix corp: <http://mycorp/globalDataDictionary/>
>> 
>> corp:region1 :Name "North" .
>> corp:region2 :Name "South" .
>> 
>> We might get from row one:
>> 
>> corp:region1 :sales [ :period "Q1" ; :value 10 ] .
>> 
>> (including a blank node - a separate discussion! - let's use generated ids for
>> now:)
>> 
>> corp:region1 :sales gen:57 .
>> gen:57 :period "Q1" ;
>>       :value 10  .
>> 
>> 
>> or a different style:
>> 
>> <http://corp/file/row1>
>>       :region corp:region1 ;
>>       :period "Q1" ;
>>       :sales 10 .
>> 
> 
> I think that if I follow a simple JSON mapping plus declaring the "Sales Region"
> as, sort of, primary, I can get to something in JSON-LD like
> 
> {
> 	"North" :
> 		[ {
> 			"quarter" : "Q1",
> 			"sales" : 10
> 		},
> 		{
> 			"quarter" : "Q2",
> 			"sales" : 15
> 		} ],
> 	"South" :
> 		[ .... ]
> }
> 
> (I am just making up a simple 'CSV direct mapping') which, with a suitable
> @context, could then be transformed into RDF like:
> 
> [
> 	<http://corp/region/north>
> 		[ <http://corp/quarter> : "Q1", <http://corp/sales> : 10 ],
> 		[ <http://corp/quarter> : "Q2", <http://corp/sales> : 15 ].
> 	<http://corp/region/south>
> 		...
> ]
> 
> yep, bunch of blank nodes, let us put that aside for a moment. (I hope I got it
> right, Gregg can correct me if I am wrong)
> 
> It is probably not exactly the Direct Mapping but, well, so be it. We have to
> do things that the community can really use easily (I think the direct mapping
> would mean to have separate objects identified by row numbers, right Juan?)

If the "Sales Region" is used to create an identifier, then you could get something like that. In this case, though, you might want to map Sales Region to something like dc:title and assert, in some way, that it is unique, so that a single BNode is allocated for each distinct value. This might be done implicitly given a chained representation such as the following:

{
  "@context": {
    "dc": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example/",
    "Sales Region": "dc:title",
    "Quarter": "dc:title",
    "Sales": "ex:revenue"
  },
  "@type": "ex:SalesRegion",
  "Sales Region": null,
  "ex:period": {
    "@type": "ex:SalesPeriod",
    "Quarter": null,
    "Sales": null
  }
}

This would result in something like the following:

[] a ex:SalesRegion;
   dc:title "North";
   ex:period
     [ a ex:SalesPeriod; dc:title "Q1"; ex:revenue 10 ],
     [ a ex:SalesPeriod; dc:title "Q2"; ex:revenue 15 ],
     [ a ex:SalesPeriod; dc:title "Q3"; ex:revenue 7 ],
     [ a ex:SalesPeriod; dc:title "Q4"; ex:revenue 25 ] .

[] a ex:SalesRegion;
   dc:title "South";
   ex:period
     [ a ex:SalesPeriod; dc:title "Q1"; ex:revenue 9 ],
     [ a ex:SalesPeriod; dc:title "Q2"; ex:revenue 15 ],
     [ a ex:SalesPeriod; dc:title "Q3"; ex:revenue 16 ],
     [ a ex:SalesPeriod; dc:title "Q4"; ex:revenue 31 ] .
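
To make the grouping step concrete, here is a rough Python sketch of how a processor might get from the CSV rows to that nested shape. It is only an illustration under assumptions of mine: the function name rows_to_jsonld, the use of "@graph" to hold the two region nodes, and the cleaned-up header row (no leading spaces) are not part of any proposal, and grouping on the "Sales Region" column is hard-coded rather than driven by the template.

```python
import csv
import io
import json

# The sample table from earlier in the thread, with header names tidied
# (an assumption: the original headers carry leading spaces).
CSV_TEXT = '''"Sales Region","Quarter","Sales"
"North","Q1",10
"North","Q2",15
"North","Q3",7
"North","Q4",25
"South","Q1",9
"South","Q2",15
"South","Q3",16
"South","Q4",31
'''

# Same term mappings as the @context in the template above.
CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "ex": "http://example/",
    "Sales Region": "dc:title",
    "Quarter": "dc:title",
    "Sales": "ex:revenue",
}


def rows_to_jsonld(text):
    """Group rows by "Sales Region" and nest one period node per row.

    Hypothetical helper: nodes carry no @id, so a JSON-LD processor
    would treat each region and each period as a blank node.
    """
    regions = {}  # region name -> node, in first-seen order
    for row in csv.DictReader(io.StringIO(text)):
        name = row["Sales Region"].strip()
        node = regions.setdefault(name, {
            "@type": "ex:SalesRegion",
            "Sales Region": name,
            "ex:period": [],
        })
        node["ex:period"].append({
            "@type": "ex:SalesPeriod",
            "Quarter": row["Quarter"].strip(),
            "Sales": int(row["Sales"]),
        })
    return {"@context": CONTEXT, "@graph": list(regions.values())}


doc = rows_to_jsonld(CSV_TEXT)
print(json.dumps(doc, indent=2))
```

The uniqueness assertion on Sales Region is what the dictionary keyed by region name stands in for here: rows sharing a value end up under the same node instead of getting one node per row.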

It may be that in some cases, we want to map one column to two properties, for example to create both a relative IRI subject and title based on Sales Region and Quarter.
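
As a rough sketch of what I mean, assuming a made-up relative IRI pattern of "region/{region}/{quarter}" and hypothetical helper names (slug, region_node, period_node), the same cell value could feed both the subject and the title:

```python
from urllib.parse import quote


def slug(value):
    """Turn a cell value into a path segment that is safe in a relative IRI."""
    return quote(value.strip().lower().replace(" ", "-"), safe="-")


def region_node(region):
    # The one column value is used twice: once to mint the subject,
    # once as the human-readable title.
    return {"@id": "region/" + slug(region), "dc:title": region}


def period_node(region, quarter, sales):
    # The period subject combines both columns; the IRI pattern is an
    # assumption for illustration only.
    return {
        "@id": "region/{}/{}".format(slug(region), slug(quarter)),
        "dc:title": quarter,
        "ex:revenue": sales,
    }


print(region_node("North"))
print(period_node("North", "Q1", 10))
```

The relative IRIs would then be resolved against whatever base the metadata supplies.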

Gregg

>> In my limited exposure to R2RML usage, the majority has been direct mapping,
>> with the app (SPARQL queries) directly and crudely pulling values out of the
>> data.  There is no RDF to RDF uplifting.  It seems to be caused by the need
>> for upfront investment and mixing responsibilities of access and modelling.
>> 
>> The better full mapping language of R2RML does not get the investment (quality
>> of tools seems to be an issue - too much expectation of free open source maybe?).
>> 
> 
> Yes, I think the scenario described by Juan is realistic, and I actually visited
> a company called Antidot (in France) a while ago who did that big time. They
> used Direct Mapping to get a clear image of the RDB structure...
> 
> The 'uplifting' issue is a real thorn in my side. The Direct Mapping really
> works if, really, one can rely on a good RDF rule engine. And we do not have
> that, which is a real shame...
> 
>> Being "devils advocate" here ...
>> I do wonder if the WG really does need to produce a *standardised* CSV to RDF
>> mapping or whether the most important part is to add the best metadata to the
>> CSV file and let different approaches flourish.
> 
> There is no doubt in my mind that the most important part of the job of this WG
> is to define the right metadata (and a way to find that metadata). I think we
> can define a simple mapping to JSON/RDF/XML and yes, you are right, it will not
> be a universal solution that will make everybody happy. Ie, in some cases,
> people will have to do different things using that metadata. But I think it is
> possible to cover, hm, at least a 60/40 if not 80/20 range...
> 
> Ivan
> 
> 
>> 
>> This is based on looking at the role and responsibilities in the publishing
>> chain: the publisher provides CSV files and the metadata - do they provide the
>> RDF processing algorithm as well?  Or does that involve consideration by the
>> data consumer on how they intend to use the tabular data?
>> 
>>    Andy
>> 
> <snip/>
>
Received on Thursday, 20 February 2014 18:19:16 UTC