Re: CSV2RDF and R2RML

Andy Seaborne wrote:
<snip/>
>>>
>>>      Here's a contribution:
>>>
>>>      ----------------------------
>>>      "Sales Region"," Quarter"," Sales"
>>>      "North","Q1",10
>>>      "North","Q2",15
>>>      "North","Q3",7
>>>      "North","Q4",25
>>>      "South","Q1",9
>>>      "South","Q2",15
>>>      "South","Q3",16
>>>      "South","Q4",31
>>>      ----------------------------
>>>
>>>      There are two sales regions, each with 4 sales results.
>>>
>>>      This needs some kind of term resolution to turn e.g. "North" into a URI
>>>      for the northern sales region.  It could be by an external lookup or by
>>>      URI template as in R2RML. External lookup gives better linking.
>>>
>>>      Defining "views" may help replace the SQL with something.
>>>
>>>
>>> In this example, what would be the subject?
> 
> While we could use the row number as the basis of the primary key, I think
> that *may* lead to low-value data.

Yes, I obviously agree.

> 
> Just because you can convert a data table to some RDF, if the URIs are all
> locally generated, I'm not sure there is strong value in a standard here.
> 

Well, yes, that is true. *If* I were using the R2RML approach, that is (also)
where it would come in and assign some URI.

But I am afraid R2RML would be too complicated for much of our expected audience
here. That is why I followed in Gregg's footsteps and pushed JSON-LD into the
forefront (which is not necessarily the "kosher" thing to do in terms of RDF,
i.e., mixing syntax with model...). And the reason is that, in fact, JSON-LD has
a mini-R2RML built into the system, namely @context. (That is what makes it
unique among serializations. I wish we had kept the idea for RDFa, but that is
water under the bridge now.)

I.e., if the data publisher can also provide a @context in some metadata format,
then the two together can map the local names to global URIs easily.
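
To make that concrete, a publisher-provided @context could be as simple as this
(a minimal sketch; the http://corp/... URIs are just made up for the example
further below):

{
	"@context" : {
		"quarter" : "http://corp/quarter",
		"sales"   : "http://corp/sales",
		"North"   : "http://corp/region/north",
		"South"   : "http://corp/region/south"
	}
}

A generic JSON-LD processor can then take care of expanding the local names to
the global ones; no separate mapping language is needed.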

> In this example we would ideally use "North" to resolve to a URI in the
> corporate data dictionary, because the "Sales Region" column is known to be a
> key (inverse functional property).
> 
> "North" need not appear in the output.
> 
> Given:
> 
> prefix corp: <http://mycorp/globalDataDictionary/>
> 
> corp:region1 :Name "North" .
> corp:region2 :Name "South" .
> 
> We might get from row one:
> 
> corp:region1 :sales [ :period "Q1" ; :value 10 ] .
> 
> (including a blank node - a separate discussion! - let's use generated ids for
> now:)
> 
> corp:region1 :sales gen:57 .
> gen:57 :period "Q1" ;
>        :value 10 .
> 
> 
> or a different style:
> 
> <http://corp/file/row1>
>        :region corp:region1 ;
>        :period "Q1 ;
>        :sales 10 .
> 

I think that if I follow a simple JSON mapping, plus declaring the "Sales
Region" column as, sort of, the primary key, I can get to something in JSON-LD
like

{
	"North" :
		[ {
			"quarter" : "Q1",
			"sales" : 10
		},
		{
			"quarter" : "Q2",
			"sales" : 15
		} ],
	"South" :
		[ .... ]
}

(I am just making up a simple 'CSV direct mapping') which, with a suitable
@context, could then be transformed into RDF like:

[]	<http://corp/region/north>
		[ <http://corp/quarter> "Q1" ; <http://corp/sales> 10 ],
		[ <http://corp/quarter> "Q2" ; <http://corp/sales> 15 ] ;
	<http://corp/region/south>
		... .

Yep, a bunch of blank nodes; let us put that aside for a moment. (I hope I got
it right; Gregg can correct me if I am wrong.)
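
Putting the two pieces together, the complete JSON-LD document would be
something like this (again with the made-up http://corp/... URIs, so purely
illustrative):

{
	"@context" : {
		"quarter" : "http://corp/quarter",
		"sales"   : "http://corp/sales",
		"North"   : "http://corp/region/north",
		"South"   : "http://corp/region/south"
	},
	"North" : [
		{ "quarter" : "Q1", "sales" : 10 },
		{ "quarter" : "Q2", "sales" : 15 }
	],
	"South" : [
		{ "quarter" : "Q1", "sales" : 9 }
	]
}

Running that through a standard JSON-LD processor yields exactly the blank-node
structure above: an anonymous subject with the two region properties, each
pointing to blank nodes carrying the quarter/sales values.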

It is probably not exactly the Direct Mapping but, well, so be it. We have to
do things that the community can really use easily. (I think the direct mapping
would mean having separate objects identified by row numbers, right, Juan?)
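
I.e., a row-number-based direct mapping would presumably give something like
this instead (again before any @context is applied, and reusing Andy's
http://corp/file/row1 style of identifiers just for illustration):

[
	{
		"@id" : "http://corp/file/row1",
		"Sales Region" : "North",
		"Quarter" : "Q1",
		"Sales" : 10
	},
	{
		"@id" : "http://corp/file/row2",
		"Sales Region" : "North",
		"Quarter" : "Q2",
		"Sales" : 15
	}
]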

> 
> 
> In my limited exposure to R2RML usage, the majority has been direct mapping,
> with the app (SPARQL queries) directly and crudely pulling values out of the
> data.  There is no RDF-to-RDF uplifting.  It seems to be caused by the need
> for upfront investment and the mixing of responsibilities of access and
> modelling.
> 
> The better full mapping language of R2RML does not get the investment (quality
> of tools seems to be an issue - too much expectation of free open source maybe?).
> 

Yes, I think the scenario described by Juan is realistic; I actually visited a
company called Antidot (in France) a while ago that did exactly that, big time.
They used the Direct Mapping to get a clear image of the RDB structure...

The 'uplifting' issue is a real thorn in my side. The Direct Mapping only
really works if one can rely on a good RDF rule engine, and we do not have
that, which is a real shame...

> Being "devils advocate" here ...
> I do wonder if the WG really does need to produce a *standardised* CSV to RDF
> mapping or whether the most important part is to add the best metadata to the
> CSV file and let different approaches flourish.

There is no doubt in my mind that the most important part of the job of this WG
is to define the right metadata (and a way to find that metadata). I think we
can define a simple mapping to JSON/RDF/XML and, yes, you are right, it will
not be a universal solution that makes everybody happy. I.e., in some cases,
people will have to do different things using that metadata. But I think it is
possible to cover, hm, at least a 60/40 if not an 80/20 range...
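
Just to make this concrete: I imagine the metadata as a simple JSON file
travelling alongside the CSV, carrying both the structural annotations and the
@context. Something like the sketch below, where the property names
("columns", "key", "datatype") are pure invention on my part, not a proposal:

{
	"@context" : {
		"quarter" : "http://corp/quarter",
		"sales"   : "http://corp/sales"
	},
	"columns" : [
		{ "name" : "Sales Region", "key" : true },
		{ "name" : "Quarter" },
		{ "name" : "Sales", "datatype" : "integer" }
	]
}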

Ivan


> 
> This is based on looking at the roles and responsibilities in the publishing
> chain: the publisher provides CSV files and the metadata - do they provide the
> RDF processing algorithm as well?  Or does that involve consideration by the
> data consumer of how they intend to use the tabular data?
> 
>     Andy
> 
<snip/>

Received on Thursday, 20 February 2014 16:18:59 UTC