Re: Provenance

Thanks.

I think this all looks reasonable (and I would probably go for the slightly longer version). I actually think that similar information should be put into the generated JSON, too, regardless of whether it is JSON-LD or not.

So that we do not forget this, I would propose adding an open issue to our issue management on GitHub[1], with a reference to this thread. As you say, we have more urgent things to solve first, so it is not yet a priority, but we should not forget to deal with it eventually; that is why issue handling was invented :-)

Thanks a lot Christopher for raising this!

Ivan

P.S. Christopher, I do not think I have added you to GitHub yet; that would be necessary for you to edit the document and add issues. Can you send me your GitHub handle? Thanks.

[1] https://github.com/w3c/csvw/issues

On 21 May 2014, at 12:38 , Christopher Gutteridge <cjg@ecs.soton.ac.uk> wrote:

> OKdokes. There are several possible issues, as it's easy to confuse a URI for the dataset with a URI or URL for the actual document. E.g. you can have an RDF dataset expressed as .rdf or .ttl or .ntriples etc. Let's just assume it's all URLs for now.
> 
> Source CSV: http://example.org/input.csv
> Output RDF: http://example.org/output.rdf
> CSV Metadata: http://example.org/myformat.metadata
> 
> @prefix time: <http://www.w3.org/2006/time#>.
> @prefix prov: <http://www.w3.org/ns/prov#>.
> @prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.
> 
> <http://example.org/output.rdf#provenance> a prov:Activity ;
>    prov:endedAtTime "2014-05-21T09:50:01+01:00"^^xsd:dateTime ;
>    prov:startedAtTime "2014-05-21T09:50:01+01:00"^^xsd:dateTime ;
>    prov:generated <http://example.org/output.rdf> ;
>    prov:used <http://example.org/input.csv>, <http://example.org/myformat.metadata> .
> 
> What would make this far more useful is a single additional triple indicating that the process used was a specific standard process, e.g. W3C CSV->RDF v1.0. Also, possibly, a way to distinguish that one "used" document describes the other. A more complex example would be this (I've just busked it and invented some csv2rdf properties; I'm not recommending them as-is):
> 
> <http://example.org/output.rdf#provenance> a prov:Activity ;
>    prov:endedAtTime "2014-05-21T09:50:01+01:00"^^xsd:dateTime ;
>    prov:startedAtTime "2014-05-21T09:50:01+01:00"^^xsd:dateTime ;
>    prov:generated <http://example.org/output.rdf> ;
> 
>   prov:qualifiedUsage [
>      a prov:Usage;
>      prov:entity    <http://example.org/input.csv> ;
>      prov:hadRole   csv2rdf:tabularDataToConvert
>   ];
> 
>   prov:qualifiedUsage [
>      a prov:Usage;
>      prov:entity    <http://example.org/myformat.metadata> ;
>      prov:hadRole   csv2rdf:tabularMetadata
>   ] .
> 
> <http://example.org/myformat.metadata> a csv2rdf:TabularDataMetadataDocument ;
>    csv2rdf:describes <http://example.org/input.csv> .
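The qualified-usage pattern sketched above could be emitted mechanically by a converter. Here is a minimal, standard-library-only Python sketch of that idea; note that the `csv2rdf:` prefix URI used below is a placeholder (the email explicitly invents these properties), and the function name and parameters are illustrative assumptions, not part of any specification:

```python
# Minimal sketch: formatting PROV-O provenance for a CSV->RDF conversion.
# Assumption: the csv2rdf prefix URI is a placeholder, not a published
# vocabulary; the email invents csv2rdf:tabularDataToConvert etc.
from datetime import datetime, timezone

PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#>.\n"
    "@prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.\n"
    "@prefix csv2rdf: <http://example.org/ns/csv2rdf#>.  # placeholder URI\n"
)

def conversion_provenance(output_url, used, started, ended):
    """Return Turtle for a prov:Activity that generated `output_url`.

    `used` maps each input URL to a role CURIE, so every input gets a
    prov:qualifiedUsage distinguishing the tabular data from its metadata.
    """
    usages = " ;\n".join(
        "   prov:qualifiedUsage [\n"
        "      a prov:Usage ;\n"
        f"      prov:entity <{url}> ;\n"
        f"      prov:hadRole {role}\n"
        "   ]"
        for url, role in used.items()
    )
    return (
        f"<{output_url}#provenance> a prov:Activity ;\n"
        f'   prov:startedAtTime "{started.isoformat()}"^^xsd:dateTime ;\n'
        f'   prov:endedAtTime   "{ended.isoformat()}"^^xsd:dateTime ;\n'
        f"   prov:generated <{output_url}> ;\n"
        f"{usages} .\n"
    )

t = datetime(2014, 5, 21, 9, 50, 1, tzinfo=timezone.utc)
turtle = PREFIXES + conversion_provenance(
    "http://example.org/output.rdf",
    {
        "http://example.org/input.csv": "csv2rdf:tabularDataToConvert",
        "http://example.org/myformat.metadata": "csv2rdf:tabularMetadata",
    },
    t, t,
)
print(turtle)
```

As the thread suggests, this is only a handful of extra triples per conversion, so it could reasonably be an optional output of any CSV->RDF processor.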
> 
> On 21/05/14 10:35, Ivan Herman wrote:
>> Christopher,
>> 
>> I think it is a good idea to add some provenance information to the output. Do you think you can write down, at least in a sketch, what triples you think should be generated using the metadata information we have in the metadata document?
>> 
>> Thanks
>> 
>> Ivan
>> 
>> P.S. You probably know the saying: no good deed goes unpunished:-)
>> 
>> 
>> 
>> On 21 May 2014, at 11:02 , Christopher Gutteridge <cjg@ecs.soton.ac.uk> wrote:
>> 
>>> While it's not a top priority, I see an exciting use for some of the recent provenance vocabulary work, for the Tabular(CSV)->Graph(RDF) route anyhow, as it's possible to add extra triples there. We may well know the URI of the source table and the URI of the metadata document; that's provenance right there. I would suggest (not as a high priority) that a recommended RDF way to express this relationship be included in this work, e.g. triples in the output RDF saying it was generated from source document(s) X, using metadata Y and process Z, at a given time and date, by an agent (the organisation/person/system making the conversion).
>>> 
>>> It should be just a handful of extra triples, and optional, but it would be good to give people a standard to follow. And also URIs to reference for the process followed (the algorithms being discussed now).
>>> 
>>> You can see an example of what I mean at the top of this TTL file:
>>> http://data.southampton.ac.uk/dumps/jargon/2014-05-08/jargon.ttl
>>> (ignore the http://purl.org/void/provenance/ns/ triples; that was the previous vocabulary we used, and we are now transitioning to http://www.w3.org/ns/prov#)
>>> -- 
>>> Christopher Gutteridge --
>>> http://users.ecs.soton.ac.uk/cjg
>>> 
>>> 
>>> University of Southampton Open Data Service:
>>> http://data.southampton.ac.uk/
>>> 
>>> You should read the ECS Web Team blog:
>>> http://blogs.ecs.soton.ac.uk/webteam/
>>> 
>>> 
>>> 
>> 
>> ----
>> Ivan Herman, W3C
>> Digital Publishing Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> WebID: http://www.ivan-herman.net/foaf#me
>> 
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me

Received on Wednesday, 21 May 2014 11:54:33 UTC