Re: concept illustrations for the data journalism example from Iker Huerga on 2011-05-12 (public-prov-wg@w3.org from May 2011)

From: Iker Huerga <ihuerga@linkatu.net>
Date: Thu, 12 May 2011 10:10:21 +0200
To: public-prov-wg@w3.org
Message-ID: <4DCB95ED.7070300@linkatu.net>
Hello Olaf, All,
> Out of curiosity I tried to describe the processing steps of the example
> using the Provenance Vocabulary [1].

Great work.

> 1.) The example does not talk about specific points in time at which the
> different processing steps happened (Hence, I omitted corresponding
> statements in my description). Shouldn't the example be extended with
> this kind of information?

In my opinion, yes it should.

> 2.) Processing step 4 says: "analyst (alice) downloads a turtle
> serialization (lcp1) ..." While I was trying to describe that fact, it
> felt strange that Alice was the agent/actor that accessed the server.
> Hence, I would say that Alice cannot download lcp1 directly, she must use
> an HTTP client software for that. Same for Bob in processing step 8.
> Should we add that to the example?

I agree with Olaf. I think that the object of the prv:performedBy
property should be an HTTP agent, for instance a SPARQL endpoint in a
query scenario.

> 3.) Processing step 7 says "government (gov) publishes an update (d2) of
> data (d1) as a new Web resource (r2)". That's inconsistent with processing
> steps 1 and 3 where gov publishes a Web resource r1 with RDF data f1
> generated from d1. Question: Was it the intention that gov now publishes
> d2 directly; wouldn't it be more consistent if gov were publishing RDF
> data f2 which was obtained from d2?

I think this could be achieved through SPARQL CREATE and INSERT (both
part of SPARQL 1.1 Update) by creating a new graph and then inserting
the new triples. But for this example I would modify processing step 7
as Olaf suggests.
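
A rough sketch of the CREATE-then-INSERT idea, using Python only to assemble the SPARQL 1.1 Update request text. The graph URI http://example.org/r2, the single dcterms:source triple, and the helper name build_update are assumptions made for illustration; nothing is sent to an endpoint.

```python
def build_update(graph_uri, triples):
    """Return a SPARQL 1.1 Update request that creates a graph and fills it.

    graph_uri: IRI of the new named graph (hypothetical).
    triples:   iterable of (subject, predicate, object) terms already
               written in SPARQL syntax, e.g. '<http://example.org/s>'.
    """
    body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
    return (
        f"CREATE GRAPH <{graph_uri}> ;\n"
        f"INSERT DATA {{\n"
        f"  GRAPH <{graph_uri}> {{\n"
        f"{body}\n"
        f"  }}\n"
        f"}}"
    )


# Hypothetical data: f2 is the RDF data that would be derived from d2.
update = build_update(
    "http://example.org/r2",
    [(
        "<http://example.org/f2>",
        "<http://purl.org/dc/terms/source>",
        "<http://example.org/d2>",
    )],
)
print(update)
```

The two operations are separated by a semicolon, so they can travel in one update request.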


Regarding processing step 2, I think that Olaf's suggestion of making
ex:prov a Named Graph containing provenance information would be the
best option. In my opinion (and I am not a provenance expert),
provenance information shouldn't be added to the HTTP payload, as this
could cause network overhead. In the Web scenario, some agents will
request provenance information and others will not.

If the approach, as I read in the "Guide to the Provenance Vocabulary", 
is to extend tools for automatically publishing provenance information, 
I would recommend that these tools generate a different graph for 
provenance information for each prv:DataItem. I will give an example 
extending processing step 2.

where exf1 = http://example.org/f1/ and ex = http://example.org/

# I really do not know whether rdf:about can be used in this context
# or not. If it cannot, sioc:about could be used instead.
exf1:prov  rdf:type dcterms:ProvenanceStatement ;
           rdf:about ex:f1 .
ex:f1      rdf:type prv:DataItem ;
           prv:createdBy [ rdf:type prv:DataCreation ;
                           prv:usedData ex:d1 ;
                           prv:performedBy ex:gov ] .

Thus, an agent could automatically retrieve provenance information when
necessary simply by requesting the resource's URI plus prov, for instance.
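
A minimal sketch of that "resource URI plus prov" convention, assuming the provenance graph is always one path segment below the resource. The helper name provenance_uri and the default suffix are hypothetical; they only mirror the exf1:prov naming used in the example above.

```python
def provenance_uri(resource_uri, suffix="prov"):
    """Append the provenance suffix to a resource URI as a new path segment."""
    # Strip any trailing slash first so ".../f1" and ".../f1/" give the same result.
    return resource_uri.rstrip("/") + "/" + suffix


print(provenance_uri("http://example.org/f1"))
# prints http://example.org/f1/prov
```

An agent that does not need provenance simply never requests the second URI, so the ordinary response carries no extra payload.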

What do you think about this approach? Is it a misconception on my part?

Best Regards.

-- 
Iker Huerga Sánchez
Co-Founder, LINKatu
Polo de Innovación Garaia
Goiru, 1. Edificio A, 4º Piso
20500 Arrasate - Gipuzkoa
T+34 943 712 072 F+34 943 712 223
ihuerga@linkatu.net
http://www.linkatu.net
Received on Thursday, 12 May 2011 08:13:18 UTC