Re: On Provenance for Queries for Web Data from Olaf Hartig on 2010-05-18 (public-xg-prov@w3.org from May 2010)

From: Olaf Hartig <hartig@informatik.hu-berlin.de>
Date: Tue, 18 May 2010 10:51:01 +0200
To: Irini Fundulaki <fundul@ics.forth.gr>
Cc: public-xg-prov@w3.org
Message-Id: <201005181051.02097.hartig@informatik.hu-berlin.de>

Hey Irini,

On Thursday 13 May 2010 22:17:43 Irini Fundulaki wrote:
> Dear all,
> 
> For your information, you can find at
> http://www.csd.uoc.gr/~theohari/files/OnProvenanceOfQueriesForWebData.pdf
> a short paper summarizing the preliminary results of our work on
> representing data provenance for  SPARQL queries evaluated on Linked Data.
> In the paper we (a) summarize the benefits of using abstract provenance
> models
> compared to annotation-based systems,
> (b) present the different models of data provenance developed so
> far in the relational world and (c) their limitations
> in capturing implicit provenance of SPARQL queries.
> 
> Any comments/suggestions are more than welcome.
> Looking forward to hearing from you

Interesting work, thanks for sharing.

I have several comments, starting with the most important one (that you may 
not like to hear).

Your whole argument in Sec.4.3 is based on an interpretation of the 
"trust semantics" from my trust-aware SPARQL extension tSPARQL. You should 
explain this semantics in the paper because it is fundamental to everything 
that follows. However, that's not the main problem. The problem is that your 
interpretation of the tSPARQL semantics is wrong. Sorry to say that.
The tSPARQL algebra is an extension of the SPARQL algebra that does not affect 
the results of the original SPARQL algebra operators. This means that, for the 
same input mappings, each tSPARQL operator produces exactly the same output 
mappings as the corresponding SPARQL operator. No more, no less. The tSPARQL 
algebra only augments the semantics of the SPARQL operators by defining the 
trust values that have to be associated with results of an operator based on 
the trust values associated with the corresponding input mappings. Therefore, 
if you apply tSPARQL semantics to the example in your paper (Table 8) for

  LeftJoin ( Omega_1 , Omega_4 )

where Omega_1 contains two trust-weighted solution mappings:

  mu_1 := ( {(?x,d),(?y,b)} , true )
  mu_2 := ( {(?x,f),(?y,g)} , true )

and Omega_4 contains:

  mu_17 := ( {(?y,b),(?z,c)} , false )

then the LeftJoin operator (Definition 2.8 in the tSPARQL spec) returns two 
results:

  mu'_19 := ( {(?x,d),(?y,b),(?z,c)} , tm(true,false) )
  mu_20 := ( {(?x,f),(?y,g)} , true )

That's it. Nothing more. If you assume a pessimistic trust merge function that 
defines 

 tm(true,false) := false

then mu'_19 becomes mu_19 as in your example.
But there is no mu_21 as in your example. How do you determine this mapping?
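
To make the evaluation above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: solution mappings are modelled as dicts, trust values as booleans, the LeftJoin has no filter condition, and the names left_join, compatible and pessimistic_tm are hypothetical helpers, not taken from the tSPARQL spec.

```python
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, str]          # variable name -> bound value
Weighted = Tuple[Mapping, bool]   # (solution mapping, trust value)


def compatible(a: Mapping, b: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(b[var] == val for var, val in a.items() if var in b)


def pessimistic_tm(t1: bool, t2: bool) -> bool:
    """Pessimistic trust merge (assumption): trusted only if both inputs are."""
    return t1 and t2


def left_join(left: List[Weighted], right: List[Weighted],
              tm: Callable[[bool, bool], bool]) -> List[Weighted]:
    """LeftJoin without a filter condition over trust-weighted mappings."""
    out: List[Weighted] = []
    for m1, t1 in left:
        partners = [(m2, t2) for m2, t2 in right if compatible(m1, m2)]
        if partners:
            # Extend m1 with each compatible partner and merge the trust values.
            for m2, t2 in partners:
                out.append(({**m1, **m2}, tm(t1, t2)))
        else:
            # No compatible partner: keep the left mapping and its trust value.
            out.append((m1, t1))
    return out


omega_1 = [({"?x": "d", "?y": "b"}, True),   # mu_1
           ({"?x": "f", "?y": "g"}, True)]   # mu_2
omega_4 = [({"?y": "b", "?z": "c"}, False)]  # mu_17

result = left_join(omega_1, omega_4, pessimistic_tm)
for mapping, trust in result:
    print(mapping, trust)
```

Under these assumptions the sketch yields exactly two weighted mappings, corresponding to mu_19 and mu_20; nothing in it produces a third one.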

Maybe, you mixed up your example with an example that additionally uses the 
ensure trust operator EnTrust that is introduced by tSPARQL (there is nothing 
like this in pure SPARQL). If you apply such an EnTrust to Omega_4 *before* 
you do the LeftJoin, then you get mu_21 (but no mu_19 anymore). However, such 
an algebra expression (with an EnTrust operator) has a different semantics than 
the expression without the EnTrust.
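
The difference between the two expressions can be sketched in Python as follows. This is a hedged illustration, not the tSPARQL definition: en_trust is a hypothetical stand-in that simply keeps the mappings whose boolean trust value is true, solution mappings are modelled as dicts, and the LeftJoin has no filter condition.

```python
from typing import Dict, List, Tuple

Mapping = Dict[str, str]          # variable name -> bound value
Weighted = Tuple[Mapping, bool]   # (solution mapping, trust value)


def compatible(a: Mapping, b: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(b[var] == val for var, val in a.items() if var in b)


def en_trust(omega: List[Weighted]) -> List[Weighted]:
    """Assumed boolean EnTrust: keep only mappings that are trusted."""
    return [(m, t) for m, t in omega if t]


def left_join(left: List[Weighted], right: List[Weighted]) -> List[Weighted]:
    """LeftJoin without a filter, using a pessimistic (logical AND) trust merge."""
    out: List[Weighted] = []
    for m1, t1 in left:
        partners = [(m2, t2) for m2, t2 in right if compatible(m1, m2)]
        if partners:
            out.extend(({**m1, **m2}, t1 and t2) for m2, t2 in partners)
        else:
            out.append((m1, t1))
    return out


omega_1 = [({"?x": "d", "?y": "b"}, True),   # mu_1
           ({"?x": "f", "?y": "g"}, True)]   # mu_2
omega_4 = [({"?y": "b", "?z": "c"}, False)]  # mu_17

without_entrust = left_join(omega_1, omega_4)
with_entrust = left_join(omega_1, en_trust(omega_4))

print(without_entrust)
print(with_entrust)
```

In this sketch the EnTrust step removes mu_17 from Omega_4, so mu_1 finds no compatible partner and passes through unextended with its original trust value. The two expressions therefore return different result sets, and neither returns both the extended and the unextended mapping for ?x = d.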

Further comments:

* It is not clear what you mean by "abstract provenance model". Sec.2.1 does 
not give an appropriate definition. It is also not clear what a "provenance 
expression" is or what you understand by "abstract operations" (second 
sentence in Sec.2.2).

* As an alternative approach to the use of an "abstract provenance model" you 
discuss the annotation of source data with values (Sec.2.2). That's, 
basically, what I do with tSPARQL, only that the trust values in my case are 
not assumed to be fixed (i.e. calculated once) and global (i.e. not 
subjective). However, it might seem a bit strange that you base your whole 
argument in Sec.4.3 on my annotation-based approach when you state in 
Sec.2.2 that annotation-based approaches are unsuitable in the context of 
Linked Data.

* In the first paragraph of Sec.3.1 you introduce a Boolean trust-based 
example. You may want to mention here that a more sceptical/pessimistic user 
may associate both operators \oplus and \odot with a logical AND.

* In Sec.4.4 you write that for DESCRIBE queries "the output contains all 
triples that have that value as a subject, predicate or object." That's not 
true. The SPARQL spec does not prescribe what exactly the result of a
DESCRIBE query is.

* In the paper you mention several times that you aim to find a provenance 
model for Linked Data or for SPARQL queries over Linked Data. However, that's 
not what you discuss in the paper. Everything that you do in the paper is 
related to SPARQL queries. There is nothing Linked Data specific about this. 
The execution of SPARQL queries over Linked Data is only a good use case for 
your work.

Please don't take these comments as a general deprecation of your work.
I really like your analysis; it is a very valuable contribution!

Greetings,
Olaf
Received on Tuesday, 18 May 2010 08:53:04 UTC