Re: On Provenance for Queries for Web Data from Irini Fundulaki on 2010-05-20 (public-xg-prov@w3.org from May 2010)

From: Irini Fundulaki <fundul@ics.forth.gr>
Date: Thu, 20 May 2010 11:50:05 +0300
To: Olaf Hartig <hartig@informatik.hu-berlin.de>
CC: public-xg-prov@w3.org, Vassilis Christophides <christop@ics.forth.gr>, Grigoris Karvounarakis <gregkar@gmail.com>, Yannis Theoharis <ytheohar@gmail.com>
Message-ID: <4BF4F7BD.7090702@ics.forth.gr>
First of all, apologies to the list for the long email.
Olaf, all, you can find our comments inlined.

On 5/18/10 11:51 AM, Olaf Hartig wrote:
 >
 > Interesting work, thanks for sharing.
 >
 > I have several comments, starting with the most important one (that you
 > may not like to hear).
 >
 > Your whole argumentation in Sec.4.3 is based on an interpretation of the
 > "trust semantics" from my trust-aware SPARQL extension tSPARQL. You should
 > explain this semantics in the paper because it is fundamental to everything
 > that follows. However, that's not the main problem. The problem is that your
 > interpretation of the tSPARQL semantics is wrong. Sorry to say that.
 > The tSPARQL algebra is an extension of the SPARQL algebra that does not affect
 > the results of the original SPARQL algebra operators. This means, for the same
 > input mappings the tSPARQL operators produce exactly the same output mappings
 > as the corresponding SPARQL operator does. Not more, not less. The tSPARQL
 > algebra only augments the semantics of the SPARQL operators by defining the
 > trust values that have to be associated with results of an operator based on
 > the trust values associated with the corresponding input mappings. Therefore,
 > if you apply tSPARQL semantics to the example in your paper (Tables 8) for
 >
 >   LeftJoin ( Omega_1 , Omega_4 )
 >
 > where Omega_1 contains two trust weighted solution mappings:
 >
 >   mu_1 := ( {(?x,d),(?y,b)} , true )
 >   mu_2 := ( {(?x,f),(?y,g)} , true )
 >
 > and Omega_4 contains:
 >
 >   mu_17 := ( {(?y,b),(?z,c)} , false )
 >
 > then the LeftJoin operator (Definition 2.8 in the tSPARQL spec) returns two
 > results:
 >
 >   mu'_19 := ( {(?x,d),(?y,b),(?z,c)} , tm(true,false) )
 >   mu_20 := ( {(?x,f),(?y,g)} , true )
 >
 > That's it. Nothing more. If you assume a pessimistic trust merge function that
 > defines
 >
 >  tm(true,false) := false
 >
 > then mu'_19 becomes mu_19 as in your example.
 > But there is no mu_21 as in your example. How do you determine this mapping?
 >
 > Maybe, you mixed up your example with an example that additionally uses the
 > ensure trust operator EnTrust that is introduced by tSPARQL (there is nothing
 > like this in pure SPARQL). Given you apply such an EnTrust to Omega_4 *before*
 > you do the LeftJoin then you get mu_21 (but no mu_19 anymore). However, such
 > an algebra expression (with an EnTrust operator) has a different semantics than
 > the expression without the EnTrust.

We believe that you might have misread the paper. Our objective is to
understand the requirements of the boolean trust application where the
SPARQL OPTIONAL operator is concerned (because of the lack of support
for the left outer join in the relational context). For this, we adopt
the semantics of the EnTrust operator for the **boolean trust application
only**, but not, in general, the annotation mechanism that you propose
in your work.

We believe the main point worth noting here is that a mapping
annotated with the false value is treated as absent from the mapping set
and, conversely, a mapping that does not appear in the mapping set is
considered untrusted (i.e., annotated with the false value). This
holds because of the strict semantics of the boolean trust
application.

By no means do we change the semantics of SPARQL! As you can see in
Table VIII-b, \mu_17 in \Omega_4 is annotated with false, i.e. \Omega_4
is deemed empty. Therefore, the result of \Omega_1 LeftOuterJoin
\Omega_4 is \Omega_1, as shown in Table VIII-c. Mappings annotated with
false can be omitted from the Tables, since they are treated as absent
(as we explain in the paper); however, we included them for
presentation reasons.
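
The two readings discussed above can be contrasted in a small sketch.
It is only an illustration under stated assumptions: the function names
(compatible, left_join, drop_untrusted, pessimistic) and the Python
representation of mappings are hypothetical and come neither from the
paper nor from the tSPARQL spec, and the FILTER condition of LeftJoin is
left out for brevity. The data is the example quoted above (\Omega_1 and
\Omega_4).

```python
# Illustrative sketch only; all names here are assumptions.

def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(m2[v] == val for v, val in m1.items() if v in m2)

def left_join(omega_l, omega_r, tm):
    """LeftJoin over (mapping, trust) pairs; a merged mapping gets trust tm(t1, t2)."""
    out = []
    for m1, t1 in omega_l:
        partners = [(m2, t2) for m2, t2 in omega_r if compatible(m1, m2)]
        if partners:
            out.extend(({**m1, **m2}, tm(t1, t2)) for m2, t2 in partners)
        else:
            out.append((m1, t1))  # no compatible partner: keep the left mapping
    return out

def drop_untrusted(omega):
    """Strict boolean reading: a mapping annotated false is treated as absent."""
    return [(m, t) for m, t in omega if t]

def pessimistic(t1, t2):
    """Pessimistic trust merge: tm(true, false) = false."""
    return t1 and t2

omega_1 = [({"?x": "d", "?y": "b"}, True), ({"?x": "f", "?y": "g"}, True)]
omega_4 = [({"?y": "b", "?z": "c"}, False)]

# Reading 1: annotations are carried through the join.
annotated = left_join(omega_1, omega_4, pessimistic)

# Reading 2: false mappings are removed first, so Omega_4 is effectively empty.
strict = left_join(omega_1, drop_untrusted(omega_4), pessimistic)
```

Under the first reading the joined mapping for ?x = d is kept but
annotated false; under the second, the result is simply \Omega_1.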

Finally, note that in the paper we use boolean and ranked
trust as indicative examples from a large body of applications (bag
semantics, view maintenance & update, access control, probabilistic
databases) to discuss the requirements of provenance models. These cannot
be supported by an annotation mechanism defined for the trust
assessment application only.


 > Further comments:
 >
 > * It is not clear what you mean by "abstract provenance model". Sec.2.1 does
 > not give an appropriate definition. It is also not clear what a "provenance
 > expression" is or what you understand of a "abstract operations" (second
 > sentence in Sec.2.2).
 >

We introduce the term "abstract provenance model" to distinguish the
provenance models from the different annotation models.

As we state in the paper, "[...] abstract provenance models to capture
the relationship of query results with source data along with the
query operators that combined them". An abstract provenance model
consists of provenance tokens (i.e., source data annotations) and
provenance operators that record the query operators. A provenance
expression is an expression that involves those abstract tokens and
operators.

Depending on the application, one can evaluate these abstract provenance
expressions by substituting the tokens with concrete values, and the
abstract operators with concrete ones. As we discuss in the paper, in
the case of boolean trust, the former can be substituted with
true/false, and the latter with conjunction or disjunction. For
ranked trust, the former can be positive integers, whereas the latter
can be the max and '+' functions on those.
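
As a minimal sketch of this substitution step, the following evaluates
one abstract expression under two applications. Everything named here
(Token, Op, evaluate, the operator labels "oplus" and "odot", the sample
expression and its token values) is hypothetical. In particular, this
message does not say which abstract operator maps to which concrete
function, so the assignments below (oplus to disjunction or max, odot to
conjunction or '+') are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Token:
    """An abstract provenance token, i.e. a source data annotation."""
    name: str

@dataclass(frozen=True)
class Op:
    """An abstract provenance operator applied to two sub-expressions."""
    kind: str          # "oplus" or "odot" (labels assumed for this sketch)
    left: "Expr"
    right: "Expr"

Expr = Union[Token, Op]

def evaluate(expr, tokens, ops):
    """Substitute concrete values for tokens and concrete functions for operators."""
    if isinstance(expr, Token):
        return tokens[expr.name]
    return ops[expr.kind](evaluate(expr.left, tokens, ops),
                          evaluate(expr.right, tokens, ops))

# One abstract expression, recorded once: (t1 odot t2) oplus t3
expr = Op("oplus", Op("odot", Token("t1"), Token("t2")), Token("t3"))

# Boolean trust (assumed assignment): oplus -> disjunction, odot -> conjunction
boolean_ops = {"oplus": lambda a, b: a or b, "odot": lambda a, b: a and b}
trusted = evaluate(expr, {"t1": True, "t2": False, "t3": True}, boolean_ops)

# Ranked trust (assumed assignment): oplus -> max, odot -> '+'
ranked_ops = {"oplus": max, "odot": lambda a, b: a + b}
rank = evaluate(expr, {"t1": 1, "t2": 2, "t3": 5}, ranked_ops)
```

The expression itself is never recomputed; only the token values and the
operator table change from one application to the next.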

The model that you propose in tSPARQL resembles, to the best of our
understanding, an abstract provenance model, since in the end the
triples are annotated with expressions and not with the values true/false.
The EnTrust model is closer to annotation computation than the former.

 > * As an alternative approach to the use of an "abstract provenance model" you
 > discuss the annotation of source data with values (Sec.2.2). That's,
 > basically, what I do with tSPARQL, only that the trust values in my case are
 > not assumed to be fixed (i.e. calculated once) and global (i.e. not
 > subjective). However, it might seem a bit strange that you base your whole
 > argumentation in Sec.4.3 on my annotation based approach when you state in
 > Sec.2.2 that annotation based approaches are unsuitable in the context of
 > Linked Data.

 > * In the first paragraph of Sec.3.1 you introduce a boolean trust based
 > example. You may want to mention here that a more sceptical/pessimistic user
 > may associate both operators \oplus and \odot with a logical AND.

This is exactly the benefit of an abstract provenance model when
compared to annotation-based models! One does not need to compute and
store the provenance of the result separately for every application and
user, as would be the case with annotation computations. One
simply has to choose the appropriate tokens and operators and perform
the computation once.

 > * In Sec.4.4 you write that for DESCRIBE queries "the output contains all
 > triples that have that value as a subject, predicate or object." That's not
 > true. The SPARQL spec does not prescribe what exactly the result of a
 > DESCRIBE query is.

As far as the semantics of the DESCRIBE SPARQL operator are concerned:
there is no formal definition of the semantics, but according to the
SPARQL standard, the informal semantics are captured by what we
discuss in the paper. [See 10.4 Descriptions of Resources at
http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/]


 > * In the paper you mention several times that you aim to find a provenance
 > model for Linked Data or for SPARQL queries over Linked Data. However, that's
 > not what you discuss in the paper. Everything that you do in the paper is
 > related to SPARQL queries. There is nothing Linked Data specific about this.
 > The execution of SPARQL queries over Linked Data is only a good use case for
 > your work.

Well, Linked Data is expressed in RDF, which is queried with SPARQL.
Linked Data is a global dataspace where data from different sources are
integrated and accessed by a large set of users. Consequently, Linked
Data is an excellent motivation for provenance applications with
requirements that cannot be fully addressed by annotation-based models,
as we clearly discuss in the paper.


 > Please don't take these comments as a general deprecation of your work.
 > I really like your analysis; it is a very valuable contribution!
 >
 > Greetings,
 > Olaf

Best,
Irini
Received on Thursday, 20 May 2010 08:45:12 UTC