Re: On Provenance for Queries for Web Data from Olaf Hartig on 2010-05-20 (public-xg-prov@w3.org from May 2010)

From: Olaf Hartig <hartig@informatik.hu-berlin.de>
Date: Thu, 20 May 2010 16:34:57 +0200
To: Irini Fundulaki <fundul@ics.forth.gr>
Cc: public-xg-prov@w3.org, Vassilis Christophides <christop@ics.forth.gr>, Grigoris Karvounarakis <gregkar@gmail.com>, Yannis Theoharis <ytheohar@gmail.com>
Message-Id: <201005201635.01882.hartig@informatik.hu-berlin.de>

Hey Irini,

On Thursday 20 May 2010 10:50:05 Irini Fundulaki wrote:
> [...]
> We believe that you might have misread the paper. Our objective is to
> understand the requirements for the boolean trust application when the
> SPARQL OPTIONAL operator is concerned (because of the lack of support
> for the left outer join in the relational context). For this, we adopt
> the semantics of the EnTrust operator for the **boolean trust application
> only** but not in general the annotation mechanism that you propose
> in your work.

I understood that you present an example in which you adapt the tSPARQL
semantics (which is based on trust values in the interval [-1,1] associated
with each solution mapping) to the application of boolean trust (which
associates a boolean value with each solution mapping).
 
> We believe that the main point worth noting here, is that a mapping
> annotated with the false value is treated as absent from the mapping set,
> and vice versa, a mapping that does not appear in the mapping set is
> considered as untrusted (i.e., annotated with the false value). This
> holds because of the strict semantics of the boolean trust
> application.

If that's your assumption, fine. But it's a wrong adaptation of the tSPARQL
semantics. A boolean-based adaptation of the tSPARQL algebra would not remove
a solution mapping only because it is associated with the trust value 'false'.
Removing these mappings would only happen if there are EnTrust operators in
the algebra expression. That's not the case in your examples.
Users of tSPARQL have to add EnTrust operators (by adding ENSURE TRUST clauses
to the query) to explicitly declare for which parts of their query the output
mappings have to be ignored if they are not trustworthy. Hence, users have the
choice. That's not the case in your adaptation, where each mapping that is not
trustworthy is ignored immediately. What you seem to assume for your
adaptation of the tSPARQL semantics is that each operator in an algebra
expression is wrapped in an EnTrust operator. However, that's not the idea of
tSPARQL and, thus, that's not what should be referred to as the "trust
semantics" introduced by the tSPARQL document as you do in your paper.
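
To illustrate the distinction, here is a minimal Python sketch. It is not
taken from tSPARQL or from the paper: the representation of annotated
mappings, the function names, and the choice of logical AND for combining
trust values in a join are all assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# A solution mapping paired with a boolean trust value (assumed representation).
Annotated = Tuple[Dict[str, str], bool]


def compatible(m1: Dict[str, str], m2: Dict[str, str]) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[v] == t for v, t in m1.items() if v in m2)


def join(left: List[Annotated], right: List[Annotated]) -> List[Annotated]:
    """Join that keeps every merged mapping, trusted or not.

    Combining the trust values with AND is an assumption of this sketch.
    """
    return [({**m1, **m2}, t1 and t2)
            for m1, t1 in left
            for m2, t2 in right
            if compatible(m1, m2)]


def en_trust(omega: List[Annotated]) -> List[Annotated]:
    """Boolean stand-in for EnTrust: drop mappings whose trust value is False."""
    return [(m, t) for m, t in omega if t]


# Hypothetical input mapping sets.
omega_a = [({"x": "ex:a"}, True)]
omega_b = [({"x": "ex:a", "y": "ex:b"}, False)]

# Without an explicit EnTrust, the untrusted mapping stays in the result.
without_ensure = join(omega_a, omega_b)

# The mapping is removed only because EnTrust was applied explicitly.
with_ensure = en_trust(join(omega_a, omega_b))
```

Wrapping every operator in en_trust would correspond to the adaptation
criticized above; applying it only where the user asks for it leaves the
choice with the user.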

> By no means, do we change the semantics of SPARQL!.  As you can see in
> Table VIII-b \mu_17 in \Omega_4 is annotated with false, i.e. \Omega_4
> is deemed as empty. Therefore, the result of \Omega_1 LeftOuterJoin
> \Omega_4 = \Omega_1 as shown in Table VIII-c.  Mappings annotated with
> false can be ommitted from the Tables, since they are treated as absent
> (as we explain in the paper), however we included them for
> presentation reasons.

Ah, now I see your point: in case mu_17 is assigned the trust value
'false' during a trust assessment that implements your provenance-based
approach (i.e. that makes use of the provenance expression), you assume this
mapping did not exist during query execution and that's why the query engine
would calculate mu_21 and mu_20.
However, as far as I understand, you propose to do these trust assessments at
an arbitrary time after the query results have been determined. This means
that when the query was executed, mu_17 did exist (because at that time it is
not clear what trust value would be associated with it in a later trust
assessment procedure). For that reason, the query engine would never
calculate mu_21. Hence, the query engine could never attach a provenance
expression to mu_21 - there is no mu_21.
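
A small Python sketch of this argument, using a plain left outer join over
mapping sets. The mappings below are hypothetical stand-ins and not the ones
from Table VIII; only the name mu_17 is borrowed for readability.

```python
from typing import Dict, List

Mapping = Dict[str, str]


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[v] == t for v, t in m1.items() if v in m2)


def left_outer_join(left: List[Mapping], right: List[Mapping]) -> List[Mapping]:
    """Left outer join without a filter condition.

    A left mapping is passed through unextended only if no compatible
    right mapping exists.
    """
    result: List[Mapping] = []
    for m1 in left:
        extensions = [{**m1, **m2} for m2 in right if compatible(m1, m2)]
        result.extend(extensions if extensions else [m1])
    return result


# Hypothetical data: omega_1 holds one mapping, mu_17 is compatible with it.
omega_1 = [{"x": "ex:a"}]
mu_17 = {"x": "ex:a", "y": "ex:b"}

# At execution time mu_17 exists, so only the extended mapping is produced.
at_execution_time = left_outer_join(omega_1, [mu_17])

# The unextended mapping appears only if mu_17 is absent from the start.
if_mu_17_absent = left_outer_join(omega_1, [])
```

Since the unextended mapping is never produced in the first case, there is
nothing to which a later trust assessment could attach an expression.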

> Finally, note that in the paper we use boolean and ranked
> trust as indicative examples from a large body of applications ( bag
> semantics, view maintenance & update, access control, probabilistic
> databases) to discuss requirements of provenance models. These cannot
> be supported by an annotation mechanism defined for the trust
> assessment application only.

Sure. Sec. 2.2 makes the advantages of applying a provenance-based approach
instead of an annotation-based approach quite clear. There was no confusion
about that on my side.

> [...]
> We introduce the term "abstract provenance model" to distinguish the
> provenance models from the different annotation models.

Okay. However, aren't "abstract provenance models" a special kind of
annotation model? They annotate the source data and solutions with a
provenance expression.

> [...]
> The model that you propose in tSPARQL resembles to the best of our
> understanding to an abstract provenance model since at the end, the
> triples are annotated with expressions and not with values true/false.

Yes - only that it is not the triples that are annotated with expressions but
the solution mappings.
The restriction of the tSPARQL idea in contrast to your general idea of
abstract provenance models is that the tSPARQL engine evaluates the
expressions in the end as part of the query execution; that's why the tSPARQL
approach has some of the disadvantages you outline for annotation-based
approaches in Sec. 2.2.

> The EnTrust model is closer to annotation computation than the former.

???
What do you mean by "EnTrust model"? The EnTrust operator is part of tSPARQL. 

> [...]
>  > * In the first paragraph of Sec.3.1 you introduce a boolean trust based
>  > example. You may want to mention here that a more sceptical/pessimistic
>  > user may associate both operators \oplus and \odot with the a logical
>  > AND.
> 
> This is exactly the benefit of an abstract provenance model when
> compared to annotation-based models! One does not need for every
> application and user to compute and store the provenance of the
> result! as it would be the case with annotation computations. One
> simply has to choose the appropriate tokens and operators and perform
> the computation once.

Yes, I did understand this.
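
For concreteness, a minimal Python sketch of that point: one stored
expression is evaluated under two interpretations of \oplus and \odot. The
expression, the token names, and the reading of the default interpretation
as OR for \oplus and AND for \odot are assumptions for illustration and are
not taken from the paper.

```python
from typing import Callable, Dict, Tuple, Union

# A provenance expression is a token name or a triple (operator, lhs, rhs).
Expr = Union[str, Tuple[str, "Expr", "Expr"]]
BoolOp = Callable[[bool, bool], bool]


def evaluate(expr: Expr, tokens: Dict[str, bool],
             oplus: BoolOp, odot: BoolOp) -> bool:
    """Evaluate an abstract expression under a chosen pair of operators."""
    if isinstance(expr, str):
        return tokens[expr]
    op, lhs, rhs = expr
    combine = oplus if op == "oplus" else odot
    return combine(evaluate(lhs, tokens, oplus, odot),
                   evaluate(rhs, tokens, oplus, odot))


def logical_or(a: bool, b: bool) -> bool:
    return a or b


def logical_and(a: bool, b: bool) -> bool:
    return a and b


# Hypothetical expression (t1 odot t2) oplus t3, computed and stored once.
expr: Expr = ("oplus", ("odot", "t1", "t2"), "t3")
tokens = {"t1": True, "t2": True, "t3": False}

# Assumed default reading: oplus as OR, odot as AND.
optimistic = evaluate(expr, tokens, oplus=logical_or, odot=logical_and)

# Sceptical user: both operators interpreted as AND.
pessimistic = evaluate(expr, tokens, oplus=logical_and, odot=logical_and)
```

With these token values the first reading yields True and the sceptical
reading yields False, although the stored expression is the same.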

>  > * In Sec.4.4 you write that for DESCRIBE queries "the output contains
>  > all triples that have that value as a subject, predicate or object."
>  > That's not true. The SPARQL spec does not prescribe what exactly the
>  > result of a DESCRIBE query is.
> 
> As far as the semantics of the DESCRIBE SPARQL operator are concerned:
> there is no formal definition of the semantics, but according to the
> SPARQL standard, the informal semantics are captured by what we
> discuss in the paper. [See 10.4 Descriptions of Resources at
> http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/]

No. Even the informal semantics
(http://www.w3.org/TR/rdf-sparql-query/#descriptionsOfResources) say: "The RDF
returned [...] may include information about other resources: for example, the
RDF data for a book may also include details about the author."
The query result in the example in 10.4.3 of the SPARQL spec includes triples 
that don't contain the blank node that was bound to ?x.

>  > * In the paper you mention several times that you aim to find a provenance
>  > model for Linked Data or for SPARQL queries over Linked Data. However,
>  > that's not what you discuss in the paper. Everything that you do in the
>  > paper is related to SPARQL queries. There is nothing Linked Data specific 
>  > about this.
>  > The execution of SPARQL queries over Linked Data is only a good use case
>  > for your work.
> 
> Well, Linked Data is expressed in RDF which are queried with SPARQL.
> Linked Data is a global dataspace where data from different sources are
> integrated and accessed by a large set of users. Consequently, Linked Data
> is an excellent motivation for provenance applications with requirements
> that cannot be fully addressed by annotation-based models as we clearly
> discuss in the paper.

Sure, it is an excellent motivation. However, you don't work on a provenance 
model for Linked Data as you write in your Conclusions section.

Greetings,
Olaf
Received on Thursday, 20 May 2010 14:36:31 UTC