# Re: On Provenance for Queries for Web Data

From: Olaf Hartig <hartig@informatik.hu-berlin.de>
Date: Thu, 20 May 2010 16:34:57 +0200
To: Irini Fundulaki <fundul@ics.forth.gr>
Cc: public-xg-prov@w3.org, Vassilis Christophides <christop@ics.forth.gr>, Grigoris Karvounarakis <gregkar@gmail.com>, Yannis Theoharis <ytheohar@gmail.com>
Message-Id: <201005201635.01882.hartig@informatik.hu-berlin.de>
Hey Irini,

On Thursday 20 May 2010 10:50:05 Irini Fundulaki wrote:
> [...]
> We believe that you might have misread the paper. Our objective is to
> understand the requirements for the boolean trust application when the
> SPARQL OPTIONAL operator is concerned (because of the lack of support
> for the left outer join in the relational context). For this, we adopt
> the semantics of the EnTrust operator for the **boolean trust application
> only** but not in general the annotation mechanism that you propose

I understood that you present an example in which you adapt the tSPARQL
semantics (which is based on trust values in the interval [-1,1] associated
with each solution mapping) to the boolean trust application (which
associates a boolean value with each solution mapping).

> We believe that the main point worth noting here, is that a mapping
> annotated with the false value is treated as absent from the mapping set,
> and vice versa, a mapping that does not appear in the mapping set is
> considered as untrusted (i.e., annotated with the false value). This
> holds because of the strict semantics of the boolean trust
> application.

If that's your assumption, fine. But it's a wrong adaptation of the tSPARQL
semantics. A boolean-based adaptation of the tSPARQL algebra would not remove
a solution mapping merely because it is associated with the trust value 'false'.
Removing these mappings would only happen if there are EnTrust operators in the
algebra expression. That's not the case in your examples.
Users of tSPARQL have to add EnTrust operators (by adding ENSURE TRUST clauses
to the query) to declare explicitly for which parts of their query the output
mappings have to be ignored if they are not trustworthy. Hence, users have the
choice. That's not the case in your adaptation, where each mapping that is not
trustworthy is ignored immediately. What you seem to assume for your adaptation
of the tSPARQL semantics is that each operator in an algebra expression is
wrapped in an EnTrust operator. However, that's not the idea of tSPARQL and,
thus, that's not what should be referred to as the "trust semantics"
introduced by the tSPARQL document, as you do in your paper.
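To make the distinction concrete, here is a minimal Python sketch (not tSPARQL's actual implementation; the names `en_trust`, `mappings`, and the threshold are illustrative assumptions): solution mappings carry trust values, and only an explicit EnTrust operator in the algebra expression removes the untrusted ones.

```python
# Trust-annotated solution mappings: (bindings, trust value in [-1, 1]).
# Illustrative sketch only, not the tSPARQL implementation.
mappings = [
    ({"?x": "alice"}, 0.9),
    ({"?x": "bob"}, -0.4),   # untrusted, but still a solution mapping
]

def en_trust(omega, threshold=0.0):
    """EnTrust operator: drop mappings whose trust value is below the threshold."""
    return [(mu, t) for (mu, t) in omega if t >= threshold]

# Without an EnTrust operator anywhere in the algebra expression,
# the untrusted mapping for "bob" remains part of the result:
assert len(mappings) == 2

# Only where the user added an ENSURE TRUST clause (i.e. an EnTrust
# operator) is it filtered out:
assert en_trust(mappings) == [({"?x": "alice"}, 0.9)]
```

The point of the sketch is that filtering is opt-in per query part, not a blanket rule applied to every mapping.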

> By no means, do we change the semantics of SPARQL!.  As you can see in
> Table VIII-b \mu_17 in \Omega_4 is annotated with false, i.e. \Omega_4
> is deemed as empty. Therefore, the result of \Omega_1 LeftOuterJoin
> \Omega_4 = \Omega_1 as shown in Table VIII-c.  Mappings annotated with
> false can be omitted from the Tables, since they are treated as absent
> (as we explain in the paper), however we included them for
> presentation reasons.

Ah, now I see your point: in case mu_17 is assigned the trust value
'false' during a trust assessment that implements your provenance-based
approach (i.e. that makes use of the provenance expression), you assume this
mapping did not exist during query execution, and that's why the query engine
would calculate mu_21 and mu_20.
However, as far as I understand, you propose to do these trust assessments at
an arbitrary time after the query results have been determined. This means
that when the query was executed, mu_17 did exist (because at that time it was
not clear what trust value would be associated with it in a later trust
assessment procedure). For that reason, the query engine would never calculate
mu_21. Hence, the query engine could never attach a provenance expression to
mu_21 - there is no mu_21.
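The timing argument can be sketched in a few lines of Python (the join logic and the sample mappings are simplified illustrations, not the paper's formal definitions): OPTIONAL is a left outer join evaluated at query time, before any trust assessment takes place.

```python
# Sketch: SPARQL OPTIONAL as a left outer join over solution mappings.
# Simplified illustration; mapping names mirror mu_17/mu_21 from the discussion.

def compatible(mu1, mu2):
    """Two mappings are compatible if they agree on all shared variables."""
    return all(mu1[v] == mu2[v] for v in mu1 if v in mu2)

def left_outer_join(omega1, omega2):
    result = []
    for mu1 in omega1:
        extensions = [dict(mu1, **mu2) for mu2 in omega2 if compatible(mu1, mu2)]
        # Only if no compatible mapping exists does mu1 survive unextended.
        result.extend(extensions if extensions else [mu1])
    return result

omega1 = [{"?x": "book1"}]
omega4 = [{"?x": "book1", "?p": "10"}]   # plays the role of mu_17

# At query time mu_17 exists, so the join extends the mapping;
# the unextended mapping (the would-be mu_21) is never calculated:
assert left_outer_join(omega1, omega4) == [{"?x": "book1", "?p": "10"}]

# Only if mu_17 had been absent at query time would mu_21 appear:
assert left_outer_join(omega1, []) == [{"?x": "book1"}]
```

A trust assessment run after the fact can drop the extended mapping from the result, but it cannot make the engine retroactively produce the unextended one.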

> Finally, note that in the paper we use boolean and ranked
> trust as indicative examples from a large body of applications ( bag
> semantics, view maintenance & update, access control, probabilistic
> databases) to discuss requirements of provenance models. These cannot
> be supported by an annotation mechanism defined for the trust
> assessment application only.

Sure. Sec.2.2 makes the advantages of applying a provenance-based approach
instead of an annotation-based approach quite clear. There was no confusion
on my side.
> [...]
> We introduce the term "abstract provenance model" to distinguish the
> provenance models from the different annotation models.

Okay. However, aren't "abstract provenance models" a special kind of
annotation model? They annotate the source data and the solutions with a
provenance expression.

> [...]
> The model that you propose in tSPARQL resembles to the best of our
> understanding to an abstract provenance model since at the end, the
> triples are annotated with expressions and not with values true/false.

Yes - only that it is not the triples that are annotated with expressions but
the solution mappings.
The restriction of the tSPARQL idea, in contrast to your general idea of
abstract provenance models, is that the tSPARQL engine evaluates the
expressions in the end as part of the query execution; that's why the tSPARQL
approach has some of the disadvantages you outline for annotation-based
approaches in Sec.2.2.

> The EnTrust model is closer to annotation computation than the former.

???
What do you mean by "EnTrust model"? The EnTrust operator is part of tSPARQL.

> [...]
>  > * In the first paragraph of Sec.3.1 you introduce a boolean trust based
>  > example. You may want to mention here that a more sceptical/pessimistic
>  > user may associate both operators \oplus and \odot with the a logical
>  > AND.
>
> This is exactly the benefit of an abstract provenance model when
> compared to annotation-based models! One does not need for every
> application and user to compute and store the provenance of the
> result! as it would be the case with annotation computations. One
> simply has to choose the appropriate tokens and operators and perform
> the computation once.

Yes, I did understand this.
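The "compute the provenance once, interpret it per application" idea can be illustrated with a small Python sketch (the expression encoding and token names are my own, not the paper's formalism): a stored provenance expression over source tokens is evaluated under whichever interpretation of \oplus and \odot an application chooses.

```python
# Provenance expression over source tokens, stored once:
# ("plus", e1, e2) encodes \oplus, ("times", e1, e2) encodes \odot,
# and a bare string is a source token. Illustrative encoding only.
prov = ("plus", ("times", "t1", "t2"), "t3")

def evaluate(expr, annotation, plus, times):
    """Evaluate a provenance expression under a chosen (plus, times) pair."""
    if isinstance(expr, str):                     # a source token
        return annotation[expr]
    op, left, right = expr
    f = plus if op == "plus" else times
    return f(evaluate(left, annotation, plus, times),
             evaluate(right, annotation, plus, times))

trust = {"t1": True, "t2": True, "t3": False}

# Boolean trust: \oplus = logical OR, \odot = logical AND
assert evaluate(prov, trust, lambda a, b: a or b, lambda a, b: a and b) is True

# The sceptical/pessimistic user: both \oplus and \odot are logical AND
assert evaluate(prov, trust, lambda a, b: a and b, lambda a, b: a and b) is False
```

Swapping the operator pair reinterprets the same stored expression, which is exactly why the provenance need only be computed and stored once.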

>  > * In Sec.4.4 you write that for DESCRIBE queries "the output contains
>  > all triples that have that value as a subject, predicate or object."
>  > That's not true. The SPARQL spec does not prescribe what exactly the
>  > result of a DESCRIBE query is.
>
> As far as the semantics of the DESCRIBE SPARQL operator are concerned:
> there is no formal definition of the semantics, but according to the
> SPARQL standard, the informal semantics are captured by what we
> discuss in the paper. [See 10.4 Descriptions of Resources at
> http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/]

No. Even the informal semantics
(http://www.w3.org/TR/rdf-sparql-query/#descriptionsOfResources) says: "The
RDF returned [...] may include information about other resources: for example,
the RDF data for a book may also include details about the author."
The query result in the example in 10.4.3 of the SPARQL spec includes triples
that don't contain the blank node that was bound to ?x.

>  > * In the paper you mention several times that you aim to find a provenance
>  > model for Linked Data or for SPARQL queries over Linked Data. However,
>  > that's not what you discuss in the paper. Everything that you do in the
>  > paper is related to SPARQL queries. There is nothing Linked Data specific
>  > The execution of SPARQL queries over Linked Data is only a good use case
>
> Well, Linked Data is expressed in RDF which are queried with SPARQL.
> Linked Data is a global dataspace where data from different sources are
> integrated and accessed by a large set of users. Consequently, Linked Data
> is an excellent motivation for provenance applications with requirements
> that cannot be fully addressed by annotation-based models as we clearly
> discuss in the paper.

Sure, it is an excellent motivation. However, you don't work on a provenance