# Re: On Provenance for Queries for Web Data

From: Irini Fundulaki <fundul@ics.forth.gr>
Date: Fri, 21 May 2010 16:25:01 +0300
To: Olaf Hartig <hartig@informatik.hu-berlin.de>
CC: public-xg-prov@w3.org, Vassilis Christophides <christop@ics.forth.gr>, Grigoris Karvounarakis <gregkar@gmail.com>, Yannis Theoharis <ytheohar@gmail.com>

Olaf, all,

On 5/20/10 5:34 PM, Olaf Hartig wrote:
> Hey Irini,
>
> On Thursday 20 May 2010 10:50:05 Irini Fundulaki wrote:
>
>> We believe that the main point worth noting here is that a mapping
>> annotated with the false value is treated as absent from the mapping
>> set, and vice versa: a mapping that does not appear in the mapping set
>> is considered untrusted (i.e., annotated with the false value). This
>> holds because of the strict semantics of the boolean trust
>> application.
>>
> If that's your assumption, fine. But it's a wrong adaptation of the
> tSPARQL semantics. A boolean-based adaptation of the tSPARQL algebra
> would not remove a solution mapping only because it is associated with
> the trust value 'false'. Removing these mappings would only happen if
> there are EnTrust operators in the algebra expression. That's not the
> case in your examples.
> Users of tSPARQL have to add EnTrust operators (by adding ENSURE TRUST
> clauses to the query) to explicitly declare for which parts of their
> query the output mappings have to be ignored if they are not
> trustworthy. Hence, users have the choice. That's not the case in your
> adaptation where each mapping that is not trustworthy is ignored
> immediately. What you seem to assume for your adaptation of the tSPARQL
> semantics is that each operator in an algebra expression is wrapped in
> an EnTrust operator. However, that's not the idea of tSPARQL and, thus,
> that's not what should be referred to as the "trust semantics"
> introduced by the tSPARQL document as you do in your paper.
>
>

We would like to clarify that we are using the semantics of the
EnTrust operator (defined in the tSPARQL specification document) for
the boolean trust application when SPARQL OPTIONAL is used. At this
point, we are discussing the requirements that the boolean trust
application imposes on a provenance model. The semantics of the
boolean trust application are clear to us from existing work on
relational data for positive relational algebra queries (and
consequently for the SPARQL fragment without OPTIONAL). To this end, we
adopted the semantics of the EnTrust operator for OPTIONAL.

>> By no means do we change the semantics of SPARQL! As you can see in
>> Table VIII-b, \mu_17 in \Omega_4 is annotated with false, i.e. \Omega_4
>> is deemed empty. Therefore, the result of \Omega_1 LeftOuterJoin
>> \Omega_4 = \Omega_1, as shown in Table VIII-c. Mappings annotated with
>> false can be omitted from the Tables, since they are treated as absent
>> (as we explain in the paper); however, we included them for
>> presentation reasons.
>>
> Ah, now I see your point: in the case mu_17 is assigned the trust value
> 'false' during a trust assessment that implements your provenance based
> approach (i.e. that makes use of the provenance expression), you assume
> this mapping did not exist during query execution and that's why the
> query engine would calculate mu_21 and mu_20.
> However, as far as I understand you propose to do these trust
> assessments at an arbitrary time when the query results have been
> determined. This means, when the query was executed mu_17 did exist
> (because by that time it is not clear what trust value would be
> associated to it in a later trust assessment procedure). For that
> reason, the query engine would never calculate mu_21. Hence, the query
> could never attach a provenance expression to mu_21 - there is no
> mu_21.
>
>

In Table VIII we discuss only the requirements that the boolean trust
application imposes on a provenance model. Now, concerning the
evaluation of queries for provenance-aware applications and for
abstract provenance models: in general, query language operators work
on the support of a relation (a bag of mappings in the case of SPARQL),
that is, on the relation's tuples. As shown in [1], the support of an
annotated relation consists of all the tuples that are not annotated
with the neutral element ("0") of the abstract sum ("+"). "0" is
substituted by "false" in the case of the boolean trust semiring (see
[1]). In this respect, the support of the relation changes when the
annotation tokens change, but the semantics of the employed query
language do not.

The solution that we propose is to capture the semantics of the
OPTIONAL operator by recording the provenance expressions for some
mappings that do not belong to the support of the query result
(expressions that will eventually evaluate to "false" for boolean
trust). In the end, the evaluation of the provenance expressions
determines the support of the query result.

To conclude, first we evaluate the query under the query language
semantics, and then the evaluation of provenance expressions will
determine the support of the query result.
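The two-phase idea can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the formalism of the paper: provenance expressions are modelled as plain Python functions over a trust assignment, the "all matches are untrusted" condition is encoded with an explicit negation (`absent`), and every name here (`token`, `both`, `absent`, `left_outer_join`, `support_under`, the token names) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, str]
Trust = Dict[str, bool]
Expr = Callable[[Trust], bool]          # provenance expression
Annotated = List[Tuple[Mapping, Expr]]  # bag of annotated mappings


def token(name: str) -> Expr:
    return lambda trust: trust[name]


def both(a: Expr, b: Expr) -> Expr:
    return lambda trust: a(trust) and b(trust)


def absent(a: Expr) -> Expr:
    # Assumption of this sketch: negation marks "treated as absent".
    return lambda trust: not a(trust)


def compatible(m1: Mapping, m2: Mapping) -> bool:
    # Two mappings are compatible if they agree on shared variables.
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def left_outer_join(left: Annotated, right: Annotated) -> Annotated:
    """Phase 1: evaluate the join and record an expression per output."""
    out: Annotated = []
    for lm, lexpr in left:
        matches = [(rm, rexpr) for rm, rexpr in right if compatible(lm, rm)]
        for rm, rexpr in matches:
            out.append(({**lm, **rm}, both(lexpr, rexpr)))
        # The unextended mapping is recorded as well; it belongs to the
        # support only if every compatible right-hand mapping is untrusted.
        guard = lexpr
        for _, rexpr in matches:
            guard = both(guard, absent(rexpr))
        out.append((lm, guard))
    return out


def support_under(result: Annotated, trust: Trust) -> List[Mapping]:
    """Phase 2: evaluate the recorded expressions to obtain the support."""
    return [m for m, expr in result if expr(trust)]


left_bag = [({"x": "a"}, token("t_left"))]
right_bag = [({"x": "a", "y": "b"}, token("t_right"))]
result = left_outer_join(left_bag, right_bag)
```

The query is evaluated once; `support_under` can then be called with any trust assignment, at any later time, without re-running the join.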

[1] T. J. Green, G. Karvounarakis, V. Tannen. Provenance Semirings. PODS 2007.

>> [...]
>> We introduce the term "abstract provenance model" to distinguish the
>> provenance models from the different annotation models.
>>
> Okay. However, aren't "abstract provenance models" a special kind of
> annotation model? They annotate the source data and solutions with a
> provenance expression.
>
>

True. But we tried to make the distinction clearer, since in the case
of abstract provenance models the annotations are expressions on
tokens.

>
>> >  * In Sec.4.4 you write that for DESCRIBE queries "the output contains
>> >  all triples that have that value as a subject, predicate or object."
>> >  That's not true. The SPARQL spec does not prescribe what exactly the
>> >  result of a DESCRIBE query is.
>>
>> As far as the semantics of the DESCRIBE SPARQL operator are concerned:
>> there is no formal definition of the semantics, but according to the
>> SPARQL standard, the informal semantics are captured by what we
>> discuss in the paper. [See 10.4 Descriptions of Resources at
>> http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/]
>>
> No. Even the informal semantics
> (http://www.w3.org/TR/rdf-sparql-query/#descriptionsOfResources) says:
> "The RDF returned [...] may include information about other resources:
> for example, the RDF data for a book may also include details about the
> author."
> The query result in the example in 10.4.3 of the SPARQL spec includes
> triples that don't contain the blank node that was bound to ?x.
>
>
We will have a closer look at the formal semantics for DESCRIBE
when they become available. Thanks for pointing out this detail.

>> >  * In the paper you mention several times that you aim to find a
>> >  provenance model for Linked Data or for SPARQL queries over Linked
>> >  Data. However, that's not what you discuss in the paper. Everything
>> >  that you do in the paper is related to SPARQL queries. There is
>> >  nothing Linked Data specific. The execution of SPARQL queries over
>> >  Linked Data is only a good use case
>>
>> Well, Linked Data is expressed in RDF, which is queried with SPARQL.
>> Linked Data is a global dataspace where data from different sources are
>> integrated and accessed by a large set of users. Consequently, it
>> is an excellent motivation for provenance applications with requirements
>> that cannot be fully addressed by annotation-based models, as we clearly
>> discuss in the paper.
>>
> Sure, it is an excellent motivation. However, you don't work on a
> provenance model for Linked Data as you write in your Conclusions
> section.
>
>

So, what are the data provenance requirements for Linked Data that are
not addressed by the provenance models discussed in the paper?

> Greetings,
> Olaf
>

Received on Friday, 21 May 2010 13:20:24 GMT
