Re: On Provenance for Queries for Web Data

Olaf, all

On 5/20/10 5:34 PM, Olaf Hartig wrote:
 > Hey Irini,
 >
 > On Thursday 20 May 2010 10:50:05 Irini Fundulaki wrote:
 >
 >> We believe that the main point worth noting here is that a mapping
 >> annotated with the false value is treated as absent from the mapping set,
 >> and vice versa: a mapping that does not appear in the mapping set is
 >> considered untrusted (i.e., annotated with the false value). This
 >> holds because of the strict semantics of the boolean trust
 >> application.
 >>
 > If that's your assumption, fine. But it's a wrong adaptation of the tSPARQL
 > semantics. A boolean-based adaptation of the tSPARQL algebra would not remove
 > a solution mapping only because it is associated with the trust value 'false'.
 > Removing these mappings would only happen if there are EnTrust operators in the
 > algebra expression. That's not the case in your examples.
 > Users of tSPARQL have to add EnTrust operators (by adding ENSURE TRUST clauses
 > to the query) to explicitly declare for which parts of their query the output
 > mappings have to be ignored if they are not trustworthy. Hence, users have the
 > choice. That's not the case in your adaptation where each mapping that is not
 > trustworthy is ignored immediately. What you seem to assume for your adaptation
 > of the tSPARQL semantics is that each operator in an algebra expression is
 > wrapped in an EnTrust operator. However, that's not the idea of tSPARQL and,
 > thus, that's not what should be referred to as the "trust semantics"
 > introduced by the tSPARQL document as you do in your paper.
 >
 >

We would like to clarify that we use the semantics of the EnTrust
operator (defined in the tSPARQL specification document) for the
boolean trust application when SPARQL OPTIONAL is used.  At this
point, we are discussing the requirements that the boolean trust
application imposes on a provenance model. The semantics of the
boolean trust application are clear to us from existing work on
relational data for positive relational algebra queries (and
consequently for the SPARQL fragment without OPTIONAL). For the
remaining case, OPTIONAL, we adopted the semantics of the EnTrust
operator.

 >> By no means do we change the semantics of SPARQL!  As you can see in
 >> Table VIII-b, \mu_17 in \Omega_4 is annotated with false, i.e. \Omega_4
 >> is deemed empty. Therefore, the result of \Omega_1 LeftOuterJoin
 >> \Omega_4 = \Omega_1 as shown in Table VIII-c.  Mappings annotated with
 >> false can be omitted from the Tables, since they are treated as absent
 >> (as we explain in the paper); however, we included them for
 >> presentation reasons.
 >>
 > Ah, now I see your point: in the case mu_17 is assigned the trust value
 > 'false' during a trust assessment that implements your provenance based
 > approach (i.e. that makes use of the provenance expression), you assume this
 > mapping did not exist during query execution and that's why the query engine
 > would calculate mu_21 and mu_20.
 > However, as far as I understand you propose to do these trust assessments at
 > an arbitrary time when the query results have been determined. This means,
 > when the query was executed mu_17 did exist (because by that time it is not
 > clear what trust value would be associated to it in a later trust assessment
 > procedure). For that reason, the query engine would never calculate mu_21.
 > Hence, the query could never attach a provenance expression to mu_21 - there
 > is no mu_21.
 >
 >

In Table VIII we discuss only the requirements that the boolean trust
application imposes on a provenance model.  Now, concerning the
evaluation of queries for provenance-aware applications and for
abstract provenance models: in general, query language operators work
on the support of relations (mapping bags in the case of SPARQL), i.e.
on the relation's tuples.  As shown in [1], the support of an annotated
relation consists of all the tuples that are not annotated with the
neutral element ("0") of the abstract sum ("+"). In the boolean trust
semiring, "0" is instantiated as "false" (see [1]).  Consequently, when
the annotation tokens change, the support of the relation changes, but
the semantics of the employed query language does not.
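
To make the notion of support concrete, here is a minimal sketch in
Python. The mapping names and annotations are made up for illustration
and are not taken from the tables in the paper; the function name
`support` is likewise only an assumption for this example.

```python
from typing import Dict, Set


def support(annotated: Dict[str, bool], zero: bool = False) -> Set[str]:
    """Return the mappings whose annotation differs from the semiring zero.

    For the boolean trust semiring the zero of "+" (logical OR) is False,
    so the support is exactly the set of mappings annotated with True.
    """
    return {name for name, annotation in annotated.items() if annotation != zero}


# Hypothetical annotated mapping bag (illustrative names only).
omega = {"mu_a": True, "mu_b": False}
before = support(omega)   # only mu_a is in the support

# Changing an annotation changes the support; support() itself is unchanged.
omega["mu_b"] = True
after = support(omega)    # now both mappings are in the support
```

The operator definition stays fixed here; only the annotations, and
hence the support, vary.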

The solution that we propose is to capture the semantics of the
OPTIONAL operator by also recording provenance expressions for some
mappings that do not belong to the support of the query result (their
expressions will eventually evaluate to "false" under boolean trust).
In the end, the evaluation of the provenance expressions determines
the support of the query result.

To conclude, first we evaluate the query under the query language
semantics, and then the evaluation of provenance expressions will
determine the support of the query result.
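
The two phases can be sketched roughly as follows, in Python. This is
only an illustration under stated assumptions, not the model from the
paper: the encoding of the unextended mapping as "p1 AND NOT (any
partner)" is an assumption made for this sketch, and all names
(`token`, `left_outer_join`, `result_support`, `t1`, `t4`) are
hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Mapping = Tuple[Tuple[str, str], ...]      # sorted (variable, value) pairs
Expr = Callable[[Dict[str, bool]], bool]   # provenance expression over tokens
Annotated = List[Tuple[Mapping, Expr]]


def token(name: str) -> Expr:
    """A provenance token, looked up in a trust assignment at evaluation time."""
    return lambda env: env[name]


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on all shared variables."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())


def merge(m1: Mapping, m2: Mapping) -> Mapping:
    return tuple(sorted({**dict(m1), **dict(m2)}.items()))


def left_outer_join(left: Annotated, right: Annotated) -> Annotated:
    """Phase 1: evaluate the join and record an expression for every candidate.

    Assumption of this sketch: the unextended left mapping is always recorded,
    annotated with "p1 AND NOT (any compatible partner)", so it may lie outside
    the support for some trust assignments.
    """
    out: Annotated = []
    for m1, p1 in left:
        partners = [(m2, p2) for m2, p2 in right if compatible(m1, m2)]
        for m2, p2 in partners:
            out.append((merge(m1, m2),
                        lambda env, p1=p1, p2=p2: p1(env) and p2(env)))
        exprs = [p for _, p in partners]
        out.append((m1,
                    lambda env, p1=p1, exprs=exprs:
                        p1(env) and not any(p(env) for p in exprs)))
    return out


def result_support(annotated: Annotated, env: Dict[str, bool]) -> List[Mapping]:
    """Phase 2: evaluate the expressions; keep mappings not annotated false."""
    return [m for m, p in annotated if p(env)]


omega_1: Annotated = [((("x", "a"),), token("t1"))]
omega_4: Annotated = [((("x", "a"), ("y", "b")), token("t4"))]
result = left_outer_join(omega_1, omega_4)

untrusted = result_support(result, {"t1": True, "t4": False})
trusted = result_support(result, {"t1": True, "t4": True})
```

With `t4` untrusted, only the unextended mapping survives; with `t4`
trusted, only the joined mapping does. The query is evaluated once, and
the support is decided later by the trust assignment.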

[1] T. J. Green, G. Karvounarakis, V. Tannen. Provenance Semirings. PODS 2007.


 >> [...]
 >> We introduce the term "abstract provenance model" to distinguish the
 >> provenance models from the different annotation models.
 >>
 > Okay. However, aren't "abstract provenance models" a special kind of
 > annotation model? They annotate the source data and solutions with a
 > provenance expression.
 >
 >

True.  But we tried to make the distinction clearer, since in the case
of abstract provenance models the annotations are expressions over
tokens.

 >
 >> >  * In Sec.4.4 you write that for DESCRIBE queries "the output contains
 >> >  all triples that have that value as a subject, predicate or object."
 >> >  That's not true. The SPARQL spec does not prescribe what exactly the
 >> >  result of a DESCRIBE query is.
 >>
 >> As far as the semantics of the DESCRIBE SPARQL operator are concerned:
 >> there is no formal definition of the semantics, but according to the
 >> SPARQL standard, the informal semantics are captured by what we
 >> discuss in the paper. [See 10.4 Descriptions of Resources at
 >> http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/]
 >>
 > No. Even the informal semantics
 > (http://www.w3.org/TR/rdf-sparql-query/#descriptionsOfResources) says:
 > "The RDF returned [...] may include information about other resources:
 > for example, the RDF data for a book may also include details about the
 > author."
 > The query result in the example in 10.4.3 of the SPARQL spec includes triples
 > that don't contain the blank node that was bound to ?x.
 >
 >
We will have a closer look at the formal semantics for DESCRIBE
once they are available. Thanks for letting us know about this detail.


 >> >  * In the paper you mention several times that you aim to find a provenance
 >> >  model for Linked Data or for SPARQL queries over Linked Data. However,
 >> >  that's not what you discuss in the paper. Everything that you do in the
 >> >  paper is related to SPARQL queries. There is nothing Linked Data specific
 >> >  about this.
 >> >  The execution of SPARQL queries over Linked Data is only a good use case
 >> >  for your work.
 >>
 >> Well, Linked Data is expressed in RDF, which is queried with SPARQL.
 >> Linked Data is a global dataspace where data from different sources are
 >> integrated and accessed by a large set of users. Consequently, Linked Data
 >> is an excellent motivation for provenance applications with requirements
 >> that cannot be fully addressed by annotation-based models as we clearly
 >> discuss in the paper.
 >>
 > Sure, it is an excellent motivation. However, you don't work on a provenance
 > model for Linked Data as you write in your Conclusions section.
 >
 >

So, what are the data provenance requirements for Linked Data that are
not addressed by the provenance models discussed in the paper?

 > Greetings,
 > Olaf
 >

Received on Friday, 21 May 2010 13:20:24 UTC