RE: Provenance in RDF from Hutchison, Nigel on 2002-02-27 (www-rdf-interest@w3.org from February 2002)

From: Hutchison, Nigel <Nigel.Hutchison@softwareag.com>
Date: Wed, 27 Feb 2002 14:40:11 +0100
To: "'Dave Reynolds'" <der@hplb.hpl.hp.com>
Cc: "RDF Interest (E-mail)" <www-rdf-interest@w3.org>
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060279E91B@daemsg02.software-ag.de>
Another way of doing out of band provenance would be to treat the statement
itself as a resource.

Suppose every statement had a URI (U, say).

Then we have

U===> subj --pred--> obj   (U references this statement as a resource)
and the provenance values are attached to U:
   U --pv:creator--> "Dave"
   U --pv:date--> "27/2/02"

The RDF API would have to have a method that returned the (unique) URI of
each statement.
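
As a rough illustration of the idea, here is a minimal Python sketch using
plain tuples rather than any real RDF library. The StatementStore class,
the "urn:stmt:" scheme and the hash-based URI are all assumptions made for
the example, not part of any existing RDF API.

```python
import hashlib


class StatementStore:
    """Toy triple store in which every statement has its own URI (hypothetical)."""

    def __init__(self):
        self.triples = set()

    def statement_uri(self, subj, pred, obj):
        # Assumption: derive U from the triple's content so that the same
        # triple always maps to the same URI.
        key = "\x00".join((subj, pred, obj)).encode("utf-8")
        return "urn:stmt:" + hashlib.sha1(key).hexdigest()

    def add(self, subj, pred, obj):
        self.triples.add((subj, pred, obj))
        return self.statement_uri(subj, pred, obj)


store = StatementStore()
u = store.add("ex:item1", "dc:title", "foo")

# Provenance is expressed as ordinary triples whose subject is U.
store.add(u, "pv:creator", "Dave")
store.add(u, "pv:date", "27/2/02")

creators = [o for (s, p, o) in store.triples if s == u and p == "pv:creator"]
```

One design caveat: a content-derived URI gives identical triples the same
U, so two users asserting the same title would share one U. A store that
needs a separate identity per assertion would have to mint U some other
way, for example from a counter.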

Or is this totally out of band? ;-) It should work OK with our RDF
implementation, but that's no excuse.

regards

Nigel Hutchison

Nigel W.O Hutchison
Chief Scientist 
Software AG
Uhlandstr 12,D-64297 Darmstadt, Germany
+49 6151 92 1207




-----Original Message-----
From: Dave Reynolds [mailto:der@hplb.hpl.hp.com]
Sent: Wednesday, February 27, 2002 1:14 PM
To: RDF Interest (E-mail)
Subject: Provenance in RDF


We are working on a semantic web related application that needs some
provenance support. We have various routes for doing this but would be
interested in hearing of others' experiences. Are there any groups out
there that have developed applications supporting provenance within RDF
that would be willing to share their experiences of what worked well or
badly?

To explain a little.

We are developing a semantic web application for shared information
management. In this application users are able to attach personal metadata
to items and are able to view the "soup" of metadata created by many
users. For example, the same item might have many different dc:title
fields created by different users, and the UI should be able to view this
data and give a response like 'most users call this "foo" but one user
prefers to call it "bar"'. To support this we want fine-grained tracking
of where the multiple metadata values came from, down to the level of
individual RDF assertions. The tracking data could include items like
creator, date and digital-signature; these terms would be defined in a
separate provenance schema/ontology.
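
To make the UI requirement concrete, here is a small Python sketch of the
kind of summary we have in mind. The user names and the summarize function
are invented for illustration; the point is only that the summary needs to
know which user asserted which value.

```python
from collections import Counter


def summarize(values_by_user):
    """Return the most common value, its count, and the remaining (value, count) pairs."""
    counts = Counter(values_by_user.values())
    ranked = counts.most_common()
    top_value, top_count = ranked[0]
    return top_value, top_count, ranked[1:]


# Hypothetical dc:title values for one item, keyed by the user who asserted them.
titles_by_user = {"alice": "foo", "bob": "foo", "carol": "foo", "dave": "bar"}
top_value, top_count, others = summarize(titles_by_user)
```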

We are exploring three approaches to doing this - application level,
reification and out-of-band. Each of these has pros and cons.

** Application level
Treat provenance as a data modeling problem at the application level and
introduce bNodes to which the provenance can be attached. Thus instead of:
   subj --pred--> obj
for any provenanced (is that a word? :-) values use:
   subj --pred--> <> --rdf:value--> obj
                     --pv:creator--> "Dave"
                     --pv:date--> "27/2/02"
This has the advantage of flexibility and means we can query provenance
data conveniently using existing RDF query languages (RDQL in our case).
However, as far as we know this is not a standard idiom and that might
make it harder to interoperate with other RDF metadata sources.
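
A minimal Python sketch of this idiom, using plain tuples instead of a
real RDF library. The helper names (new_bnode, add_with_provenance,
values_with_creator) and the "Alice" data are assumptions made for the
example; the lookup is a hand-written stand-in for what an RDQL query
would do.

```python
import itertools

_bnode_ids = itertools.count(1)


def new_bnode():
    # Hypothetical blank-node label generator.
    return "_:b%d" % next(_bnode_ids)


def add_with_provenance(triples, subj, pred, obj, creator, date):
    """Insert subj --pred--> <> and hang the value and provenance off the bNode."""
    node = new_bnode()
    triples.append((subj, pred, node))
    triples.append((node, "rdf:value", obj))
    triples.append((node, "pv:creator", creator))
    triples.append((node, "pv:date", date))
    return node


def values_with_creator(triples, subj, pred):
    """Return (value, creator) pairs for every provenanced value of subj/pred."""
    result = []
    for s, p, node in triples:
        if s == subj and p == pred:
            value = [o for (s2, p2, o) in triples if s2 == node and p2 == "rdf:value"]
            creator = [o for (s2, p2, o) in triples if s2 == node and p2 == "pv:creator"]
            if value and creator:
                result.append((value[0], creator[0]))
    return result


triples = []
add_with_provenance(triples, "ex:item1", "dc:title", "foo", "Dave", "27/2/02")
add_with_provenance(triples, "ex:item1", "dc:title", "bar", "Alice", "27/2/02")
titles = values_with_creator(triples, "ex:item1", "dc:title")
```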

** Reification
Clearly the official RDF mechanism for representing provenance is to use
reification and attach the same "pv:*" assertions to a node denoting the
reified statement.
This has the advantage of being the standard idiom at present; however,
the uncertain status of reification with the RDFCore WG leaves us nervous.
We can still query provenance data, though the query would now look rather
more ugly and verbose than if we take the application level approach. The
sheer number of triples needed is high but (a) it is too early to optimize
for performance and (b) we can in any case hide the overhead by
implementing a triple store which pretends to reify but in fact uses a
more compact representation.
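
To show where the triple count comes from, here is a short Python sketch
that spells out the reification quad for one statement using plain tuples.
The reify helper and the "_:s1" node label are assumptions for the
example; the rdf:Statement, rdf:subject, rdf:predicate and rdf:object
terms are the standard reification vocabulary.

```python
def reify(stmt_node, subj, pred, obj):
    """Return the four triples that describe one statement as a resource."""
    return [
        (stmt_node, "rdf:type", "rdf:Statement"),
        (stmt_node, "rdf:subject", subj),
        (stmt_node, "rdf:predicate", pred),
        (stmt_node, "rdf:object", obj),
    ]


# The original assertion, its reification, then the provenance on the reified node.
triples = [("ex:item1", "dc:title", "foo")]
triples += reify("_:s1", "ex:item1", "dc:title", "foo")
triples += [("_:s1", "pv:creator", "Dave"), ("_:s1", "pv:date", "27/2/02")]

triple_count = len(triples)
```

In this sketch one asserted triple with two provenance values costs seven
triples in total, which is the overhead a more compact store would hide.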

** Out of band
In this option we simply make provenance support a property of the API. We
don't change the RDF assertions in the main fact base at all. Instead we
provide API calls to attach and retrieve annotations from any RDF
assertion. This is related to the "quad" notion discussed on this list
some time ago and the N3 approach that every statement has an internal
context attribute. This has the advantage that it hides the mechanics of
provenance, allowing us to keep the application code stable even if the
implementation idiom changes. It has the disadvantage that we'd need to
extend our query support to access this additional API layer and is at
best unhelpful for integrating with other RDF data sources.
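
A minimal Python sketch of what such an API might look like. The
AnnotatedStore class and its annotate/annotations methods are
hypothetical names, not calls from any existing toolkit; annotations live
in a side table keyed by the triple, so the fact base itself is untouched.

```python
class AnnotatedStore:
    """Toy fact base with out-of-band annotations (hypothetical API)."""

    def __init__(self):
        self.facts = set()
        self._annotations = {}  # triple -> list of {property: value} dicts

    def add(self, subj, pred, obj):
        self.facts.add((subj, pred, obj))

    def annotate(self, triple, **properties):
        # Only assertions that are actually in the fact base can be annotated.
        if triple not in self.facts:
            raise KeyError(triple)
        self._annotations.setdefault(triple, []).append(dict(properties))

    def annotations(self, triple):
        # Return a copy so callers cannot modify the side table by accident.
        return list(self._annotations.get(triple, []))


store = AnnotatedStore()
stmt = ("ex:item1", "dc:title", "foo")
store.add(*stmt)
store.annotate(stmt, creator="Dave", date="27/2/02")
found = store.annotations(stmt)
```

Because the annotations never appear as triples, a query over store.facts
cannot see them, which is the query-support gap mentioned above.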

For our current purposes we will simply pick one and work with it but if
anyone else has already trodden this path and has experiences to share
then we'd love to hear from them.

Dave
Received on Wednesday, 27 February 2002 08:53:26 UTC