Provenance and properties of entities in RDF

I was aiming to draft something about entity attributes to flesh out the 
relationship between resources and provenance in PAQ, but ran into some 
devil-in-detail problems, which I'm trying to explore here...

This message arises in part from an off-line discussion with Stian, whom I thank 
for pointing me to important information about how the provenance model plays 
out in RDF, and for explaining some of the implications of this.  However, any 
errors and misapprehensions in what follows are all mine.


== Provenance and properties of entities in RDF ==

I've been looking at how provenance expressions may be represented in RDF, and 
how such representation interacts with attributes of an entity.

For the purpose of this discussion, I'll take a statement with dcterms:creator 
as an example:

   ex:aDocument a prov:Entity ;
     dcterms:creator "Meritorious Meerkat" .

The RDF statement with property dcterms:creator can be interpreted as an 
attribute of the entity, *and* as an expression of provenance about the entity.

To express the above as provenance using the provenance vocabulary as currently 
defined, we need to introduce a new class, a subclass of prov:ProcessExecution; e.g.

   ex:DocumentCreation rdfs:subClassOf prov:ProcessExecution .
   ex:aDocument a prov:Entity ;
     prov:wasGeneratedBy
       [ a ex:DocumentCreation ;
         prov:wasControlledBy
           [ a prov:Agent ;
             foaf:name "Meritorious Meerkat"
           ]
       ] .

I observe:
(a) this structure is quite similar to the sort of event-mediated structures 
that occur when using CIDOC-CRM [1].
(b) the structure is quite complex compared with the original example.

[1] http://www.cidoc-crm.org/docs/fin-paper.pdf

I'm not saying these are problems, but I am trying to explore the landscape from 
an implementer's perspective.

I think applications with a special interest in generating and/or consuming 
provenance information - workflow enactment systems come to mind - may 
reasonably generate and work with the more complex format (though my experience 
with using CIDOC-CRM in RDF suggests that some additional steps may be needed if 
processing of this data is to scale - but I don't see that as a primary concern 
at this juncture).

My main concerns are that we also want to be able to capture and use provenance 
information that is generated incidentally by applications that don't have a 
primary interest in provenance, and that the provenance information should 
similarly be accessible to applications that don't care for the intricacies of 
provenance information.  Such applications could easily generate and consume 
statements like the original using dcterms:creator, but may be less able to deal 
with the more complex provenance vocabulary structures.

In my mind, this raises the following questions:

(1) is the full complexity of the current provenance model structure actually 
needed?  I think it probably is, but I feel it's worth reflecting and asking the 
question.

(2) should we look to technical mechanisms to define the relationship between 
the simple provenance-as-attributes and fully-modeled provenance statements? 
(E.g., relating the two examples given above.)

(3) rather than defining an all-new vocabulary, should we consider basing the 
mapping of the abstract model to RDF on a subset of the CIDOC-CRM model 
structures?  (I don't think this would affect PROV-DM, but could affect many of 
the terms used in PROV-O, and cause some of the mapped structures in RDF to change.)
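
On question (2), here is a minimal sketch in Python of what such a mechanism 
might look like, relating the two examples above in both directions.  It is an 
illustration under stated assumptions, not a proposed mapping: triples are plain 
tuples rather than objects from a real RDF library, blank nodes are generated 
labels, the function names expand_creator and collapse_creator are hypothetical, 
and treating the agent that controlled the generating process execution as the 
dcterms:creator is my assumption, not something the model currently states.

```python
# Sketch only: triples are (subject, predicate, object) tuples of strings.
# Blank nodes are labels starting with "_:"; literals are plain strings.
# All helper names here are hypothetical, chosen for illustration.
from itertools import count

RDF_TYPE = "rdf:type"
_bnode_ids = count(1)


def new_bnode():
    """Return a fresh blank-node label such as '_:b1'."""
    return f"_:b{next(_bnode_ids)}"


def expand_creator(triples, execution_class="ex:DocumentCreation"):
    """Rewrite each dcterms:creator triple into the event-mediated form.

    Assumption: the creator is the agent controlling the process execution
    that generated the entity.
    """
    out = []
    for s, p, o in triples:
        if p != "dcterms:creator":
            out.append((s, p, o))  # pass other statements through unchanged
            continue
        execution, agent = new_bnode(), new_bnode()
        out += [
            (s, "prov:wasGeneratedBy", execution),
            (execution, RDF_TYPE, execution_class),
            (execution, "prov:wasControlledBy", agent),
            (agent, RDF_TYPE, "prov:Agent"),
            (agent, "foaf:name", o),
        ]
    return out


def collapse_creator(triples):
    """Derive simple dcterms:creator triples from the event-mediated form."""
    triples = list(triples)
    derived = []
    for s, p, execution in triples:
        if p != "prov:wasGeneratedBy":
            continue
        for e, p2, agent in triples:
            if e != execution or p2 != "prov:wasControlledBy":
                continue
            for a, p3, name in triples:
                if a == agent and p3 == "foaf:name":
                    derived.append((s, "dcterms:creator", name))
    return derived


simple = [
    ("ex:aDocument", RDF_TYPE, "prov:Entity"),
    ("ex:aDocument", "dcterms:creator", "Meritorious Meerkat"),
]
full = expand_creator(simple)        # the complex structure
recovered = collapse_creator(full)   # back to the simple attribute
```

Here recovered holds the original dcterms:creator statement again.  Note that 
going from the simple form to the full form requires choosing a process 
execution class (ex:DocumentCreation in this sketch), which is information the 
simple statement does not carry.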

At the very least, and I think this echoes Ivan Herman's recent email to the 
group [2], I think we need to find a way to make it clear how the simple 
attributes can be related to the defined provenance model, and maybe provide 
some guidelines to help provenance-aware applications to interpret and/or 
generate simple attributes that happen to express provenance information.

[2] http://lists.w3.org/Archives/Public/public-prov-wg/2011Oct/0140.html

#g

Received on Thursday, 20 October 2011 13:20:17 UTC