Re: Provenance for section 3 in technologies.tex from Kevin Smathers on 2003-06-25 (www-rdf-dspace@w3.org from June 2003)

From: Kevin Smathers <kevin.smathers@hp.com>
Date: Wed, 25 Jun 2003 14:15:45 -0700
To: "John S. Erickson" <john.erickson@hp.com>
Cc: www-rdf-dspace@w3.org
Message-ID: <3EFA1101.6090007@hp.com>

Quads aren't inherently representable in RDF although they can certainly 
be translated into reified statements.  There are a couple of issues 
here that we've just started exploring in Genesis.
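
To make the translation concrete, here is a minimal sketch of turning one quad into reified statements. It uses only the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The `EX_SOURCE` property, the `reify_quad` helper and the example URIs are assumptions made up for illustration; this is not how Genesis does it.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical property for carrying the fourth element of the quad.
EX_SOURCE = "http://example.org/terms#source"


def reify_quad(quad, stmt_id):
    """Translate one (subject, predicate, object, source) quad into triples."""
    s, p, o, source = quad
    return [
        # The four statements of standard RDF reification.
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
        # One more statement to hang the provenance on the reified node.
        (stmt_id, EX_SOURCE, source),
    ]


quad = (
    "http://example.org/doc1",
    "http://purl.org/dc/elements/1.1/creator",
    "Alice",
    "file:///data/a.rdf",
)
triples = reify_quad(quad, "_:stmt1")
for t in triples:
    print(t)
```

Under these assumptions every quad costs five triples, which is the per-statement overhead discussed next.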

First, there is the overhead of reifying statements (according to spec), 
and then the complexity of querying the reified statements.  Both are 
problematic, but if you want your internal reification information to be 
carried out into the RDF document then you will have to choose some 
means of representing it.  Borrowing from the example of four-statement 
reification in RDF, Genesis does something similar for larger graphs by 
changing the direct statements of the graph into indirect statements 
tied together through a node that gives the combined graph of statements 
their identity.  For well-known graphs this works pretty well; the graph 
can be identified, but the storage overhead is reduced to just a couple 
of extra statements per graph rather than a couple of extra statements 
per statement.  Our current prototype takes this approach.

My own favorite alternative is to make graph identity (or statement 
identity) equivalent to the wrapper that contains the serialized RDF.  
A round trip from RDF to internal quads sets the identity element of the 
quad based on e.g. the filename of the file (or attachment, or PGP 
signed block of text, etc.) which was read to create those statements.  
Assuming you can identify the source of the statements, you can recreate 
the quads.  The downside is that if you want each statement to have an 
identity which can be distinguished from the rest of its graph then you 
are left with the need to query or otherwise subindex from the source.
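
A rough sketch of that round trip, assuming the sources have already been parsed into lists of triples keyed by filename. The `load_quads` and `dump_by_source` helpers and the example data are hypothetical; no real RDF parser is involved.

```python
def load_quads(sources):
    """Build quads whose identity element is the name of the source read.

    `sources` maps a source name (e.g. a filename) to a list of triples.
    """
    return [
        (s, p, o, name)
        for name, triples in sources.items()
        for (s, p, o) in triples
    ]


def dump_by_source(quads):
    """Group quads back into plain triples, one list per source."""
    out = {}
    for s, p, o, name in quads:
        out.setdefault(name, []).append((s, p, o))
    return out


sources = {
    "a.rdf": [("ex:doc1", "dc:creator", "Alice"),
              ("ex:doc1", "dc:title", "Report")],
    "b.rdf": [("ex:doc1", "dc:creator", "Bob")],
}

quads = load_quads(sources)
# Writing the triples back out per source and reading them in again
# recreates the same quads, with no extra statements in the RDF itself.
assert load_quads(dump_by_source(quads)) == quads
```

Note that both statements from a.rdf end up with the same identity, which is the limitation described above.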

Another alternative is to abandon round-trip through standard RDF and 
either add local extensions that represent identity of the statements in 
the serialized form, or else assign a new identity to the statements 
after each round-trip.
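
The second of those options might look something like the following, assuming a fresh UUID per statement is an acceptable identity. The `assign_fresh_ids` helper is made up for illustration.

```python
import uuid


def assign_fresh_ids(triples):
    """Turn plain triples into quads, giving each statement a new identity."""
    return [(s, p, o, uuid.uuid4().urn) for (s, p, o) in triples]


triples = [
    ("ex:doc1", "dc:creator", "Alice"),
    ("ex:doc1", "dc:title", "Report"),
]

first_load = assign_fresh_ids(triples)
second_load = assign_fresh_ids(triples)

# The statements survive the round trip, but their identities do not:
# anything that referred to an old identity no longer resolves.
assert [q[:3] for q in first_load] == [q[:3] for q in second_load]
assert {q[3] for q in first_load}.isdisjoint({q[3] for q in second_load})
```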

I think there are reasonable arguments to be made for any of these 
approaches.

Cheers,
-kls

John S. Erickson wrote:

>This looks pretty good, Mark!
>
>Might need to explode the definition of "provenance" in this context --- here
>you imply *some* definition of whereItCameFrom and whoAuthoredIt, but there
>might in fact be domain-specific definitions of "provenance objects" (i.e.
>aggregations of provenance-informing properties that are useful to a
>*particular* community).
>
>John
>
>----- Original Message ----- 
>From: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
>To: <www-rdf-dspace@w3.org>
>Sent: Wednesday, June 25, 2003 11:38 AM
>Subject: Provenance for section 3 in technologies.tex
>
>
>>Some proposed text to describe metadata provenance in section 3 - any
>>comments?
>>
>>Metadata provenance: One of the key differences between the Semantic Web and
>>pre-existing systems is that the Semantic Web relies on using metadata from
>>many disparate sources, rather than having a centrally managed store of
>>metadata information. This means it is important to consider the provenance
>>of the metadata i.e. where it came from and who authored it. This
>>information is important because it enables the system processing the
>>metadata to make decisions about how to use it, for example if it possesses
>>several varying versions of metadata about the same object. In order to
>>guarantee provenance it may be necessary to use additional technologies e.g.
>>to cryptographically ensure that the originator information is correct and that
>>the metadata has not been tampered with. Once the metadata has been ingested
>>by the system, the system can also make choices about how to represent the
>>provenance information e.g. by reifying individual statements or by
>>adopting representations like quads that record the origin of individual
>>statements. Note that the usage of the term provenance is quite different to
>>its usage in the library community where it is used to refer to the record
>>of ownership of the item described by the metadata.
>>


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Wednesday, 25 June 2003 17:16:55 UTC