Proposed text for provenance section

Comments welcome.  I've tried to represent the various points of view.


\section{Provenance}

In both the library domain and the broader data and Semantic Web domains, it is important to know where data came from, both to judge which data to trust and to trace the cause of errors or changes in the data.  Although the underlying concept is the same, the communities mean different things by `provenance', so both usages are explained below.


\subsection{The Library Domain}

The library community (or rather the cultural heritage community, including museums and archives) uses the term `provenance' to mean the record of ownership of the primary object, not the history of the metadata that describes it.

No one who actually manages archives expects to track changes to the metadata over time.  Traditional library and information management systems keep logs that track metadata changes temporarily, but this is not considered important to the core mission of managing the \emph{content} over time.  Schemas change, contexts change, resources are described in myriad ways (all at the same time), and people make mistakes, fix them, add things and remove things; libraries do not track all this.

Example provenance data might look like something along these lines:

\begin{quote}
Object hdl:1271.1/1234 was submitted by user Robert Tansley <rtansley@fake.com> on 03-July-1999, and contained a single PDF file of 12515 bytes, checksum XXXXX.

A migration was performed on the PDF in hdl:1271.1/1234 by MacKenzie Smith using pdf2pdfa version 1.4, on 24-July-2003.  The PDF/A produced has 15035 bytes and a checksum of YYYYY.
\end{quote}
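
As a rough illustration only, the two events quoted above could be held as structured records rather than free text.  The class name, field names and the helper below are assumptions made for this sketch; they are not taken from any existing library system, and the checksum placeholders are copied unchanged from the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """One event in the history of a primary object (hypothetical schema)."""
    object_id: str
    action: str            # e.g. "submission" or "migration"
    agent: str             # the person who performed the action
    when: date
    file_format: str
    size_bytes: int
    checksum: str          # placeholder values, as in the prose example
    tool: Optional[str] = None


# The two events from the quoted example, in the order they occurred.
history: List[ProvenanceEvent] = [
    ProvenanceEvent("hdl:1271.1/1234", "submission", "Robert Tansley",
                    date(1999, 7, 3), "PDF", 12515, "XXXXX"),
    ProvenanceEvent("hdl:1271.1/1234", "migration", "MacKenzie Smith",
                    date(2003, 7, 24), "PDF/A", 15035, "YYYYY",
                    tool="pdf2pdfa 1.4"),
]


def latest_event(events: List[ProvenanceEvent]) -> ProvenanceEvent:
    """Return the most recent event, i.e. the one describing the current file."""
    return max(events, key=lambda e: e.when)
```

Note that every record describes the object itself (who held or transformed it, and when); nothing here records changes to descriptive metadata.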


\subsection{The Data Domain}

One of the key differences between the Semantic Web and pre-existing systems is that the Semantic Web relies on metadata from many disparate sources, rather than on a centrally managed metadata store.  It is therefore important to consider where the metadata came from and who authored it, because this enables the system processing the metadata to decide how to use it, for example when it holds several differing versions of metadata about the same object.
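
A minimal sketch of such a decision, assuming each metadata value is stored together with the source it came from.  The source URLs, the trust scores, the title strings and the function name are all invented for illustration; how trust is actually assigned is a policy question outside the scope of this sketch.

```python
from typing import Dict, List, NamedTuple


class SourcedValue(NamedTuple):
    """A metadata value together with the source that asserted it."""
    subject: str
    prop: str
    value: str
    source: str


# Hypothetical trust scores per source, on a 0.0 to 1.0 scale.
TRUST: Dict[str, float] = {
    "http://library.example.org/": 0.9,
    "http://harvester.example.net/": 0.4,
}


def preferred_value(candidates: List[SourcedValue],
                    trust: Dict[str, float],
                    default_trust: float = 0.0) -> str:
    """Pick the value asserted by the most trusted source.

    Sources missing from the trust table fall back to default_trust.
    """
    best = max(candidates, key=lambda c: trust.get(c.source, default_trust))
    return best.value


# Two differing versions of the same property for the same object.
candidates = [
    SourcedValue("hdl:1271.1/1234", "title",
                 "An Example Title", "http://library.example.org/"),
    SourcedValue("hdl:1271.1/1234", "title",
                 "an example titel", "http://harvester.example.net/"),
]
```

Without the source recorded alongside each value, the system would have no basis for choosing between the two versions.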

This idea of keeping track of where metadata came from is called `provenance' by the Semantic Web community.  It can also be described as keeping track of the `source' of metadata, a term that helps to distinguish it from provenance in the library domain.

To be sure of the source of a statement, it may be necessary to use additional technologies, e.g.\ cryptographic techniques that ensure the originator information is correct and that the metadata has not been tampered with.  Once the metadata has been ingested, the system can also choose how to represent the source information, e.g.\ by reifying individual statements or by adopting representations such as quads that record the origin of each statement.
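
The two representations can be compared in a small sketch that uses plain tuples instead of an RDF toolkit.  The \texttt{rdf:} terms are the standard RDF reification vocabulary; the \texttt{ex:source} property, the statement identifier and the source URL are assumptions made for this example.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]

# One statement about the example object, and the place it came from.
statement: Triple = ("hdl:1271.1/1234", "dc:format", "application/pdf")
source = "http://library.example.org/records"  # hypothetical source


def as_quad(triple: Triple, origin: str) -> Quad:
    """Quad form: the origin is carried as a fourth element."""
    s, p, o = triple
    return (s, p, o, origin)


def as_reified(triple: Triple, origin: str, stmt_id: str) -> List[Triple]:
    """Reified form: the statement becomes a resource that is described
    by further triples, one of which records its origin."""
    s, p, o = triple
    return [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
        (stmt_id, "ex:source", origin),  # ex:source is a made-up property
    ]


def source_of(reified: List[Triple], stmt_id: str) -> str:
    """Look up the recorded origin of a reified statement."""
    for s, p, o in reified:
        if s == stmt_id and p == "ex:source":
            return o
    raise KeyError(stmt_id)


quad = as_quad(statement, source)
reified = as_reified(statement, source, "_:stmt1")
```

The quad keeps one record per statement, whereas reification turns each statement into five triples; the trade-off is that reified statements stay within the ordinary triple model.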

A related concept is \emph{context}: the idea that an RDF statement may be true only in a certain context.  This context may be temporal (``was true at this time''), spatial, or related to the author of the statement.  Thus the concept of context has a wider scope than the source of the metadata alone, but the two are strongly related.



 Robert Tansley / Hewlett-Packard Laboratories / (+1) 617 551 7624

Received on Thursday, 3 July 2003 15:03:10 UTC