Original History Implementation Considerations

Forwarded for the record and referenceability.


Date: Mon, 28 Jan 2002 19:30:20 -0500 (EST)
Message-Id: <200201290030.TAA01485@melbourne-city-street.mit.edu>
From: Peter Breton <pbreton@MIT.EDU>
To: bass@mit.edu
Subject: history.requirements (I don't think has actually changed much)
Reply-to: pbreton@MIT.EDU


The purpose of the history subsystem is two-fold:

* To capture a time-based record of significant changes in DSpace, in a
  manner suitable for later refactoring or repurposing 

* To provide a corpus of data suitable for research by HP Labs and
  other interested parties

Note that the history data is not expected to provide current
information about the archive; it simply records what has happened in
the past.

Harmony Model

The Harmony project describes a simple and powerful approach for
modeling temporal data.

The DSpace history framework adopts this model.

The Harmony model is used by the serialization mechanism (and
ultimately by agents who interpret the serializations); users of the
History API need not be aware of it.

High-Level Approach

When anything of archival interest occurs in DSpace, a History object
is created. This object contains a reference to anything of archival
interest.
The history data component receives the object via either method calls
or Java event mechanisms. (Note that this does not preclude other
interested parties from acting on the object as well.) Upon receiving
the object, it serializes the state of all archive objects referred
to by it, and creates Harmony-style objects and associations to
describe the relationships between the objects. (A simple example is
given below.) Note that each archive object must have a unique
identifier to allow linkage between discrete events; this is discussed
under "Unique Ids" below.

The serializations (including the Harmony objects and associations)
are persisted (via the database), and marked as history data.

Archival Events

The following events are significant enough to warrant history
records:
  add Collection to Community
  add Item to Collection
  assign Handle to Item
  modify Item contents (Bundles, Bitstreams, metadata fields, etc.)
  workflow completed


Serialization

The serialization of an archival object consists of:

* Its instance fields (i.e., non-static, non-transient fields)
* The serializations of associated objects (or references to these
  objects)
The information necessary for a serialization must be provided either:

* Globally
* In the History object (or objects which are reachable from 
  History objects).

The serializer will use Java reflection; however, objects in the
org.dspace package may be handled specially. Some of them may simply
be converted to Strings (e.g., DSpaceDate), while others may have
custom serializations. (For example, EPerson passwords should not be
serialized.)
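
To make the reflection approach concrete, here is a minimal sketch in
Java. It is an illustration under stated assumptions, not the DSpace
serializer: the class names, the SERIALIZER_VERSION tag, the "@" key
prefix, and the name-based exclusion list are all hypothetical. It
collects the non-static, non-transient instance fields of an object,
skips a field named "password", and records a serializer version
alongside the data.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch, not the DSpace serializer.
final class ReflectiveSerializerSketch {
    // Assumed version tag recorded with every serialization.
    static final String SERIALIZER_VERSION = "sketch-0.1";

    // Assumed list of field names that are never written out.
    static final Set<String> EXCLUDED_FIELDS = Set.of("password");

    // Returns field name -> string value for the instance fields of obj.
    static Map<String, String> serialize(Object obj) {
        Map<String, String> out = new TreeMap<>();
        // "@" cannot occur in a Java field name, so these keys cannot collide.
        out.put("@serializerVersion", SERIALIZER_VERSION);
        out.put("@class", obj.getClass().getName());
        // Walk up the class hierarchy so inherited fields are included.
        for (Class<?> c = obj.getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                int mods = f.getModifiers();
                if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || f.isSynthetic()) {
                    continue; // not an instance field of interest
                }
                if (EXCLUDED_FIELDS.contains(f.getName())) {
                    continue; // e.g. passwords are never serialized
                }
                try {
                    f.setAccessible(true);
                    out.put(f.getName(), String.valueOf(f.get(obj)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot read field " + f.getName(), e);
                }
            }
        }
        return out;
    }
}

// Hypothetical stand-in for an archive object such as EPerson.
final class PersonSketch {
    static int instanceCount = 0;          // static: skipped
    transient String sessionToken = "tmp"; // transient: skipped
    String email;
    String password;                       // excluded by name

    PersonSketch(String email, String password) {
        this.email = email;
        this.password = password;
    }
}
```

Calling ReflectiveSerializerSketch.serialize(new PersonSketch(...))
yields a map holding the email field plus the two "@" entries. A real
implementation would also need to follow references to associated
objects, which this sketch flattens with String.valueOf.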

An as-yet unsolved problem is how to propagate DSpace context
information to the History mechanism.

Version information for the serializer itself should be included in
the serialization!

Unique Ids

To be able to trace the history of an object, it is essential that the
object have a unique identifier.

After discussion, the unique identifiers are only weakly tied to the
Handle system. Instead, the identifier consists of:

* an identifier for the project
* a site id (using the handle prefix)
* an RDBMS-based id for objects
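
As an illustration of how the three parts might be combined, here is
a small Java sketch. The ":" separator, the class and method names,
and the sample values are assumptions made for this example; the only
details taken from this message are the three components and the
64-character limit of the object_id column proposed under "History
Maps".

```java
// Hypothetical sketch of composing a history object identifier.
final class HistoryIdSketch {
    // Matches the VARCHAR(64) object_id column proposed for the history map.
    static final int MAX_LENGTH = 64;

    // Assumed layout: project ":" site id (handle prefix) ":" RDBMS id.
    static String compose(String project, String handlePrefix, long rdbmsId) {
        if (project.isEmpty() || handlePrefix.isEmpty()) {
            throw new IllegalArgumentException("project and handle prefix are required");
        }
        if (project.contains(":") || handlePrefix.contains(":")) {
            throw new IllegalArgumentException("parts must not contain the separator");
        }
        if (rdbmsId < 0) {
            throw new IllegalArgumentException("RDBMS id must be non-negative");
        }
        String id = project + ":" + handlePrefix + ":" + rdbmsId;
        if (id.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("identifier exceeds " + MAX_LENGTH + " characters");
        }
        return id;
    }
}
```

For example, compose("dspace", "123456789", 42) gives
"dspace:123456789:42", where "123456789" stands in for a site's
handle prefix.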

Why Synchronization Is Not a Problem

A classic problem with having data in two places is synchronization;
it is no longer always clear which data source is authoritative.

This is not a problem for the history data because:

* The data is read-only; once generated, it is never changed
* The data is temporal, and so it is only expected to be correct as
  of the time when it was generated.

Storage Considerations

The most naive approach to storage is simply to store everything,
without considering redundancy. This approach, while extremely simple,
tends to be very space intensive.

I propose a slightly more complex approach: when an archive object is
serialized, an object id and checksum are recorded. When another
object is serialized, the checksum for the serialization is matched
against existing checksums for that object. If the checksum already
exists, the object is not stored; a reference to the object is used
instead.
Note that since none of the serializations are deleted, ref counting
is unnecessary.
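
The checksum idea can be sketched in Java as follows. This is an
in-memory illustration only: the class name, the choice of SHA-256,
and the "objectId/checksum" reference format are assumptions, and a
real implementation would persist to the database rather than a map.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sketch of checksum-based storage.
final class DedupStoreSketch {
    // Key: objectId + "/" + checksum. Entries are never removed,
    // so no reference counting is needed.
    private final Map<String, String> stored = new HashMap<>();

    // Hex-encoded SHA-256 of the serialization (algorithm is an assumption).
    static String checksum(String serialization) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(serialization.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Stores the serialization only if this object has no entry with the
    // same checksum; returns a reference to the stored copy either way.
    String store(String objectId, String serialization) {
        String reference = objectId + "/" + checksum(serialization);
        stored.putIfAbsent(reference, serialization);
        return reference;
    }

    int storedCount() {
        return stored.size();
    }
}
```

Storing the same serialization of an object twice returns the same
reference and leaves storedCount() unchanged; only a changed
serialization adds a new entry.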

For Early Adopters, time to delivery will be the determining factor as
to whether a checksum approach is used.

History Maps

The history data is not initially stored in a queryable
form. Nonetheless, it is a good idea to provide at least basic
indications of what is stored, and where it is stored.

Therefore I propose the following simple RDBMS table implementation:

History table:
  -- Primary key, referenced by HistoryReference.history_id
  history_id INTEGER PRIMARY KEY,
  -- The history data
  data       TEXT,
  -- When the history data was created (this data is also in the history!)
  timestamp  TIMESTAMP

HistoryReference table: 
  history_reference_id INTEGER PRIMARY KEY,
  -- Reference to the history
  history_id           INTEGER FOREIGN KEY,
  -- Object Id
  object_id            VARCHAR(64),
  -- True if the history data only refers to the object
  refers_to            BOOLEAN

One way to trace the history of an object would be to find all history
serializations which refer to it (in the HistoryReference table), and
unwind and interpret these. When the history data refers to a
serialization of an object, use the History table to find the
serialization.

Example

An item is submitted to a collection via bulk upload. When (and if)
the Item is eventually added to the collection, an ItemSubmit history
method is called, with references to the Item, its Collection, the User who
performed the bulk upload, and some indication of the fact that it was
submitted via a bulk upload.

When called, the HistoryManager creates the following new resources
(all with unique ids):

  * An event
  * A state
  * An action 

It also generates the following relationships:

  event  --atTime-->     time
  event  --hasOutput-->  state
  Item   --inState-->    state
  state  --contains-->   Item
  action --creates-->    Item
  event  --hasAction-->  action
  action --usesTool-->   DSpace Upload
  action --hasAgent-->   User

The history component serializes the state of all archival objects
involved (in this case, the Item, the User, and the DSpace Upload). It
creates entries in the history map which associate the archival
objects with the generated serializations.
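
The relationships listed above can be pictured as simple
subject/predicate/object triples. The Java sketch below is only an
illustration: the class and method names, the use of random UUIDs for
the new resources, and the string form of the ids are assumptions,
not the HistoryManager API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of the Harmony-style relationships for an item submit.
final class HarmonyTripleSketch {
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " --" + predicate + "--> " + object;
        }
    }

    // Assumed id scheme for newly created resources.
    private static String newId(String kind) {
        return kind + ":" + UUID.randomUUID();
    }

    // Creates an event, a state and an action, then relates them to the
    // archive objects involved in the submission.
    static List<Triple> itemSubmit(String itemId, String userId, String toolId, String time) {
        String event = newId("event");
        String state = newId("state");
        String action = newId("action");

        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(event, "atTime", time));
        triples.add(new Triple(event, "hasOutput", state));
        triples.add(new Triple(itemId, "inState", state));
        triples.add(new Triple(state, "contains", itemId));
        triples.add(new Triple(action, "creates", itemId));
        triples.add(new Triple(event, "hasAction", action));
        triples.add(new Triple(action, "usesTool", toolId));
        triples.add(new Triple(action, "hasAgent", userId));
        return triples;
    }
}
```

Each Triple prints in the same arrow notation used above, so the
output of itemSubmit can be compared line by line with the list of
relationships.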

What History Data Is Not

History Data is not version control information. No effort has been
made to provide diffs, merges, or highly efficient storage; instead,
effort is focused on simple "remembrance". Note that this does not
preclude more sophisticated approaches later.

History Data does not attempt to reconcile any contradictions in the
data it serializes.

History Data does not keep track of any kind of "current state".

Mick Bass

Research and Business Development
HP Laboratories
Hewlett-Packard Company
1 Cambridge Center
Cambridge, MA 02142

617.551.7634 office    617.551.7650 fax
617.899.3938 mobile    617.627.9694 residence
bass@alum.mit.edu      mick_bass@hp.com

Received on Thursday, 22 May 2003 16:22:36 UTC