More History comments from Tansley, Robert on 2003-05-22 (www-rdf-dspace@w3.org from May 2003)

From: Tansley, Robert <robert.tansley@hp.com>
Date: Thu, 22 May 2003 08:45:10 -0700
To: " (www-rdf-dspace@w3.org)" <www-rdf-dspace@w3.org>
Message-ID: <40700B4C02ABD5119F00009027876644EA3924@hplex1.hpl.hp.com>

Here's my latest salvo of comments on the History system work. It's quite possible that many of the comments/issues below are outside of the scope of Jason's work; however, I think they serve to illustrate that the History system is a really, really complex use case and perhaps needs a great deal of simplification before we can make it tractable given our resources for SIMILE.

1/ Are we going to fix existing data, i.e. the data on hpds1.mit.edu? Some information might have been lost.

2/ Fixing the DSpace code will require someone on the DSpace side to fix the calls in the content management API. I'm pretty sure they're not 100% correct.

In particular, at the moment the History system stores data about every interaction during a submission, e.g. the submitter sets the title, uploads the file, then goes back to the metadata page and fixes the title. I don't think all those states need to be in the History system; from the point of view of the History system, the item is 'created' when it is accessioned (installed, added to the collection.)

3/ I propose getting rid of the DBMS indices; it turns out they're currently broken, and in none of the situations we've talked about could they be used (dump of History data to files to import to Haystack, access via Joseki).

4/ This work is going along with the idea of having the descriptive metadata of each state stored alongside the actual temporal History system data itself. This is in keeping with what RDF is supposedly good at, but it raises risk by exposing the History system to a plethora of highly complex issues: schema versioning; future developments in DSpace metadata handling having to take the History system into account*; possible scalability problems; and 'non-repudiation' becoming more difficult. This places the work firmly in the research arena.

Unfortunately, to proceed in the way that I'd prefer we'd need a good AIP specification for DSpace, and that doesn't exist yet.

* - By this I mean: say a future development of DSpace allows communities to deposit XML schemas for the instance data they want to deposit. How would these schemas and instance data be stored in the History system? This is tricky and would make the aforementioned development difficult to implement.

5/ Are we reinstating bundles?

6/ 2.2: No mention of Bitstream Formats.

7/ The Handles question: I don't think this is that important for the moment; it will be easy to change or overlay whatever is necessary to comply with Semantic Web standards/practices (which are a moving target.) The History system is mostly concerned with situations, i.e. Items as they were at a particular point in time, and these situations/Item states are not going to have resolvable anythings as URIs, because the History data itself is the point of reference for that situation data.

Put another way: I don't ask the History system 'what's the title of this Item'. I ask it, 'what was the title of the Item in this situation/date'. The URI of that situation is not going to be the Handle of the item. That URI does not need to be resolvable to some external data, because the situation data is right there in the History system.
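To make the distinction concrete, here is a minimal sketch of that kind of query. It is an illustration only: the record layout, the identifiers such as "item1:1", the dates, the titles and the function names are all hypothetical, not anything the History system currently defines.

```python
from datetime import date

# Hypothetical situation records. Each situation has its own identifier,
# distinct from the Item's Handle, and carries the metadata as it was then.
SITUATIONS = [
    {"id": "item1:1", "item": "item1", "since": date(2003, 1, 10), "title": "Draft title"},
    {"id": "item1:2", "item": "item1", "since": date(2003, 3, 2), "title": "Final title"},
]


def situation_at(item, when):
    """Return the latest situation of `item` that began on or before `when`."""
    candidates = [s for s in SITUATIONS if s["item"] == item and s["since"] <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["since"])


def title_at(item, when):
    """Answer 'what was the title of the Item at this date?'."""
    situation = situation_at(item, when)
    return situation["title"] if situation else None
```

Nothing here needs the situation identifier to resolve to external data; the answer comes entirely from the stored situation records.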

8/ Follows on from 7/... I think Jason's suggestions about naming and relating Item states/situations are fine:

http://lists.w3.org/Archives/Public/www-rdf-dspace/2003May/0059.html

(though I wouldn't have the 'sameIndividualAs' part; there is enough data there to determine which is the latest state if necessary)

9/ What is part of a single situation? How do events that affect objects contained within other objects affect the containers?

I've asked this before and no one has responded to it. Consider a simple archive with one community, collection, item, bundle and bitstream:

Com1 --hasPart--> Coll1 --hasPart--> Item1
Item1 --hasPart--> Bundle1 --hasPart--> Bitstream1

A new bitstream is added to Bundle1. What has changed? What is part of the new situation? Is Item1 in a new situation or just Bundle1? Modelling even this super-simple situation turns out to be really rather complicated.

See the attached picture from my whiteboard (grainy, and I had to bolt two photos together, but hopefully you can get the gist). The blue arcs shooting off to nowhere are the metadata associated with an object, e.g. name, dc:title, dc:creator. hasRz = hasRealization.

Along the top are the basic objects (COM1, COL1, ITEM1, BND1, BITS1). You could relate these with 'hasPart' (as shown in green) but this isn't very useful. Say ITEM1 was moved away from COL1. Would you remove the green 'hasPart' relationship? Then you wouldn't be able to tell that ITEM1 was in COL1 at some time.

So, you have to associate hasPart arcs with the situations (for the sake of argument I've used COL1:1, ITEM1:1 to show these, connected with 'hasRz' arcs). I've shown these new hasPart arcs in red. (Lots of bits of the model are of course missing, like the actions and events.)

This looks fine, until you actually try to change something. Say I add a new bitstream, BITS2, to BND1. I've shown this in that nasty orange colour. This obviously constitutes a change in BND1; you can't just draw another hasPart arc between BND1:1 and BITS2:1, since you would never be able to tell that BND1 in one situation contained only BITS1. So, you create a new situation for BND1, called BND1:2, and have hasPart arcs between that and BITS1:1 and BITS2:1. (This assumes that the situation of BITS1 is not changed by virtue of the fact that the Bundle it is in has changed.)

But now, the state of ITEM1 has changed, since you need a new hasPart arc between ITEM1:1 and BND1:2 to indicate that ITEM1 now includes the new BND1 situation. This hasPart arc is shown with the dotted line. Now we have the same problem with ITEM1:1; it doesn't contain both BND1:1 and BND1:2 in the same situation. Really we need a new situation ITEM1:2. Then the same thing will happen for COL1, and COM1. So from this point of view, any minor change to the archive means almost the whole archive is in a new state.
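The cascade can be reproduced with a small sketch. This is only a toy model under my own assumptions: the dictionaries and the helper names (sit, snapshot, add_part) are hypothetical and not part of DSpace or Harmony, and it takes the naive rule that any change to a part creates a new situation for its container.

```python
# Container of each object in the example archive (None = top level).
PARENT = {"COM1": None, "COL1": "COM1", "ITEM1": "COL1", "BND1": "ITEM1", "BITS1": "BND1"}

# Current situation number of each object, e.g. BND1 -> 1 means "BND1:1".
version = {name: 1 for name in PARENT}

# hasPart arcs recorded per situation, e.g. "BND1:1" -> ["BITS1:1"].
has_part = {}


def sit(name):
    """Identifier of the current situation of an object."""
    return f"{name}:{version[name]}"


def children(name):
    return [child for child, parent in PARENT.items() if parent == name]


def snapshot(name):
    """Record hasPart arcs from the current situation of `name`."""
    has_part[sit(name)] = [sit(child) for child in children(name)]


for obj in PARENT:
    snapshot(obj)


def add_part(container, new_obj):
    """Add new_obj to container and return every situation created as a result."""
    PARENT[new_obj] = container
    version[new_obj] = 1
    snapshot(new_obj)
    created = [sit(new_obj)]
    node = container
    while node is not None:
        # Naive rule: a changed part forces a new situation for its container.
        version[node] += 1
        snapshot(node)
        created.append(sit(node))
        node = PARENT[node]
    return created


created = add_part("BND1", "BITS2")
print(created)
```

Adding the one bitstream yields new situations for BND1, ITEM1, COL1 and COM1 as well as BITS2 itself; only BITS1 keeps its original situation, and the old arcs (such as BND1:1 to BITS1:1) remain recorded.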

This obviously isn't very tenable. Is there some aspect of Harmony I haven't seen or don't get that deals with this?

Robert Tansley / Hewlett-Packard Laboratories / (+1) 617 551 7624

Attachments

image/jpeg attachment: history-graph.jpg

Received on Thursday, 22 May 2003 11:45:16 UTC