Pre-reading for 05/24 RDFI phone conference

I asked Peter Breton to prepare some background for our phone conference 
tomorrow so that we are not starting with a blank page, and so you can 
synchronize with our current thinking and understand what we've already 
investigated.  I've included Peter's set of "grounding statements" 
below.  Please review them in advance of the phone conference.

I propose we use our time as follows:

I.	Brief review of The Goal and the motivation for achieving it.
II. 	Review approaches 1a, 1b, 2 below.  Consensus on 1a?
III.	Approach strategy discussion (as required)
IV.	Implementation discussion (for selected approach)
V.	Automation strategies for RDF generation

We have only an hour, so I will take the liberty of moderating ruthlessly.

I'd like to thank Peter for this prep material and each of you for 
contributing to the discussion.

Once again, logistics:

7:30 - 8:30 am US/Pacific, 10:30 - 11:30 US/Eastern, 3:30 - 4:30 GMT
Meet-Me: 404-774-4109  or TN 774-4109
Meeting ID: 6116

IRC Server / Channel: irc.openprojects.net / #dspace

- Mick

==============================================================

Date: Wed, 23 May 2001 19:15:41 -0400
From: Peter Breton <pbreton@MIT.EDU>
Organization: MIT
To: Mick Bass <bass@mit.edu>
Subject: Grounding statements for phone conference tomorrow


The Platform
==========
DSpace is currently implemented with Java, JSP/Servlets and a Postgres 
backend. Approaches and toolkits outside this domain will probably be 
useful mainly for architecture and concept mining.


The Goal
=======
Implement a (probably persistent, RDBMS-backed) triple store for DSpace 
data. The triple store must be outside the critical path; it should be 
possible to use DSpace without the triple store and suffer no significant 
loss of functionality.


Approaches (this is not an exhaustive list):
===============================


1) Add-on triple store


a. Via Java code. In this scenario, when DSpace obtains new data (e.g., a 
new Publication is submitted), it makes this information available to an 
RDF storage module.  The storage module somehow (?) creates RDF triples 
and persists them (in addition to any other storage for these objects). 
I expect that most significant actions in the system will cause 
corresponding (Java Bean) Events to be fired, carrying in-memory 
representations of the objects, so actually connecting the RDF storage 
module with the data to be persisted should be straightforward.
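
To make this concrete, here is a minimal sketch of such a module. The 
listener interface and the single three-column triples table are 
assumptions for illustration; none of these names come from the actual 
DSpace code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.EventObject;

/** Hypothetical listener interface; the real DSpace event API may differ. */
interface DSpaceEventListener {
    void eventFired(EventObject event);
}

/**
 * Add-on RDF storage module (approach 1a): listens for Java Bean events,
 * derives (subject, predicate, object) triples from the in-memory objects
 * they carry, and persists them in a table of its own.
 */
class RdfStorageModule implements DSpaceEventListener {

    private final Connection db;   // connection to the Postgres backend

    RdfStorageModule(Connection db) {
        this.db = db;
    }

    public void eventFired(EventObject event) {
        // The mapping logic goes here: inspect the event's objects and
        // call storeTriple() for each relationship worth recording.
    }

    void storeTriple(String subject, String predicate, String object) {
        try {
            PreparedStatement ps = db.prepareStatement(
                "INSERT INTO triples (subject, predicate, object) " +
                "VALUES (?, ?, ?)");
            ps.setString(1, subject);
            ps.setString(2, predicate);
            ps.setString(3, object);
            ps.executeUpdate();
            ps.close();
        } catch (SQLException e) {
            // The store is outside the critical path: log and move on,
            // never let triple storage break the mainstream operation.
            System.err.println("RDF storage failed: " + e.getMessage());
        }
    }
}

The try/catch embodies The Goal above: a failure in the triple store must 
not take DSpace down with it.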


b. Via database triggers. In this RDBMS-centric approach, whenever a row is 
inserted or modified in the RDBMS, a trigger fires which creates 
corresponding triples in some other tables. The triple tables and the 
ordinary mainstream tables are otherwise completely separate. 
Implementation here would focus on creating the triggers.


While approach b) has some nice features (transparent operation and 
automatic synchronization, to name two), our current feeling is that it is 
too limiting: it forces all data that can be expressed as RDF into the 
Procrustean bed of relational tables, and it requires writing the 
synchronization logic in an RDBMS language.


2) Virtual triple store


Similar to 1), except that the redundant triple store is not actually 
physically created; instead, we only create logic that can transform our 
data into RDF triples (for export and the like). This essentially pushes 
the problem over to someone else, who can use the RDF data to create their 
own triple store if they like.


Note that while there is a lot of overlap with 1), the virtual triple store 
problem is much easier, since it need not consider persistence; it only 
needs to do half of the job.
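
As a sketch of what that half of the job might look like: the TripleSource 
hook and the (deliberately naive) serialization below are assumptions for 
illustration, not existing DSpace or toolkit APIs.

import java.io.PrintWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

/** One triple; for simplicity the object is carried as a plain string. */
class Triple {
    final String subject, predicate, object;
    Triple(String s, String p, String o) {
        subject = s; predicate = p; object = o;
    }
}

/** Hypothetical hook: any DSpace object that can describe itself as triples. */
interface TripleSource {
    List getTriples();   // a List of Triple objects
}

/** "Virtual" store: transforms data to RDF on demand, persists nothing. */
class RdfExporter {
    void export(TripleSource source, Writer out) {
        PrintWriter w = new PrintWriter(out);
        w.println("<?xml version=\"1.0\"?>");
        w.println("<rdf:RDF xmlns:rdf=" +
            "\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">");
        for (Iterator i = source.getTriples().iterator(); i.hasNext(); ) {
            Triple t = (Triple) i.next();
            w.println("  <rdf:Description rdf:about=\"" + t.subject + "\">");
            // Real output needs namespace-qualified predicates and
            // XML escaping; this is only the shape of the idea.
            w.println("    <" + t.predicate + ">" + t.object +
                      "</" + t.predicate + ">");
            w.println("  </rdf:Description>");
        }
        w.println("</rdf:RDF>");
        w.flush();
    }
}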


Approaches II
==========
I currently favor approach 1a), basically because:


* I think a modern programming language is the right place to put 
potentially complex mapping logic
* I think DSpace will get farther with a real, honest-to-gosh triple store 
than just a strategy for generating one


I would suggest that, if there is general consensus on this approach, we 
move directly to a discussion of how to implement such a beast. Otherwise, 
we should have a discussion of which approach to pursue.


Automating RDF generation
=========================
I would also like to discuss ways to minimize the effort necessary to 
capture triples.


Two promising areas are:


1. Creation of triples from RDBMS column and foreign-key relationships 
(sketched just below)
2. Creation of triples from Java Beans (discussed, with sketches, next)
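
For the first area, JDBC already exposes the schema metadata we would 
need. A minimal sketch (the predicate naming is invented for 
illustration):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Derives relationship triples from the foreign keys of a table. */
class SchemaTripleGenerator {

    void describeForeignKeys(Connection db, String table) throws SQLException {
        DatabaseMetaData meta = db.getMetaData();
        ResultSet fks = meta.getImportedKeys(null, null, table);
        while (fks.next()) {
            String fkColumn = fks.getString("FKCOLUMN_NAME");
            String pkTable  = fks.getString("PKTABLE_NAME");
            // e.g. publication.user_id -> (publication, refersTo_user, user)
            System.out.println("(" + table + ", refersTo_" + pkTable +
                               ", " + pkTable + ")  // via " + fkColumn);
        }
        fks.close();
    }
}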


I think the second of these would work something like this: let's say that 
our hypothetical RDF storage module gets a PublicationSubmittedEvent. The 
event contains, as data fields, the new Publication, the User who submitted 
it, the time the event occurred, and the User's session.
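
In code, such an event might look like the following. Every name here is 
hypothetical, chosen only to mirror the description above; the session is 
stood in for by its id so the sketch needs nothing beyond the JDK.

import java.util.Date;
import java.util.EventObject;

/** Hypothetical event carrying the data fields described above. */
class PublicationSubmittedEvent extends EventObject {

    private final Object publication;   // the new Publication bean
    private final Object submitter;     // the User who submitted it
    private final Date   occurredAt;    // when the event occurred
    private final String sessionId;     // stands in for the User's session

    PublicationSubmittedEvent(Object source, Object publication,
                              Object submitter, Date occurredAt,
                              String sessionId) {
        super(source);
        this.publication = publication;
        this.submitter   = submitter;
        this.occurredAt  = occurredAt;
        this.sessionId   = sessionId;
    }

    Object getPublication() { return publication; }
    Object getSubmitter()   { return submitter;   }
    Date   getOccurredAt()  { return occurredAt;  }
    String getSessionId()   { return sessionId;   }
}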


The storage module could create a number of relationships based on this 
information -- for example, the language of said User was German; the User 
logged in from www.bluewin.de; the Publication has Author "Hans Gretel"; 
and so forth. Some of these relationships would need to be characterized: 
for example, it's not clear how the User and the Publication are related, 
so we might add an annotation that says that, for 
PublicationSubmittedEvents, the relationship between the two is 
"submittedBy".

However, it would be nice if most of this information could be derived 
directly from the object. When a Publication has a "title" property (in 
Java Bean speak), the system can automatically construct: publication with 
id 141 has title "Harry Potter and the Sorcerer's Stone". Basically, as 
much as possible, you indicate your intentions by simply DOING (in this 
case, creating an object with a title property); but if for some reason 
you must do one thing but mean something different, you add an annotation 
which expresses your real meaning.
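
The property-driven part of this is little more than standard Java Bean 
introspection. A minimal sketch, assuming only the java.beans API (the 
Publication stub and the subject-naming scheme are invented for 
illustration):

import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

public class BeanTripleGenerator {

    /** Emits one (subject, property, value) triple per readable property. */
    static void describe(String subject, Object bean) throws Exception {
        // Stop at Object.class so getClass() isn't reported as a property.
        BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
        PropertyDescriptor[] props = info.getPropertyDescriptors();
        for (int i = 0; i < props.length; i++) {
            Method getter = props[i].getReadMethod();
            if (getter == null) continue;       // skip write-only properties
            Object value = getter.invoke(bean, (Object[]) null);
            if (value != null) {
                System.out.println("(" + subject + ", " + props[i].getName()
                                   + ", \"" + value + "\")");
            }
        }
    }

    /** Hypothetical bean, for illustration only. */
    public static class Publication {
        public int getId() { return 141; }
        public String getTitle() {
            return "Harry Potter and the Sorcerer's Stone";
        }
    }

    public static void main(String[] args) throws Exception {
        describe("publication:141", new Publication());
        // prints, in some order:
        // (publication:141, id, "141")
        // (publication:141, title, "Harry Potter and the Sorcerer's Stone")
    }
}

The characterizing annotations ("submittedBy" and the like) would then be 
the only mapping information anyone writes by hand.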


Toolkits we've looked at (in varying degrees of depth):
=======================================================


* Brian McBride's Jena (and various Jena extensions) -- nice code, Brian!
* Eric P's RDF db store (overview only -- haven't poked around in the code)
* OCLC/Dublin Core's EOR (actually, I just started reading the docs :)
* Survey of storing RDF in an RDBMS
* RDF SiRPAC API


Thanks again for lending your time,


Peter
=============================================
Mick Bass, Sloan MOT 2000

R&D Project Manager, Hewlett-Packard Company
Building 10-500 MIT, 77 Massachusetts Avenue
Cambridge, MA 02139-4307

617.253.6617 office    617.452.3000 fax
617.899.3938 mobile    617.627.9694 residence
bass@alum.mit.edu      mick_bass@hp.com
=============================================
