RE: Pre-reading for 05/24 RDFI phone conference [2]

[Peter Breton]
I currently favor:
1) Add-on triple store
a. Via Java code. In this scenario, when DSpace obtains new data (e.g., a
new Publication is submitted), it makes this information available to an
RDF storage module.  The storage module somehow (?) creates RDF triples and
persists them (in addition to any other storage for these objects).

[wmj]
It seems worthwhile to consider the data model described in "The
Associative Model of Data" by Simon Williams (ISBN 1-903453-00-3), which is
used as the backbone for SENTENCES [http://www.lazysoft.com/].
Here is a comparison with a triple store, copied from the book:
"Under the associative model, the 3-tuples in essence become 4-tuples, where
the first column is the identity of the tuple itself:
Mary Jones lives in Britain
Amazon sells Dr No
    ... for £9.50

Name table
----------
67 	Mary Jones
90	lives in
14	Britain
23	Amazon
38	sells
41	Dr No
53	for
98	£9.50

Associations
------------
71	67	90	14
77	23	38	41
03	77	53	98
"

Best

WMJ

-----Original Message-----
From: www-rdf-dspace-request@w3.org
[mailto:www-rdf-dspace-request@w3.org] On Behalf Of Mick Bass
Sent: Wednesday, May 23, 2001 9:16 PM
To: rdfi@cally.hpl.hp.com; www-rdf-dspace@w3.org
Cc: nick_wainwright@hp.com
Subject: Pre-reading for 05/24 RDFI phone conference


I asked Peter Breton to prepare some background for our phone conference
tomorrow so that we are not starting with a blank page, and so you can
synchronize with our current thinking and understand what we've already
investigated.  I've included Peter's set of "grounding statements"
below.  Please review them in advance of the phone conference.

This gives us the following use of our time:

I.	Brief review of The Goal and the motivation for achieving it.
II. 	Review approaches 1a, 1b, 2 below.  Consensus on 1a?
III.	Approach strategy discussion (as required)
IV.	Implementation discussion (for selected approach)
V.	Automation strategies for RDF generation

We have only an hour, so I will take the liberty of moderating ruthlessly.

I'd like to thank Peter for this prep material and each of you for
contributing to the discussion.

Once again, logistics:

7:30 - 8:30 am US/Pacific, 10:30 - 11:30 am US/Eastern, 3:30 - 4:30 pm UK
Meet-Me: 404-774-4109  or TN 774-4109
Meeting ID: 6116

IRC Server / Channel: irc.openprojects.net / #dspace

- Mick

==============================================================

Date: Wed, 23 May 2001 19:15:41 -0400
From: Peter Breton <pbreton@MIT.EDU>
Organization: MIT
To: Mick Bass <bass@mit.edu>
Subject: Grounding statements for phone conference tomorrow


The Platform
==========
DSpace is currently implemented in Java, with JSP/Servlets and a Postgres
backend. Approaches and toolkits that are outside this domain will probably
be useful mainly for architecture and concept mining.


The Goal
=======
Implement a (probably persistent, RDBMS-backed) triple store for DSpace
data. The triple store must be outside the critical path: it should be
possible to use DSpace without it, with no significant loss of
functionality.
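
In code terms, "outside the critical path" might look like the following
minimal interface (the name and shape are illustrative, nothing is
decided): the DSpace core never calls it directly, so DSpace runs
unchanged when no implementation is present.

// Hypothetical seam between DSpace and the triple store; nothing in
// the core depends on this interface being implemented.
public interface TripleStore {
    void add(String subject, String predicate, String object);
}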


Approaches (this is not an exhaustive list):
===============================


1) Add-on triple store


a. Via Java code. In this scenario, when DSpace obtains new data (e.g., a
new Publication is submitted), it makes this information available to an
RDF storage module.  The storage module somehow (?) creates RDF triples and
persists them (in addition to any other storage for these objects).
I expect that most significant actions in the system will cause
corresponding (Java Bean) Events to be fired, with in-memory
representations of the objects; so actually connecting the RDF storage
module with the data to be persisted should be straightforward.
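
To make the wiring concrete, here is a rough sketch (the event and class
names are hypothetical; TripleStore is the interface sketched under The
Goal):

// Hypothetical Java Bean event fired when a Publication is submitted
class PublicationSubmittedEvent {
    final String publicationId;
    final String title;
    PublicationSubmittedEvent(String publicationId, String title) {
        this.publicationId = publicationId;
        this.title = title;
    }
}

// The RDF storage module listens for events and persists triples;
// if it is never registered, DSpace behaves exactly as before.
class RdfStorageModule {
    private final TripleStore store;
    RdfStorageModule(TripleStore store) { this.store = store; }

    public void onPublicationSubmitted(PublicationSubmittedEvent e) {
        // map the in-memory object to triples and persist them
        store.add("publication:" + e.publicationId, "dc:title", e.title);
    }
}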


b. Via database triggers. In this RDBMS-centric approach, whenever a row is
inserted or modified in the RDBMS, a trigger fires which creates
corresponding triples in some other tables. The triple tables and the
ordinary mainstream tables are otherwise completely separate.
Implementation here would focus on creating the triggers.


While approach b) has some nice features (transparent operation and
automatic synchronization, to name two), our current feeling is that it is
too limiting: all the data that can be RDFed must be fitted into the
Procrustean bed of relational tables, and the synchronization logic must be
written in an RDBMS language.


2) Virtual triple store


Similar to 1), except that the redundant triple store is never physically
created; instead, we only create logic that can transform our data into RDF
triples (for export and the like). This essentially pushes the problem over
to someone else, who can use the RDF data to create their own triple store
if they like.


Note that while there is a lot of overlap with 1), the virtual triple store
problem is much easier, since it need not consider persistence; it only
needs to do half of the job.
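
For comparison, the export-only half might be as small as this (the URI
scheme and class name are illustrative; only the Dublin Core title URI is
standard):

import java.io.PrintWriter;

// Export-only: writes N-Triples for a publication without ever
// persisting them on our side.
class RdfExporter {
    void exportPublication(String id, String title, PrintWriter out) {
        String subject = "<http://dspace.example/publication/" + id + ">";
        String predicate = "<http://purl.org/dc/elements/1.1/title>";
        out.println(subject + " " + predicate + " \""
                + title.replace("\"", "\\\"") + "\" .");
    }
}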


Approaches II
==========
I currently favor approach 1a), basically because:


* I think a modern programming language is the right place to put
potentially complex mapping logic
* I think DSpace will get farther with a real, honest-to-gosh triple store
than with just a strategy for generating one


I would suggest that, if there is general consensus on this approach, we
move directly to a discussion of how to implement such a beast. Otherwise,
we should discuss which approach to pursue.


Automating RDF generation
=======================
I would also like to discuss ways to minimize the effort necessary to
capture triples.


Two promising areas are:


1. Creation of triples from RDBMS column and foreign key relationships (a
rough sketch follows)
2. Creation of triples from Java Beans
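
For the first area, JDBC already exposes the needed metadata; a rough
sketch (the table name and predicate scheme are illustrative, TripleStore
as above):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;

class SchemaTripleExtractor {
    // Each imported (foreign) key suggests a predicate linking the
    // child table to the parent table it references.
    void extract(Connection conn, TripleStore store) throws Exception {
        DatabaseMetaData meta = conn.getMetaData();
        try (ResultSet rs =
                meta.getImportedKeys(null, null, "publication")) {
            while (rs.next()) {
                store.add(rs.getString("FKTABLE_NAME"),
                          rs.getString("FKCOLUMN_NAME"),
                          rs.getString("PKTABLE_NAME"));
            }
        }
    }
}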


I think the second of these would work something like this: let's say that
our hypothetical RDF storage module gets a PublicationSubmittedEvent. The
event contains, as data fields, the new Publication, the User who submitted
it, the time the event occurred, and the User's session.


The storage module could create a number of relationships based on this
information -- for example, the language of said User was German; the User
logged in from www.bluewin.de; the Publication has Author "Hans Gretel";
and so forth. Some of these relationships would need to be characterized:
for example, it's not clear how the user and publication are related, so we
might add an annotation that says that, for PublicationSubmittedEvents, the
relationship between the two is "submittedBy".

However, it would be nice if most of this information could be derived
directly from the object: when a Publication has a "title" property (in
Java Bean speak), the system can automatically construct: publication with
id 141 has title "Harry Potter and the Sorcerer's Stone".  Basically, as
much as possible, you indicate your intentions by simply DOING (in this
case, creating an object with a title property); but if for some reason you
must do one thing but mean something different, you add an annotation which
expresses your real meaning.
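
A sketch of that bean-derived half (the getter-scanning is the whole
trick; the class name and everything else here are hypothetical):

import java.lang.reflect.Method;

class BeanTripleExtractor {
    // Turn each readable bean property into a triple:
    //   <subjectId> <propertyName> "value"
    void extract(String subjectId, Object bean, TripleStore store)
            throws Exception {
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            if (name.startsWith("get") && name.length() > 3
                    && m.getParameterCount() == 0
                    && !name.equals("getClass")) {
                Object value = m.invoke(bean);
                if (value != null) {
                    // "getTitle" -> "title"
                    String property =
                            Character.toLowerCase(name.charAt(3))
                            + name.substring(4);
                    store.add(subjectId, property, value.toString());
                }
            }
        }
    }
}

Run over a Publication bean with a getTitle() property, this produces the
"publication 141 has title ..." triple with no extra code; the
"submittedBy" case is exactly where the annotation mechanism would kick
in.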


Toolkits we've looked at (in various degrees of depth):
=======================================


* Brian McBride's Jena (and various Jena extensions) -- nice code, Brian!
* Eric P's RDF db store (overview only -- haven't poked around in the code)
* OCLC/Dublin Core's EOR (actually, I just started reading the docs :)
* Survey of storing RDF in an RDBMS
* RDF SiRPAC API


Thanks again for lending your time,


Peter
=============================================
Mick Bass, Sloan MOT 2000

R&D Project Manager, Hewlett-Packard Company
Building 10-500 MIT, 77 Massachusetts Avenue
Cambridge, MA 02139-4307

617.253.6617 office    617.452.3000 fax
617.899.3938 mobile    617.627.9694 residence
bass@alum.mit.edu      mick_bass@hp.com
=============================================

Received on Thursday, 24 May 2001 01:24:33 UTC