[Fwd: Re: [dquan@mit.edu: jena evaluation]]

Prior response also forwarded to www-rdf-dspace for the record.
Dave

-------- Original Message --------
Subject: Re: [dquan@mit.edu: jena evaluation]
Date: Mon, 15 Jul 2002 12:34:12 +0100
From: Dave Reynolds <der@hplb.hpl.hp.com>
Organization: Hewlett-Packard Laboratories
To: karger@theory.lcs.mit.edu
References: <200207121703.g6CH3HS16384@harrier.lcs.mit.edu>

Hi David,

> Hi Dave.  Thanks for doing that test; 

[snip]

> 1. Statement IDs. It's impossible (from what I can tell) to search for a
> statement in the database based solely on statement ID without doing a
> linear search. Our install/uninstall mechanism (among other things)
> depends on this functionality. I could in theory alter the source code
> to handle this...

Depends how you are assigning statement IDs. 

If you use XML IDs to give a statement an explicit ID then it will be explicitly
reified and you can then retrieve that resource directly, no linear search.
However, you do suffer the explicit reification overhead. 

If you use the internal shortcut of directly referring to a statement from
another statement then the referred-to statement will effectively be treated as
a bNode with a UUID generated as an anonymous ID. Internally that is enough to
retrieve the statement. It is probably right that that functionality is not
exposed through the API but it would be easy enough to add it.

> 2. Predicates. For some odd reason, Jena mandates that predicate URIs
> must contain at least one slash in them so that a namespace can be
> deduced from it, allowing vocabulary/ontology management. That is
> certainly going to get in the way for us. I could also alter the source
> code to get around this one...

Not quite and it is not odd - it is part of the RDF specs. Or perhaps I should
say it *is* odd but it is part of the specs :-) It is not directly related to
ontology management. 

The issue is that according to the original M&S spec a predicate is identified
by a QName (i.e. namespace plus local name part) but a predicate is also a
specialization of a resource so it should have a URI of its own. The spec
squares this circle by saying that the URI is the concatenation of the namespace
and the localname part. Now the problem is that if you only have the URI how do
you find the namespace/localname split? According to the RDF working group
(though nothing is fixed until they survive Last Call) the algorithm is to work
back from the right hand end of the URI until you find the first character which
is not legal in an XML element production - for almost all RDF namespaces this
is '#' because localnames are introduced using fragment-ids but in some cases
can be '/' or other characters. 
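
A minimal sketch of that right-to-left scan, in case it helps to see it written down. This is not jena's actual code: the class and method names are made up for illustration, the "legal name character" test is a rough approximation of the XML rules, and the final step that skips forward to a legal name start character is my own assumption (XML names cannot begin with a digit, '-' or '.').

```java
public class NamespaceSplit {

    // Approximation of an XML name character (assumption: the real XML
    // production is more detailed than letters, digits, '_', '-' and '.').
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Approximation of a legal first character of an XML name.
    static boolean isNameStartChar(char c) {
        return Character.isLetter(c) || c == '_';
    }

    /** Index where the local name starts; uri.length() if there is none. */
    public static int splitIndex(String uri) {
        int i = uri.length();
        // Work back from the right hand end while characters are legal in a name.
        while (i > 0 && isNameChar(uri.charAt(i - 1))) {
            i--;
        }
        // Assumed extra step: move forward until the local name starts legally.
        while (i < uri.length() && !isNameStartChar(uri.charAt(i))) {
            i++;
        }
        return i;
    }

    public static String namespace(String uri) {
        return uri.substring(0, splitIndex(uri));
    }

    public static String localName(String uri) {
        return uri.substring(splitIndex(uri));
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            "http://purl.org/dc/elements/1.1/title",
            "urn:example:1abc"
        };
        for (String uri : samples) {
            System.out.println(namespace(uri) + " | " + localName(uri));
        }
    }
}
```

Under this sketch a URI that ends in a non-name character (for example a trailing '/') ends up with an empty local name, which is the kind of predicate URI that cannot be written as a QName.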

None of this is particular to jena though jena is a little unusual in trying to
enforce this. However, the current implementation of the jena namespace split
algorithm is not fully compliant. It will work for the standard usage patterns
but, at least for a while, had some problems with IRIs (i.e. with unicode
characters) though that may be fixed by now.

I'm curious as to what format predicate URIs you are using and whether they are
legal RDF.

> 3. Query synchronization. Andy Seaborne (Jena team developer) writes:
> 
> The basic rule: Don't modify the model (add or remove statements) while
> a query is executing.  Reading information is safe. The way to get round
> this is to record changes to be made in a separate data structure, such
> as recording statements to be removed in a set, then perform them after
> the query results iterator has been closed.
> 
> This one is harder to fix...

This quote is aimed more at memory models than databases I think.

The current ModelMem implementation is not designed as an in-memory database;
think of it as a glorified java collection. If you update a HashMap (for
example) while an iterator is running things will break (actually you'll get a
ConcurrentModificationException).
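
To illustrate with plain java collections rather than jena itself, here is a small sketch of the "record the changes, apply them afterwards" pattern from Andy's note. The class and method names are invented for the example; a Map stands in for the model and the loop stands in for the query results iterator.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class DeferredRemoval {

    /**
     * Removes every entry whose value equals 'unwanted' and returns how many
     * were removed. Calling map.remove() inside the loop would modify the map
     * under a live iterator, which HashMap normally reports with a
     * ConcurrentModificationException, so removals are recorded first.
     */
    public static <K, V> int removeMatching(Map<K, V> map, V unwanted) {
        Set<K> toRemove = new HashSet<>();
        for (Map.Entry<K, V> entry : map.entrySet()) {
            if (Objects.equals(entry.getValue(), unwanted)) {
                toRemove.add(entry.getKey()); // record only, do not modify yet
            }
        }
        // The iteration has finished, so it is now safe to apply the changes.
        map.keySet().removeAll(toRemove);
        return toRemove.size();
    }

    public static void main(String[] args) {
        Map<String, String> statements = new HashMap<>();
        statements.put("s1", "obsolete");
        statements.put("s2", "current");
        statements.put("s3", "obsolete");
        int removed = removeMatching(statements, "obsolete");
        System.out.println(removed + " removed, remaining: " + statements.keySet());
    }
}
```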

In the jena 2 architecture concurrency handling for in-memory models may be
improved somewhat either natively or simply because we'll have this layered
architecture so that you could wrap an update manager around a model - much like
java's Collections.synchronizedSet approach.
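
For reference, this is roughly what the Collections.synchronizedSet approach looks like on an ordinary java set. It is only an analogy: nothing here is jena 2 code, and the class name and the use of integers as stand-in "statements" are assumptions for the example. Note that the wrapper makes individual calls safe, but iteration still has to be guarded by the caller.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SynchronizedWrapperSketch {

    /** Adds 2 * perThread distinct values from two threads, returns the count seen. */
    public static int addConcurrently(int perThread) throws InterruptedException {
        // Wrap a plain HashSet; each add() is now synchronized on the wrapper.
        Set<Integer> statements = Collections.synchronizedSet(new HashSet<>());

        Thread first = new Thread(() -> {
            for (int i = 0; i < perThread; i++) {
                statements.add(i);
            }
        });
        Thread second = new Thread(() -> {
            for (int i = 0; i < perThread; i++) {
                statements.add(perThread + i);
            }
        });
        first.start();
        second.start();
        first.join();
        second.join();

        // Iterating is a multi-step operation, so the caller must hold the lock.
        int count = 0;
        synchronized (statements) {
            for (Integer ignored : statements) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(addConcurrently(1000));
    }
}
```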

However, for the actual RDB implementation we of course have a real database so
transactions are appropriately isolated. By default we run in autocommit mode -
so every access and update is a single transaction. However, you can use
Model.begin()/.abort()/.commit() to batch actions together into a transaction.
In that case we ask the database to support the TRANSACTION_READ_COMMITTED level
of isolation. This means you should not see dirty reads.

This works straightforwardly for the raw RDF API that I reported on in my earlier
message (listing statements matching patterns, following pointers etc). 

Now the query languages like Andy's RDQL build higher level queries out of
sequences of lower level queries. In that case it is true that if part of a
query runs, then you delete some statements, then the rest of the query runs,
the second part might find that some of the resources it was about to check
have now disappeared - problem. So that is when the higher level query would
need to be wrapped in a transaction. I suspect that for that to work safely in
all cases we would need at least TRANSACTION_REPEATABLE_READ and possibly
TRANSACTION_SERIALIZABLE - off the top of my head I don't know which
database/jdbc-driver combinations support that level of transaction isolation
and what the performance impact would be. [Actually currently the transaction
isolation level is not trivially user settable but it would be easy to make it
so.]
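
As a rough sketch of what that could look like at the plain JDBC level (not the jena RDB code, and not something jena exposes today): ask the driver which isolation levels it supports, then run a group of low-level queries inside one transaction at that level. The class name, the Work callback and both method names are hypothetical; the java.sql calls themselves are standard JDBC.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.SQLException;

public class IsolationSketch {

    /** Hypothetical callback: the low-level queries that must run together. */
    public interface Work {
        void run(Connection conn) throws SQLException;
    }

    /** Strongest isolation level the driver reports as supported. */
    public static int strongestSupported(Connection conn) throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        int[] preferred = {
            Connection.TRANSACTION_SERIALIZABLE,
            Connection.TRANSACTION_REPEATABLE_READ,
            Connection.TRANSACTION_READ_COMMITTED
        };
        for (int level : preferred) {
            if (meta.supportsTransactionIsolationLevel(level)) {
                return level;
            }
        }
        return conn.getTransactionIsolation(); // fall back to the current level
    }

    /** Runs 'work' as one transaction at the given isolation level. */
    public static void runIsolated(Connection conn, int level, Work work)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        int oldLevel = conn.getTransactionIsolation();
        // Set the level before leaving autocommit mode: changing it in the
        // middle of a transaction is implementation-defined in JDBC.
        conn.setTransactionIsolation(level);
        conn.setAutoCommit(false);
        try {
            work.run(conn);
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback(); // undo the partial work, then report the failure
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
            conn.setTransactionIsolation(oldLevel);
        }
    }
}
```

A caller would do something like runIsolated(conn, strongestSupported(conn), c -> { ... }) around the whole higher level query. What this costs in performance on any given database is exactly the open question above.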

Of course if you only add statements, not delete them, it should all just work
anyway.

One final point: how much of a problem any of this is for you depends on your
overall architecture. In our ePerson work, as I was saying on the telecon, we
run our RDF data sources as separate servers and the clients pull back data in
largish blocks and process it locally. The data source API we use means that
queries and updates are packaged as single units anyway - we can transparently
switch between in-memory, file based and database-backed server
implementations and still maintain consistency.

Dave

Received on Friday, 19 July 2002 05:23:21 UTC