FW: RE: Jena in Haystack

Forwarded to www-rdf-dspace for the record

- Mick

-----Original Message-----
From: Dave Reynolds [mailto:der@hplb.hpl.hp.com]
Sent: Monday, July 15, 2002 7:34 AM
To: karger@theory.lcs.mit.edu
Subject: Re: [dquan@mit.edu: jena evaluation]


Hi David,

> Hi Dave.  Thanks for doing that test; I forwarded to my students.
> Here's an experiment one of them tried.  The 3rd point does seem
> tricky to deal with...
...
> 1. Statement IDs. It's impossible (from what I can tell) to search for a
> statement in the database based solely on statement ID without doing a
> linear search. Our install/uninstall mechanism (among other things)
> depends on this functionality. I could in theory alter the source code
> to handle this...

Depends how you are assigning statement IDs. 

If you use XML IDs to give a statement an explicit ID then it will be
explicitly reified and you can then retrieve that resource directly, no
linear search. However, you do suffer the explicit reification overhead.

If you use the internal shortcut of directly referring to a statement
from another statement then the referred-to statement will effectively
be treated as a bNode with a UUID generated as an anonymous ID.
Internally that is enough to retrieve the statement. It is probably
right that that functionality is not exposed through the API but it
would be easy enough to add it.

> 2. Predicates. For some odd reason, Jena mandates that predicate URIs
> must contain at least one slash in them so that a namespace can be
> deduced from it, allowing vocabulary/ontology management. That is
> certainly going to get in the way for us. I could also alter the source
> code to get around this one...

Not quite and it is not odd - it is part of the RDF specs. Or perhaps I
should say it *is* odd but it is part of the specs :-) It is not
directly related to ontology management.

The issue is that according to the original M&S spec a predicate is
identified by a QName (i.e. namespace plus local name part) but a
predicate is also a specialization of a resource so it should have a URI
of its own. The spec squares this circle by saying that the URI is the
concatenation of the namespace and the localname part. Now the problem
is that if you only have the URI how do you find the namespace/localname
split? According to the RDF working group (though nothing is fixed until
they survive Last Call) the algorithm is to work back from the right
hand end of the URI until you find the first character which is not
legal in an XML element production - for almost all RDF namespaces this
is '#' because localnames are introduced using fragment-ids but in some
cases can be '/' or other characters.
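
A minimal sketch of that right-to-left split, for illustration only. The
class and method names are hypothetical and this is not Jena's own code.
The name-character test is a deliberate simplification (letters, digits,
'_', '-', '.'); the real XML name rules are more detailed, and rules
about which characters may start a name are ignored here.

```java
final class NamespaceSplit {
    // Simplified stand-in for "legal in an XML element name" (assumption):
    // Unicode letters and digits plus '_', '-' and '.'.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Work back from the right-hand end until a non-name character is found.
    // Returns the index at which the local name starts.
    static int splitIndex(String uri) {
        int i = uri.length();
        while (i > 0 && isNameChar(uri.charAt(i - 1))) {
            i--;
        }
        return i;
    }

    static String namespace(String uri) {
        return uri.substring(0, splitIndex(uri));
    }

    static String localName(String uri) {
        return uri.substring(splitIndex(uri));
    }

    public static void main(String[] args) {
        String uri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
        System.out.println(namespace(uri)); // namespace part, ending in '#'
        System.out.println(localName(uri)); // local name part
    }
}
```

Under this sketch a URI with no '#', '/' or similar separator ends up
with an empty namespace, which is the kind of predicate URI that an
implementation enforcing the split would reject.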

None of this is particular to Jena though Jena is a little unusual in
trying to enforce this. However, the current implementation of the Jena
namespace split algorithm is not fully compliant. It will work for the
standard usage patterns but, at least for a while, had some problems
with IRIs (i.e. with unicode characters) though that may be fixed by
now.

I'm curious as to what format of predicate URIs you are using and
whether they are legal RDF.

> 3. Query synchronization. Andy Seaborne (Jena team developer) writes:
> 
> The basic rule: Don't modify the model (add or remove statements) while
> a query is executing.  Reading information is safe. The way to get round
> this is to record changes to be made in a separate data structure, such
> as recording statements to be removed in a set, then perform them after
> the query results iterator has been closed.
> 
> This one is harder to fix...

This quote is aimed more at memory models than databases I think.

The current ModelMem implementation is not designed as an in-memory
database; think of it as a glorified Java collection. If you update a
HashMap (for example) while an iterator is running things will break
(actually you'll get a ConcurrentModificationException).
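
To make the HashMap comparison concrete, here is a small sketch using
only java.util. It is not Jena code: a plain map stands in for the
model, and the class and method names are hypothetical. The first method
modifies the map under a live iterator; the second follows the advice
quoted above by recording removals in a separate set and applying them
once iteration has finished. Note that the JDK documents fail-fast
behaviour as best-effort, so it should not be relied on for correctness.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DeferredRemoval {
    // Removes entries while iterating. Returns true if the iterator
    // detected the change and threw ConcurrentModificationException.
    static boolean removeWhileIterating(Map<String, String> model) {
        try {
            for (String key : model.keySet()) {
                model.remove(key); // structural change under a live iterator
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Records the keys to remove first, then removes them after the
    // iteration is over. Returns how many entries were removed.
    static int removeAfterIterating(Map<String, String> model, String prefix) {
        Set<String> toRemove = new HashSet<>();
        for (String key : model.keySet()) {
            if (key.startsWith(prefix)) {
                toRemove.add(key);
            }
        }
        model.keySet().removeAll(toRemove); // safe: no iterator is open
        return toRemove.size();
    }

    public static void main(String[] args) {
        Map<String, String> model = new HashMap<>();
        model.put("tmp:1", "a");
        model.put("tmp:2", "b");
        model.put("keep:1", "c");
        System.out.println(removeAfterIterating(model, "tmp:"));
        System.out.println(model.keySet());
    }
}
```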

In the Jena 2 architecture concurrency handling for in-memory models may
be improved somewhat either natively or simply because we'll have this
layered architecture so that you could wrap an update manager around a
model - much like Java's Collections.synchronizedSet approach.
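
For reference, this is roughly what the Collections.synchronizedSet
wrapper approach looks like on a plain set. It only illustrates the JDK
idiom; how a Jena 2 update manager would actually be layered is not
specified here, and the class name is hypothetical. The wrapper
serialises individual calls, but iteration still has to be guarded by
the caller, and it does not help with modifying the set from inside your
own iteration loop.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class SynchronizedWrapperDemo {
    // Two threads add distinct strings to one wrapped set; returns the
    // number of elements counted afterwards.
    static int concurrentAdds(int perThread) {
        Set<String> statements = Collections.synchronizedSet(new HashSet<>());
        Thread a = new Thread(() -> {
            for (int i = 0; i < perThread; i++) statements.add("a" + i);
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < perThread; i++) statements.add("b" + i);
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Iteration over a synchronized wrapper must hold its lock.
        synchronized (statements) {
            int n = 0;
            for (String s : statements) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        System.out.println(concurrentAdds(1000));
    }
}
```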

However, for the actual RDB implementation we of course have a real
database so transactions are appropriately isolated. By default we run
in autocommit mode - so every access and update is a single transaction.
However, you can use Model.begin()/.abort()/.commit() to batch actions
together into a transaction. In that case we ask the database to support
the TRANSACTION_READ_COMMITTED level of isolation. This means you should
not see dirty reads.
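
A sketch of the begin/commit/abort batching pattern. To keep it
self-contained it does not use Jena: TxModel is a hypothetical stand-in
interface with the three calls named above, ToyTxModel is a toy
in-memory implementation that exists only to exercise the pattern, and
Batch.inTransaction is an illustrative helper. Exact signatures on the
real Model interface should be checked against the Jena javadoc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a model offering begin/commit/abort.
interface TxModel {
    void begin();
    void commit();
    void abort();
    void add(String statement);
}

// Toy implementation: autocommit outside a transaction, buffered inside.
final class ToyTxModel implements TxModel {
    private final List<String> committed = new ArrayList<>();
    private List<String> pending = null;

    public void begin() { pending = new ArrayList<>(committed); }

    public void commit() {
        committed.clear();
        committed.addAll(pending);
        pending = null;
    }

    public void abort() { pending = null; } // discard buffered changes

    public void add(String statement) {
        if (pending != null) {
            pending.add(statement);   // inside a transaction
        } else {
            committed.add(statement); // autocommit: one action, one transaction
        }
    }

    List<String> snapshot() { return new ArrayList<>(committed); }
}

final class Batch {
    // Runs the actions as one transaction: commit on success, abort on failure.
    static void inTransaction(TxModel model, Runnable actions) {
        model.begin();
        try {
            actions.run();
            model.commit();
        } catch (RuntimeException e) {
            model.abort();
            throw e;
        }
    }

    public static void main(String[] args) {
        ToyTxModel model = new ToyTxModel();
        inTransaction(model, () -> {
            model.add("s1");
            model.add("s2");
        });
        System.out.println(model.snapshot());
    }
}
```

At the JDBC level the isolation level corresponds to
Connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED).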

This works straightforwardly for the raw RDF API that I reported on in
my earlier message (listing statements matching patterns, following
pointers etc).

Now the query languages like Andy's RDQL build higher level queries out
of sequences of lower level queries so in that case it is true that if
part of a query runs, then you delete some statements, then the rest of
the query runs, the second part of the query might find that some of the
resources it was about to check have now disappeared - problem. So that
is the time when that higher level query would need to be wrapped in a
transaction. I suspect that for that to work safely in all cases we
would need at least TRANSACTION_REPEATABLE_READ and possibly
TRANSACTION_SERIALIZABLE - off the top of my head I don't know which
database/jdbc-driver combinations support that level of transaction
isolation and what the performance impact would be. [Actually currently
the transaction isolation level is not trivially user settable but it
would be easy to make it so.]

Of course if you only add statements, not delete them, it should all
just work anyway.

One final point on this is that how much of a problem any of this is for
you depends on your overall architecture. In our ePerson work, as I was
saying on the telecon, we run our RDF data sources as separate servers
and the clients pull back data in largish blocks and process locally.
The data source API we use means that queries and updates are packaged
as single units anyway - we can transparently switch between in-memory,
file based and database-backed server implementations trivially and
still maintain consistency.

Dave

Received on Friday, 19 July 2002 05:29:10 UTC