- From: Dave Reynolds <der@hplb.hpl.hp.com>
- Date: Fri, 19 Jul 2002 10:22:46 +0100
- To: www-rdf-dspace <www-rdf-dspace@w3.org>
Prior response also forwarded to www-rdf-dspace for the record.

Dave

-------- Original Message --------
Subject: Re: [dquan@mit.edu: jena evaluation]
Date: Mon, 15 Jul 2002 12:34:12 +0100
From: Dave Reynolds <der@hplb.hpl.hp.com>
Organization: Hewlett-Packard Laboratories
To: karger@theory.lcs.mit.edu
References: <200207121703.g6CH3HS16384@harrier.lcs.mit.edu>

Hi David,

> Hi Dave. Thanks for doing that test;

[snip]

> 1. Statement IDs. It's impossible (from what I can tell) to search for a
> statement in the database based solely on statement ID without doing a
> linear search. Our install/uninstall mechanism (among other things)
> depends on this functionality. I could in theory alter the source code
> to handle this...

Depends how you are assigning statement IDs.

If you use XML IDs to give a statement an explicit ID then it will be explicitly reified and you can then retrieve that resource directly, no linear search. However, you do suffer the explicit reification overhead.

If you use the internal shortcut of directly referring to a statement from another statement then the referred-to statement will effectively be treated as a bNode with a UUID generated as an anonymous ID. Internally that is enough to retrieve the statement. It is probably right that that functionality is not exposed through the API but it would be easy enough to add it.

> 2. Predicates. For some odd reason, Jena mandates that predicate URIs
> must contain at least one slash in them so that a namespace can be
> deduced from it, allowing vocabulary/ontology management. That is
> certainly going to get in the way for us. I could also alter the source
> code to get around this one...

Not quite and it is not odd - it is part of the RDF specs. Or perhaps I should say it *is* odd but it is part of the specs :-) It is not directly related to ontology management.

The issue is that according to the original M&S spec a predicate is identified by a QName (i.e.
namespace plus local name part) but a predicate is also a specialization of a resource so it should have a URI of its own. The spec squares this circle by saying that the URI is the concatenation of the namespace and the localname part.

Now the problem is that if you only have the URI how do you find the namespace/localname split? According to the RDF working group (though nothing is fixed until they survive Last Call) the algorithm is to work back from the right hand end of the URI until you find the first character which is not legal in an XML element production - for almost all RDF namespaces this is '#' because localnames are introduced using fragment-ids but in some cases can be '/' or other characters.

None of this is particular to jena though jena is a little unusual in trying to enforce this. However, the current implementation of the jena namespace split algorithm is not fully compliant. It will work for the standard usage patterns but, at least for a while, had some problems with IRIs (i.e. with unicode characters) though that may be fixed by now.

I'm curious as to what format predicate URIs you are using and whether they are legal RDF.

> 3. Query synchronization. Andy Seaborne (Jena team developer) writes:
>
> The basic rule: Don't modify the model (add or remove statements) while
> a query is executing. Reading information is safe. The way to get round
> this is to record changes to be made in a separate data structure, such
> as recording statements to be removed in a set, then perform them after
> the query results iterator has been closed.
>
> This one is harder to fix...

This quote is aimed more at memory models than databases I think. The current ModelMem implementation is not designed as an in-memory database; think of it as a glorified java collection. If you update a HashMap (for example) while an iterator is running things will break (actually you'll get a ConcurrentModificationException).
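
To make the HashMap analogy concrete, here is a minimal sketch in plain Java. It uses only java.util and none of the jena API; the class name DeferredRemoval, its methods and the sample keys are invented for illustration, with the map merely standing in for a model's statements. The first method removes entries while iterating, the second follows the advice in the quote above and records removals in a separate set.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DeferredRemoval {

    // Hypothetical sample data: the map stands in for a model's statements.
    static Map<String, String> sampleStatements() {
        Map<String, String> m = new HashMap<>();
        m.put("s1", "keep");
        m.put("s2", "drop");
        m.put("s3", "drop");
        m.put("s4", "keep");
        return m;
    }

    // Removes entries from the map while an iterator over it is still running.
    // Returns true if the iterator noticed the change and threw.
    static boolean failsWhenModifiedDuringIteration() {
        Map<String, String> statements = sampleStatements();
        try {
            for (String key : statements.keySet()) {
                if (statements.get(key).equals("drop")) {
                    statements.remove(key); // structural change mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Records the keys to remove in a separate set during iteration, then
    // applies the removals only after the iteration has finished.
    static Map<String, String> removeAfterIteration() {
        Map<String, String> statements = sampleStatements();
        Set<String> toRemove = new HashSet<>();
        for (Map.Entry<String, String> e : statements.entrySet()) {
            if (e.getValue().equals("drop")) {
                toRemove.add(e.getKey());
            }
        }
        statements.keySet().removeAll(toRemove); // safe: no iterator is open
        return statements;
    }

    public static void main(String[] args) {
        System.out.println("direct removal throws: " + failsWhenModifiedDuringIteration());
        System.out.println("left after deferred removal: " + removeAfterIteration().keySet());
    }
}
```

With a jena model the same shape would apply, with the loop replaced by the query results iterator and the removals performed after that iterator has been closed.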

In the jena 2 architecture concurrency handling for in-memory models may be improved somewhat, either natively or simply because we'll have this layered architecture so that you could wrap an update manager around a model - much like java's Collections.synchronizedSet approach.

However, for the actual RDB implementation we of course have a real database so transactions are appropriately isolated. By default we run in autocommit mode - so every access and update is a single transaction. However, you can use Model.begin()/.abort()/.commit() to batch actions together into a transaction. In that case we ask the database to support the TRANSACTION_READ_COMMITTED level of isolation. This means you should not see dirty reads.

This works straightforwardly for the raw RDF API that I reported on in my earlier message (listing statements matching patterns, following pointers etc).

Now the query languages like Andy's RDQL build higher level queries out of sequences of lower level queries, so in that case it is true that if part of a query runs, then you delete some statements, then the rest of the query runs, the second part of the query might find that some of the resources it was about to check have now disappeared - problem. So that is the time when that higher level query would need to be wrapped in a transaction. I suspect that for that to work safely in all cases we would need at least TRANSACTION_REPEATABLE_READ and possibly TRANSACTION_SERIALIZABLE - off the top of my head I don't know which database/jdbc-driver combinations support that level of transaction isolation and what the performance impact would be. [Actually currently the transaction isolation level is not trivially user settable but it would be easy to make it so.]

Of course if you only add statements, not delete them, it should all just work anyway.

One final point on this: how much of a problem any of this is for you depends on your overall architecture.
In our ePerson work, as I was saying on the telecon, we run our RDF data sources as separate servers and the clients pull back data in largish blocks and process locally. The data source API we use means that queries and updates are packaged as single units anyway - we can transparently switch between in-memory, file based and database-backed server implementations trivially and still maintain consistency.

Dave
Received on Friday, 19 July 2002 05:23:21 UTC