Re: FW: Jena in Haystack from Dave Reynolds on 2002-07-19 (www-rdf-dspace@w3.org from July 2002)

From: Dave Reynolds <der@hplb.hpl.hp.com>
Date: Fri, 19 Jul 2002 16:21:14 +0100
To: Dennis Quan <dquan@mit.edu>
CC: "'BASS,MICK (HP-USA,ex1)'" <mick_bass@hp.com>, karger@theory.lcs.mit.edu, w3c-semweb-ad@w3.org, "'www-rdf-dspace'" <www-rdf-dspace@w3.org>
Message-ID: <3D382E6A.1D329491@hplb.hpl.hp.com>

Dennis Quan wrote:

> > As I said above you have the two choices at present in jena - use explicit
> > reification (in which case you can assign your MD5 URIs to the node which
> > reifies the stating of the statement) or use the
> > statement-referencing-a-statement shortcut in which case the stating is
> > effectively represented as if it were a bNode with some internal UUID
> > identifier. You could then attach your MD5 URI to this bNode via a
> > property.
> >
> > If this latter sounds like what you are looking for then I could hack up
> > some example code.
> 
> The latter would probably not work for us because it would need to occur
> for every statement added into our database. Similarly, the former
> requires the addition of even more statements. However, perhaps the
> latter may be the easiest way to get this to work for us in the short
> term.

Fine - let me know if you still want some sketch code but it sounds like you are
on top of it. 

One thing to bear in mind is that, for the RDB and in-memory
implementations, what seem like extra statements don't necessarily take up
much extra space due to structure sharing - the resources and literals are
shared and not duplicated. So even the full reification overhead of creating 4
statements where you expected 1 plus an ID is actually much less than a factor
of 4 in space terms - the URIs and literals are usually the big things, the
statement table is compact.
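
To make the structure-sharing point concrete, here is a minimal sketch of a toy store that keeps each URI or literal once and records statements as rows of small integer ids. It is not Jena's actual storage layout; the TripleStore class, its method names and the "urn:md5:" identifier scheme are all assumptions made up for illustration.

```python
import hashlib

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class TripleStore:
    """Toy store: node strings are kept once, statements are rows of integer ids."""

    def __init__(self):
        self.node_ids = {}    # node string -> small integer id
        self.nodes = []       # integer id -> node string
        self.statements = []  # rows of (subject_id, predicate_id, object_id)

    def intern(self, node):
        # Store each distinct URI or literal only once and hand back its id.
        if node not in self.node_ids:
            self.node_ids[node] = len(self.nodes)
            self.nodes.append(node)
        return self.node_ids[node]

    def add(self, s, p, o):
        self.statements.append((self.intern(s), self.intern(p), self.intern(o)))

    def reify(self, s, p, o):
        # Hypothetical identifier scheme: an MD5 digest of the statement text.
        digest = hashlib.md5(f"{s} {p} {o}".encode("utf-8")).hexdigest()
        stmt_uri = "urn:md5:" + digest
        # Full reification: four statements describing the stating.
        self.add(stmt_uri, RDF + "type", RDF + "Statement")
        self.add(stmt_uri, RDF + "subject", s)
        self.add(stmt_uri, RDF + "predicate", p)
        self.add(stmt_uri, RDF + "object", o)
        return stmt_uri


store = TripleStore()
store.reify("urn:example:a", "urn:example:p", "first literal")
store.reify("urn:example:b", "urn:example:p", "second literal")
print(len(store.statements), "statement rows,", len(store.nodes), "distinct nodes")
```

In this sketch the rdf: vocabulary and the shared predicate are stored once, so the second reified statement adds four compact rows but only three new strings (its own identifier, subject and object).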

> I completely understand your reasoning for syntactically restricting the
> types of properties supported. <test> is also probably not a well-formed
> URI. However, we have a layer of abstraction above any underlying RDF
> toolkit that treats URIs as a base datatype, so I suppose I can add some
> adapter code into the jena/haystack interface layer to intelligently
> break apart the <urn:...:predicate> format by finding the last colon.

Fine. If we remove the restriction that localnames not be empty then you could
just ignore the split issue and put your entire property names in the namespace
slot.
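
For what it's worth, both options can be sketched in a few lines. This is only an illustration of the adapter logic, not code from jena or Haystack; the function names and the example URN are made up, and the local name produced by the split is not checked for XML name validity.

```python
def split_property_uri(uri):
    """Split a property URI at its last colon into (namespace, localname)."""
    head, sep, tail = uri.rpartition(":")
    if not sep:
        # No colon at all: keep everything in the namespace slot.
        return uri, ""
    return head + sep, tail


def whole_uri_as_namespace(uri):
    """Alternative if empty localnames are allowed: skip the split entirely."""
    return uri, ""


namespace, localname = split_property_uri("urn:example:predicate")
print(namespace, localname)
```

Either way, concatenating the two parts gives back the original URI, which is the property the adapter layer needs to preserve.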

> > If transactions for Berkeley DB are a requirement then we'd need to check
> > with Brian (who is responsible for this sub-system) how feasible that is.
> 
> Can we follow up on this with Brian? My understanding is that the
> Berkeley DB solution may be the only one that will meet our performance
> needs.

Brian is off on much-deserved holiday for two weeks so that discussion will have
to wait a little.

In the meantime check out the performance figures I sent earlier. In my
experience MySQL, for example, is not that much slower than BDB. What I don't
know is whether it can deliver the right levels of transaction isolation and
still deliver that performance. However, this is the same for BDB - if you turn
on full transaction support for BDB then the performance hit will be non-zero
and my guess is that the performance advantage would disappear. No such thing as
a free lunch!

If MySQL can't deliver the performance you need then a write-through-caching
architecture might be worth looking at.
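
By write-through caching I mean something along the following lines. This is a rough sketch under stated assumptions: CountingStore is a stand-in for the real database (it just counts accesses), and both class names and their methods are hypothetical rather than part of any existing toolkit.

```python
class CountingStore:
    """Stand-in for the slow backing database; counts reads and writes."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]

    def write(self, key, value):
        self.writes += 1
        self.data[key] = value


class WriteThroughCache:
    """Every write goes to the backing store first, then into memory;
    reads are served from memory whenever the key is already cached."""

    def __init__(self, backing):
        self.backing = backing
        self.cache = {}

    def put(self, key, value):
        self.backing.write(key, value)  # durable copy is written first
        self.cache[key] = value         # then the in-memory copy

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing.read(key)  # cache miss: fall back to the store
        self.cache[key] = value
        return value


db = CountingStore()
cache = WriteThroughCache(db)
cache.put("stmt-1", ("subject", "predicate", "object"))
cache.get("stmt-1")
print(db.writes, db.reads)
```

The database stays the authoritative copy, so the cache only helps on the read side; write cost and transaction isolation are still whatever the database gives you.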

Dave

Received on Friday, 19 July 2002 11:21:36 UTC