RE: Jena database performance from Brian McBride on 2002-09-04 (www-rdf-dspace@w3.org from September 2002)

From: Brian McBride <bwm@hplb.hpl.hp.com>
Date: Wed, 04 Sep 2002 12:26:10 +0100
To: "Dennis Quan" <dquan@mit.edu>, "'Dave Reynolds'" <der@hplb.hpl.hp.com>
Cc: <www-rdf-dspace@w3.org>
Message-Id: <5.1.0.14.0.20020904121513.02882938@0-mail-1.hpl.hp.com>

Dennis,

I'm a bit surprised, but yes it does.  It uses the writeUTF call to a 
datastream which is restricted to 64Kbyte strings :(

The fix is both hard and easy.  The code change is trivial, but the trivial 
change will be incompatible with existing databases.
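
For illustration, here is a minimal sketch of the limit and of one possible replacement encoding. It is not the Jena BDB code: the class and method names (LongLiteralCodec, writeLongString, readLongString) are hypothetical, and it assumes the replacement is a 4-byte length followed by standard UTF-8 bytes. writeUTF writes a 2-byte length followed by modified UTF-8, so records written one way cannot be read the other way, which is the incompatibility with existing databases.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jena.
public class LongLiteralCodec {

    // Write a 4-byte length followed by the UTF-8 bytes of the string.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read back a string written by writeLongString.
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if DataOutputStream.writeUTF rejects the string as too long.
    static boolean writeUtfFails(String s) {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true; // encoded form exceeds 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encode and decode a string in memory with the length-prefixed format.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeLongString(out, s);
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            return readLongString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++) {
            sb.append('x');
        }
        String big = sb.toString();
        System.out.println("writeUTF rejects 70000 chars: " + writeUtfFails(big));
        System.out.println("length-prefixed round trip ok: " + roundTrip(big).equals(big));
    }
}
```

A real change would also need some way to read or migrate records already written with writeUTF; the sketch does not attempt that.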

Is this a big deal for you?  The BDB implementation isn't well set up for 
such large literals; it will store each one three times.

Brian


At 15:39 03/09/2002 -0400, Dennis Quan wrote:

>Hi Dave,
>
>I have not investigated this too deeply, but it appears that there is a
>64 kilobyte restriction on the length of literals in the Berkeley
>DB-backed Jena implementation. I have observed that the code is throwing
>a java.io.UTFDataFormatException, which is thrown for this reason. If this
>is a limitation, are there any plans to remove it?
>
>Thanks,
>Dennis
>
> > -----Original Message-----
> > From: www-rdf-dspace-request@w3.org [mailto:www-rdf-dspace-request@w3.org]
> > On Behalf Of Dave Reynolds
> > Sent: Friday, August 09, 2002 9:54 AM
> > To: karger@theory.lcs.mit.edu
> > Cc: www-rdf-dspace@w3.org; Nick_Wainwright@HPLB.HPL.HP.COM;
> > dquan@theory.lcs.mit.edu
> > Subject: Re: Jena database performance
> >
> >
> > Hi David,
> >
> > > My intuition tells me that the right cache for our application is a
> > > "graph cache"---namely, a set of resources and the relations incident
> > > on those resources.
> > >
> > >    Also could you provide more details on how those queries are
> > >    generated and then sent to the store?
> > >
> > > This intuition follows from the idea that most of
> > > the queries being issued are of the form "now that I have object X,
> > > give me the resource at the other end of predicate P from X".  For
> > > example, "now that I am holding object X and want to display it,
> > > lookup X.type.  Now that I have T=X.type, find an element that can be
> > > used to display T by finding T.viewers.  etc."   In the presence of an
> > > LRU cache, this would naturally over time cache all the data types
> > > (not very many) and all the viewer elements for those types (also not
> > > very many).
> >
> > Understood. That seems like a good intuition. What would be the easiest
> > way to get statistics or example data to check it out?
> >
> > FYI In our eperson work the application does analogous things; in our
> > case we put the pointer chasing into a single query, for example:
> >   X rdf:type [ex:viewer []]; * [].
> > brings back all the properties of X, including its rdf:type, and for its
> > rdf:type brings back the viewer object. This is one query, over the
> > network, which brings back a whole bunch of RDF statements which the
> > client app can then pull apart.
> > Though in fact in our case the type-to-viewer mapping is done using a
> > display-policy expressed as an RDF graph that we can retrieve all of in
> > one query at client startup.
> >
> > The cost of this is that the client application has to be written so as
> > to exploit these batch queries; essentially we are doing app-specific
> > caching. The advantage is that the store has explicit information on the
> > access patterns which could be used for cache management. A generic cache
> > that worked well enough with just implicit inferred access patterns would
> > simplify some of the client code and would be of general use.
> >
> > I'll be out of email contact for the next two weeks but would like to
> > follow this up more after I return.
> >
> > Dave
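
The LRU intuition quoted above (lookups of the form "given X, follow predicate P") could be sketched roughly as below. This is only an illustration, not anything from Jena or the application under discussion: PropertyLruCache, the string keys, and the backing lookup function are all hypothetical, and it assumes a single value per (subject, predicate) pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache in front of a slower store lookup.
public class PropertyLruCache {
    private final Function<String, String> store; // stand-in for the real store query
    private final LinkedHashMap<String, String> cache;
    private int misses = 0;

    public PropertyLruCache(final int capacity, Function<String, String> store) {
        this.store = store;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    // Return the value at the other end of `predicate` from `subject`.
    public String get(String subject, String predicate) {
        String key = subject + " " + predicate;
        String value = cache.get(key);
        if (value == null) {
            misses++;
            value = store.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Number of lookups that had to go to the backing store.
    public int misses() {
        return misses;
    }
}
```

With a small number of types and viewers, repeated X.type and T.viewers lookups would be served from the cache after the first miss, which is the behaviour the quoted paragraph describes.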

Received on Wednesday, 4 September 2002 07:33:00 UTC