Re: [BioRDF] Scalability

Hi,

  We recently implemented RDF-based queries over BioPAX-formatted
pathway data (pkb.stanford.edu) and echo the sentiments about query and
storage technologies.  In our case, scalability and performance depend
more on the complexity of the query and the RDF model than on parsing
and sending resources across the wire.

  Our current RDF store is Oracle-based and holds over a million triples.
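
For a sense of the kind of query where complexity matters more than
data transfer, here is a rough, illustrative sketch (an rdflib-style
API; the namespace, property names and file name are made-up stand-ins
for BioPAX terms, not our actual schema):

    # Illustrative only: a multi-join pathway query of the sort whose
    # complexity, rather than wire transfer, drives response times.
    from rdflib import Graph

    g = Graph()
    g.parse("pathways.rdf")   # hypothetical RDF/XML export

    q = """
    PREFIX bp: <http://example.org/biopax#>
    SELECT ?pathway ?reaction
    WHERE {
      ?pathway  bp:pathwayComponent ?reaction .
      ?reaction bp:participant      ?protein .
      ?protein  bp:name             "TP53" .
    }
    """
    for row in g.query(q):
        print(row)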

Nikesh

On 4/4/06, Matt Halstead <matt.halstead@auckland.ac.nz> wrote:
>
>
> I've had problems with the size of in-memory RDF graphs, where we
> are operating at around a million triples in our database; but my
> conclusions about scalability are a little different from yours, so I
> will add them here:
>
> 1) Purely in-memory representations of graphs, without a backend
> store, are always going to be a problem, and that's not RDF's fault.
> Holding in-memory handles onto a backend data store is the more usual
> pattern in database development. Many RDF libraries now support such
> a mechanism, but it is still early days, and a lot of work remains on
> query caching and lazy evaluation to optimise them. I think this is
> an inherent problem for any database that supports the kind of
> sophisticated reasoning that RDF and derivative languages such as OWL
> allow.
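>
> As a minimal sketch of that pattern (assuming rdflib's Berkeley
> DB-backed "Sleepycat" store plugin; the path is made up), a
> store-backed graph rather than a purely in-memory one might look
> like:
>
>     # Sketch: keep triples in a persistent backend store, not in RAM.
>     from rdflib import Graph
>
>     g = Graph(store="Sleepycat")          # assumed store plugin name
>     g.open("/var/data/rdfstore", create=True)
>     # ...parse/add/query as usual; triples live on disk, not in memory
>     g.close()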
>
> 2) Parsing larger RDF/XML models, for instance the RDF/XML export of
> our database, is a pain because not much work has gone into streaming
> parsers for RDF; in most cases you need at least two passes over an
> RDF/XML document to resolve late binding of resource identifiers, or
> some temporary layer in between in your own code. Most people take
> the shortcut and load the entire XML model into memory. This is an
> XML parsing problem, not an RDF one, and really, considering the
> open-world nature of RDF and that we may build a context by loading
> many RDF/XML resources from different places, I think more work on
> efficient stream parsing straight into a good backend store is
> necessary.
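>
> As an illustration (a rough sketch only, assuming an rdflib-style API
> and a line-oriented N-Triples dump rather than RDF/XML, which is what
> makes streaming easy), loading a large file into a store-backed graph
> in batches might look like:
>
>     # Stream a large N-Triples dump into a graph in fixed-size batches
>     # instead of materialising the whole file in memory. The paths and
>     # batch size are made up.
>     from rdflib import Graph
>
>     def load_ntriples_in_batches(path, graph, batch_size=10000):
>         batch = []
>         with open(path) as f:
>             for line in f:
>                 batch.append(line)
>                 if len(batch) >= batch_size:
>                     graph.parse(data="".join(batch), format="nt")
>                     batch = []
>         if batch:
>             graph.parse(data="".join(batch), format="nt")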
>
> 3) Updating models. RDF diffs are hard - hard enough, at least, that
> implementations are only now beginning to get good attention. So at
> the moment the best way to update an RDF model is often to replace
> the entire model with the one containing the changes, which can mean
> very large, time-consuming imports into back-end data stores just to
> update a minor amount of content. I have written various diff
> utilities that are quite model-centric - i.e. they assume the
> existence of various unique identifiers - so they aren't very
> generalised over RDF, but the speedup in being able to manipulate
> models was huge (seconds as opposed to hours).
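>
> For example, a naive, model-centric diff (a sketch assuming rdflib
> and URI-identified nodes - blank nodes would break it) can be as
> simple as a set difference over triples:
>
>     # Treat each graph as a set of triples, compute what was added
>     # and removed, then apply only that delta to the store.
>     from rdflib import Graph
>
>     def naive_diff(old, new):
>         removed = set(old) - set(new)
>         added = set(new) - set(old)
>         return added, removed
>
>     def apply_diff(store_graph, added, removed):
>         for t in removed:
>             store_graph.remove(t)
>         for t in added:
>             store_graph.add(t)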
>
> 4) Resources across the wire. I agree with Ora: compression works
> exceedingly well on XML. SVG is quite a good technology to look at in
> this respect - its XML datasets can be massive, yet the user
> experience requires fast turnaround for reading and publishing XML.
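>
> For instance, a quick way to see the effect (plain Python; the file
> name is made up):
>
>     # Gzip an RDF/XML export and compare sizes; verbose XML usually
>     # compresses very well.
>     import gzip, os, shutil
>
>     src = "export.rdf"
>     with open(src, "rb") as fin, gzip.open(src + ".gz", "wb") as fout:
>         shutil.copyfileobj(fin, fout)
>
>     print(os.path.getsize(src), "->", os.path.getsize(src + ".gz"))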
>
> 5) Query and storage technologies. I find it unlikely that one
> database implementation will fulfil all our querying needs,
> especially where our RDF data are instances of an OWL model. It is
> quite likely that we will need multiple replicas of our data in
> different stores, with different engines in front of them, given
> that I think common backend data stores and schemas for semantic web
> technologies are a long way off. Replicating data from a primary
> source is obviously a scaling issue unless good diff and update
> mechanisms are available to let us create small packets of
> information with which to update all the stores we need to support.
> I have already alluded to the problem of non-optimised query systems
> above.
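>
> As a sketch of that fan-out (assuming rdflib; plain in-memory graphs
> stand in for the store-backed replicas, and the triple is made up),
> one small diff packet can update every replica instead of
> re-importing full models:
>
>     # Apply one (added, removed) packet to each replicated store.
>     from rdflib import Graph, URIRef
>
>     added = {(URIRef("http://example.org/a"),
>               URIRef("http://example.org/p"),
>               URIRef("http://example.org/b"))}
>     removed = set()
>
>     replicas = [Graph(), Graph()]   # stand-ins for store-backed graphs
>     for g in replicas:
>         for t in removed:
>             g.remove(t)
>         for t in added:
>             g.add(t)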
>
> To me these are all fairly standard problems that we have dealt with
> in other systems, and it's really a matter of hard work and focus to
> bring some of these tools into a more production-oriented world.
>
> cheers
> Matt
>
>
> On 5/04/2006, at 4:34 AM, Cutler, Roger (RogerCutler) wrote:
>
> >
> > Somewhere down near the bottom of the lengthy thread that started
> > with a query about ontology editors, someone casually mentioned
> > that 53 MB of data that was "imported" -- from which I infer it was
> > not binary, compressed data but in some sort of text format --
> > turned into over 800 MB of RDF.  Frankly, a factor of 15 in size,
> > possibly starting from a format that is fairly large to begin with,
> > worries me.  There have since been some comments that sound as
> > though people think they are going to deal with this by generating
> > RDF only on the fly, as needed.  It seems to me, given the
> > networked nature of RDF, that this is likely to have problems of
> > its own.  None of the solutions I am aware of that are actually in
> > operation work this way, but I will freely admit that my experience
> > level here is pretty low.
> >
> > It seems to me that there are at least three ways that one might
> > try to cope with this issue:
> >
> > 1 - Generate the RDF on-the-fly (as I said, I'm personally dubious
> > about this one).
> >
> > 2 - Make the RDF smaller somehow (maybe by making the URIs
> > shorter, à la tinyurl? -- see the sketch after this list).
> >
> > 3 - Limit the amount of information that is actually put into RDF to
> > some sort of descriptive metadata and keep pointers to the real data,
> > which is in some other format.
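> >
> > As a rough sketch of option 2 (assuming rdflib; the namespace and
> > terms are made up), simply binding prefixes and serialising the
> > same triples as N3/Turtle rather than RDF/XML already compacts
> > things noticeably:
> >
> >     # Compare the size of the same graph in two serialisations.
> >     from rdflib import Graph, Namespace
> >
> >     EX = Namespace("http://example.org/pathways#")
> >     g = Graph()
> >     g.bind("ex", EX)
> >     g.add((EX.reaction1, EX.hasParticipant, EX.proteinA))
> >
> >     print(len(g.serialize(format="xml")), "bytes as RDF/XML")
> >     print(len(g.serialize(format="n3")), "bytes as N3")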
> >
> > I think that the third approach is what I have seen done, but I
> > get the impression that people in this group may not be thinking in
> > this way.
> >
> > I've prefaced this with [BioRDF] because there has already been
> > some discussion of scalability in that context, and I believe this
> > issue has recently been upgraded in the deliverables of that
> > subgroup.
> >
> > Incidentally, what happened to the BioRDF telcons on Monday?  I was
> > on vacation for a while, and when I came back the call didn't seem
> > to be there any more.
> >


--
Nikesh Kotecha
Biomedical Informatics
Stanford University

Email: nikesh@stanford.edu

Received on Wednesday, 5 April 2006 09:16:09 UTC