Re: [BioRDF] Scalability from Matt Halstead on 2006-04-04 (public-semweb-lifesci@w3.org from April 2006)

From: Matt Halstead <matt.halstead@auckland.ac.nz>
Date: Wed, 5 Apr 2006 10:34:25 +1200
To: "Cutler, Roger (RogerCutler)" <RogerCutler@chevron.com>
Cc: public-semweb-lifesci@w3.org
Message-Id: <842498A8-F728-4C50-B16A-8CD6CACBC2BE@auckland.ac.nz>
I've had problems with the size of RDF graphs in memory where we are  
operating at around 1 million triples for our database; but my  
conclusions about scalability are a little different from yours, so I  
will add them here:

1) In-memory representations of graphs that do not use a backend
store are always going to be a problem; that is not RDF's fault.
Integrating in-memory handles with backend data stores is the more
normal case in database development. Many RDF libraries now support
such a mechanism, but it is still very early days and there is a lot
of work still to be done on caching queries and on using lazy
evaluation to optimise them. I think this problem is inherent in any
database that supports sophisticated reasoning of the kind that RDF
and derivative languages such as OWL allow.
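
As a rough illustration of the kind of handle described in point 1,
here is a minimal Python sketch. It is not taken from any particular
RDF library: it assumes a single SQLite table of triples, and the
class name TripleStore and its methods are hypothetical. Pattern
matches are yielded lazily from the database cursor, and a small
dictionary caches repeated lookups until the next write.

```python
import sqlite3
from typing import Dict, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


class TripleStore:
    """Hypothetical in-memory handle onto triples kept in SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS triples "
            "(s TEXT, p TEXT, o TEXT, UNIQUE (s, p, o))"
        )
        # Cache of (subject, predicate) -> objects, cleared on every write.
        self._cache: Dict[Tuple[str, str], Tuple[str, ...]] = {}

    def add(self, s: str, p: str, o: str) -> None:
        self.db.execute("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", (s, p, o))
        self.db.commit()
        self._cache.clear()  # crude invalidation: any write empties the cache

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Lazily yield triples matching the pattern; None is a wildcard."""
        clauses, params = [], []
        for column, value in (("s", s), ("p", p), ("o", o)):
            if value is not None:
                clauses.append(f"{column} = ?")
                params.append(value)
        sql = "SELECT s, p, o FROM triples"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        # Rows are pulled from the cursor one at a time as the caller iterates.
        yield from self.db.execute(sql, params)

    def objects(self, s: str, p: str) -> Tuple[str, ...]:
        """Cached lookup of all objects for a subject/predicate pair."""
        key = (s, p)
        if key not in self._cache:
            self._cache[key] = tuple(obj for _, _, obj in self.match(s=s, p=p))
        return self._cache[key]
```

Clearing the whole cache on every write is the simplest policy that
stays correct; a real store would want finer-grained invalidation,
which is exactly the sort of work that still needs doing.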

2) Parsing larger RDF/XML models, for instance the RDF/XML export of
our database, is a pain because there has not been a lot of work put
into streaming parsers for RDF; in most cases you need at least two
passes over an RDF/XML document to resolve late binding of resource
identifiers, or some temporary layer in between in your code. Most
people take the shortcut and load the entire XML model into memory.
This is an XML parsing problem, not an RDF one, and really,
considering the open world nature of RDF and that we may build a
context by loading many RDF/XML resources from different places, I
think more work on efficient stream parsing straight into a good
backend store is necessary.
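
A minimal sketch of streaming straight into a backend store, assuming
a deliberately restricted subset of RDF/XML: flat rdf:Description
elements with absolute rdf:about URIs, whose property elements carry
either an rdf:resource attribute or literal text. Under that
assumption each triple can be written as soon as its element closes,
so one pass is enough. Blank nodes, rdf:ID, relative URIs, typed
nodes and nested descriptions are not handled and would need the
extra resolution step mentioned above. The function name
stream_rdfxml and the table layout are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = f"{{{RDF_NS}}}Description"
ABOUT = f"{{{RDF_NS}}}about"
RESOURCE = f"{{{RDF_NS}}}resource"


def clark_to_uri(tag):
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    namespace, _, local = tag[1:].partition("}")
    return namespace + local


def stream_rdfxml(source, db, batch_size=1000):
    """Stream triples from `source` into `db`; return how many were written."""
    db.execute("CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
    batch, total = [], 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of rdf:RDF
    for event, elem in context:
        # Only act once a whole rdf:Description has been read.
        if event != "end" or elem.tag != DESCRIPTION:
            continue
        subject = elem.get(ABOUT)
        for prop in elem:
            obj = prop.get(RESOURCE)
            if obj is None:
                obj = (prop.text or "").strip()  # treat as a plain literal
            batch.append((subject, clark_to_uri(prop.tag), obj))
        root.clear()  # drop the subtrees that have already been processed
        if len(batch) >= batch_size:
            db.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
            total += len(batch)
            batch.clear()
    # Write whatever is left over from the last partial batch.
    db.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
    db.commit()
    return total + len(batch)
```

Memory use is bounded by the batch size and the largest single
description rather than by the size of the document, which is the
point of avoiding the load-everything shortcut.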

3) Updating models. RDF diffs are hard; well, hard enough that
implementations of them are only really beginning to get some good
attention. So at the moment, often the best way to update an RDF
model is to replace the entire model with the one that has changes.
This can lead to very large and time-consuming imports into back-end
data stores just to update some minor amount of content. I have
written various diff utilities that are quite model centric - i.e.
they assume the existence of various unique identifiers - so that
means they aren't very RDF generalised, but the speedup in being able
to manipulate models was huge (seconds as opposed to hours).
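
To show why unique identifiers make this tractable, here is a minimal
sketch; it is not the utilities mentioned above. It assumes every
node carries a stable identifier (no blank nodes), so a model can be
treated as a plain set of (subject, predicate, object) tuples and a
diff reduces to two set differences. The names Delta, diff_models and
apply_delta are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    """Triples to add and to remove to get from one model to another."""
    added: Set[Triple]
    removed: Set[Triple]


def diff_models(old: Iterable[Triple], new: Iterable[Triple]) -> Delta:
    """Compare two models, assuming identifiers are stable across versions."""
    old_set, new_set = set(old), set(new)
    return Delta(added=new_set - old_set, removed=old_set - new_set)


def apply_delta(model: Set[Triple], delta: Delta) -> Set[Triple]:
    """Return a new model with the delta applied; the input is not modified."""
    return (model - delta.removed) | delta.added
```

With blank nodes this no longer works, because the same node can
carry different labels in the two versions and the comparison becomes
a graph matching problem; that is the part that makes general RDF
diffs hard.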

4) Resources across the wire. I agree with Ora: compression works
exceedingly well on XML. SVG is quite a good technology to look at in
this respect - their XML datasets can be massive, and the user
experience is one that requires fast turnaround for XML reading and
publishing.
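
To get a feel for the effect, the following sketch (not a benchmark)
gzips a synthetic, highly repetitive RDF/XML document using Python's
standard gzip module. The sample data and the helper names
sample_rdfxml and compression_ratio are invented for illustration;
real ratios depend entirely on the data.

```python
import gzip


def sample_rdfxml(n_records):
    """Build a synthetic, repetitive RDF/XML document (made-up data)."""
    header = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:ex="http://example.org/ns#">\n'
    )
    body = "".join(
        f'  <rdf:Description rdf:about="http://example.org/record/{i}">\n'
        f"    <ex:label>Record {i}</ex:label>\n"
        f"  </rdf:Description>\n"
        for i in range(n_records)
    )
    return (header + body + "</rdf:RDF>\n").encode("utf-8")


def compression_ratio(data):
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


if __name__ == "__main__":
    doc = sample_rdfxml(10_000)
    print(f"{len(doc)} bytes, gzip ratio {compression_ratio(doc):.1f}x")
```

The repeated element names and namespace URIs that make RDF/XML
verbose are also what a dictionary-based compressor handles best.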

5) Query and storage technologies. I find it unlikely that one
database implementation will fulfil all our querying needs,
especially where our RDF data are instances of an OWL model. It is
quite likely that we will need multiple replications of our data in
different stores with different engines in front of them, given that
I think common backend data stores and schemas for semantic web
technologies are a long way off. The replication of data from a
primary source is obviously a scaling issue unless good diff and
update mechanisms are available to allow us to create small packets
of information to update all the stores we need to support. I have
already alluded to the case of non-optimised query systems above.
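
As a sketch of what updating several stores from one small packet
could look like, the following assumes each replica is an SQLite
database holding a single triples table, and that a packet is simply
a list of triples to remove plus a list to add. Real replicas would
sit behind different engines; the names open_store, apply_packet and
replicate are hypothetical.

```python
import sqlite3


def open_store():
    """Create an empty in-memory replica (stand-in for a real backend)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT, UNIQUE (s, p, o))")
    return db


def apply_packet(db, added, removed):
    """Apply one update packet to one store inside a single transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.executemany(
            "DELETE FROM triples WHERE s = ? AND p = ? AND o = ?", removed
        )
        db.executemany("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", added)


def replicate(stores, added, removed):
    """Send the same packet to every replica in turn."""
    for db in stores:
        apply_packet(db, added, removed)
```

The cost of an update then scales with the size of the packet and the
number of stores, not with the size of the model.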

To me these are all quite standard problems we have dealt with in
other systems, and it's really a matter of hard work and focus to
bring some of these tools into a more production-oriented world.

cheers
Matt


On 5/04/2006, at 4:34 AM, Cutler, Roger (RogerCutler) wrote:

>
> Somewhere down near the bottom of the lengthy thread that started with a
> query about ontology editors, someone casually mentioned that 53 Mby of
> data that was "imported" -- from which I infer it was not binary,
> compressed data but in some sort of text format -- turned into over 800
> Mby of RDF.  Frankly, a factor of 15 in size, possibly from a format
> that is fairly large to start out with, worries me.  There have since
> been some comments that sound like people think that they are going to
> deal with this by generating RDF only on-the-fly, as needed.  It seems
> to me, given the networked nature of RDF, that this is likely to have
> its own problems.  None of the solutions of which I am aware that
> actually are in operation work this way, but I will freely admit that my
> experience level here is pretty low.
>
> It seems to me that there are at least three ways that one might try to
> cope with this issue:
>
> 1 - Generate the RDF on-the-fly (as I said, I'm personally dubious about
> this one).
>
> 2 - Make the RDF smaller somehow (maybe by making the URIs shorter, a
> la tinyurl???)
>
> 3 - Limit the amount of information that is actually put into RDF to
> some sort of descriptive metadata and keep pointers to the real data,
> which is in some other format.
>
> I think that the third approach is what I have seen done, but I get the
> impression that people may not be thinking in this way in this group.
>
> I've prefaced this [BioRDF] because there has already been some
> discussion of scalability in that context and I believe that this issue
> has recently been upgraded in the deliverables of this subgroup.
>
> Incidentally, what happened to the BioRDF telcons on Monday?  I was on
> vacation for a while and when I came back it didn't seem to be there.
>
>
Received on Tuesday, 4 April 2006 22:34:38 UTC