
RE: [BioRDF] Scalability

From: Cutler, Roger (RogerCutler) <RogerCutler@chevron.com>
Date: Thu, 6 Apr 2006 12:42:32 -0500
Message-ID: <0C237C50B244FD44BE47B8DCE23A3052011C6449@HOU150NTXC2MC.hou150.chevrontexaco.net>
To: "Susie Stephens" <susie.stephens@oracle.com>, public-semweb-lifesci@w3.org

No problem.  Getting back to the main subject of the thread, I'm a
little curious whether you've got some Oracle perspective on this issue.
I understand that new Oracle databases are putting RDF into some sort of
triple-store, but I don't know much about the details.  Some questions
that occur to me, but maybe not exactly the right questions:

- Does the RDF just go in as-is or is it compressed in some way?  If
there is a size factor of something like 15 from the data itself, are
these RDF stores tending to be real bulky?

- Is there some sort of indexing and related join-like function?  If so,
what are the performance characteristics?
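On the "compressed in some way" question, one widely used approach in triple stores generally is dictionary encoding: each distinct URI or literal is stored once and the triples themselves hold only small integer IDs. The sketch below is only an illustration under that assumption; it does not describe how Oracle or any particular product stores RDF, and the function name and example.org URIs are made up.

```python
# Hypothetical sketch of dictionary encoding for RDF terms.
# Each distinct term (URI or literal) is stored once; triples hold integer IDs.

def encode_triples(triples):
    """Return (term_to_id, encoded), with every term replaced by an integer ID."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        encoded.append(ids)
    return term_to_id, encoded


# Made-up example data; the URIs are placeholders, not a real vocabulary.
triples = [
    ("http://example.org/protein/P1", "http://example.org/vocab/interactsWith",
     "http://example.org/protein/P2"),
    ("http://example.org/protein/P2", "http://example.org/vocab/interactsWith",
     "http://example.org/protein/P3"),
    ("http://example.org/protein/P1", "http://example.org/vocab/locatedIn",
     "http://example.org/cell/nucleus"),
]

term_to_id, encoded = encode_triples(triples)

# Characters needed to spell out every term in every triple,
# versus spelling out each distinct term once in the dictionary.
raw_chars = sum(len(term) for triple in triples for term in triple)
dict_chars = sum(len(term) for term in term_to_id)
print(encoded)
print(raw_chars, dict_chars)
```

The saving grows with how often the same URIs repeat, which in RDF tends to be often, since predicates and namespace prefixes recur across many triples.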

As I said, I don't have any experience with the RDF stuff, but some
thoughts based on my experience with relational databases:

- Just because you've got your data in an Oracle (or any other) database
doesn't mean you are going to be able to get at it in a performant
manner.  The devil is in the details.

- Operations that initiate a full read of a Gigabyte database are
extremely painful.

- Big joins can also be extremely painful.  Would traversing a big bunch
of RDF look something like an incredibly complex hairball of complex
joins?  If so, is there a potential problem here?
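To make the join question concrete: if triples are kept in a single three-column relational table (an assumption for illustration, not a statement about any vendor's layout), then following a path of N links through the graph becomes N-1 self-joins on that table. A minimal sketch using SQLite from Python, with invented data and table names:

```python
import sqlite3

# Hypothetical layout: one table with subject, predicate, object columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("gene:G1", "encodes", "protein:P1"),
        ("protein:P1", "interactsWith", "protein:P2"),
        ("protein:P2", "locatedIn", "cell:nucleus"),
        ("gene:G2", "encodes", "protein:P3"),
    ],
)
# An index on (predicate, subject) lets each join step look rows up
# instead of scanning the whole table.
conn.execute("CREATE INDEX idx_ps ON triples (p, s)")

# Three-hop path: gene -> protein -> interacting protein -> location.
# Each hop adds one more self-join on the same table.
query = """
SELECT t1.s, t3.o
FROM triples AS t1
JOIN triples AS t2 ON t2.s = t1.o
JOIN triples AS t3 ON t3.s = t2.o
WHERE t1.p = 'encodes'
  AND t2.p = 'interactsWith'
  AND t3.p = 'locatedIn'
"""
rows = conn.execute(query).fetchall()
print(rows)
```

So the "hairball of joins" picture is a fair description of this naive layout: longer or more branching graph patterns mean more self-joins, and performance then depends heavily on indexing and on how selective each step is.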

-----Original Message-----
From: Susie Stephens [mailto:susie.stephens@oracle.com] 
Sent: Wednesday, April 05, 2006 5:47 PM
To: Cutler, Roger (RogerCutler)
Subject: Re: [BioRDF] Scalability


We didn't have a BioRDF call this week, as it clashed with Bio-IT World.

This was posted on the Wiki



Cutler, Roger (RogerCutler) wrote:

>Somewhere down near the bottom of the lengthy thread that started with 
>a query about ontology editors, someone casually mentioned that 53 Mby 
>of data that was "imported" -- from which I infer it was not binary, 
>compressed data but in some sort of text format -- turned into over 800
>Mby of RDF.  Frankly, a factor of 15 in size, possibly from a format 
>that is fairly large to start out with, worries me.  There have since 
>been some comments that sound like people think that they are going to 
>deal with this by generating RDF only on-the-fly, as needed.  It seems 
>to me, given the networked nature of RDF, that this is likely to have 
>its own problems.  None of the solutions of which I am aware that 
>actually are in operation work this way, but I will freely admit that 
>my experience level here is pretty low.
>It seems to me that there are at least three ways that one might try to
>cope with this issue:
>1 - Generate the RDF on-the-fly (as I said, I'm personally dubious 
>about this one).
>2 - Make the RDF smaller somehow (maybe by making the URIs shorter, a 
>la tinyurl???)
>3 - Limit the amount of information that is actually put into RDF to 
>some sort of descriptive metadata and keep pointers to the real data, 
>which is in some other format.
>I think that the third approach is what I have seen done, but I get the
>impression that people may not be thinking in this way in this group.
>I've prefaced this [BioRDF] because there has already been some 
>discussion of scalability in that context and I believe that this issue
>has recently been upgraded in the deliverables of this subgroup.
>Incidentally, what happened to the BioRDF telcons on Monday?  I was on 
>vacation for a while and when I came back it didn't seem to be there.
Received on Thursday, 6 April 2006 17:43:49 UTC
