RE: [BioRDF] Scalability from Cutler, Roger (RogerCutler) on 2006-04-06 (public-semweb-lifesci@w3.org from April 2006)

From: Cutler, Roger (RogerCutler) <RogerCutler@chevron.com>
Date: Thu, 6 Apr 2006 12:42:32 -0500
To: "Susie Stephens" <susie.stephens@oracle.com>, public-semweb-lifesci@w3.org
Message-ID: <0C237C50B244FD44BE47B8DCE23A3052011C6449@HOU150NTXC2MC.hou150.chevrontexaco.net>

No problem.  Getting back to the main subject of the thread, I'm a
little curious whether you've got some Oracle perspective on this issue.
I understand that new Oracle databases are putting RDF into some sort of
triple-store, but I don't know much about the details.  Some questions
that occur to me, but maybe not exactly the right questions:

- Does the RDF just go in as-is or is it compressed in some way?  If
there is a size factor of something like 15 from the data itself, are
these RDF stores tending to be real bulky?

- Is there some sort of indexing and related join-like function?  If so,
what are the performance characteristics?

As I said, I don't have any experience with the RDF stuff, but some
thoughts based on my experience with relational databases:

- Just because you've got your data in an Oracle (or any other) database
doesn't mean you are going to be able to get at it in a performant
manner.  The devil is in the details.

- Operations that initiate a full read of a gigabyte-sized database are
extremely painful.

- Big joins can also be extremely painful.  Would traversing a big bunch
of RDF look something like an incredibly tangled hairball of joins?  If
so, is there a potential problem here?
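To make the join question concrete, here is a minimal sketch that assumes the RDF is held in a single three-column relational table, which may or may not resemble what a real triple store does. Under that assumption, each step along the graph becomes one more self-join on the same table. The table name, index, data, and helper function are hypothetical.

```python
import sqlite3

# Assumption: triples live in one (subject, predicate, object) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical data, invented for illustration.
        ("ex:A", "ex:interactsWith", "ex:B"),
        ("ex:B", "ex:interactsWith", "ex:C"),
        ("ex:B", "ex:interactsWith", "ex:D"),
        ("ex:C", "ex:interactsWith", "ex:E"),
    ],
)
# Without an index like this, every hop would scan the whole table.
conn.execute("CREATE INDEX idx_triples_sp ON triples (s, p)")


def two_hop(conn, start, predicate):
    """Return nodes reachable from `start` in exactly two steps along `predicate`.

    Each hop is one self-join; an n-hop traversal needs n - 1 joins.
    """
    rows = conn.execute(
        """
        SELECT t2.o
        FROM triples AS t1
        JOIN triples AS t2 ON t2.s = t1.o
        WHERE t1.s = ? AND t1.p = ? AND t2.p = ?
        ORDER BY t2.o
        """,
        (start, predicate, predicate),
    ).fetchall()
    return [row[0] for row in rows]


print(two_hop(conn, "ex:A", "ex:interactsWith"))
```

A longer path, or a pattern that branches, adds a join per edge, which is where the "hairball" concern comes from.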

-----Original Message-----
From: Susie Stephens [mailto:susie.stephens@oracle.com] 
Sent: Wednesday, April 05, 2006 5:47 PM
To: Cutler, Roger (RogerCutler)
Subject: Re: [BioRDF] Scalability

Roger,

We didn't have a BioRDF call this week, as it clashed with Bio-IT World.

This was posted on the Wiki
(http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup).

Cheers,

Susie



Cutler, Roger (RogerCutler) wrote:

>Somewhere down near the bottom of the lengthy thread that started with 
>a query about ontology editors, someone casually mentioned that 53 Mby 
>of data that was "imported" -- from which I infer it was not binary, 
>compressed data but in some sort of text format -- turned into over 800
>Mby of RDF.  Frankly, a factor of 15 in size, possibly from a format 
>that is fairly large to start out with, worries me.  There have since 
>been some comments that sound like people think that they are going to 
>deal with this by generating RDF only on-the-fly, as needed.  It seems 
>to me, given the networked nature of RDF, that this is likely to have 
>its own problems.  None of the solutions of which I am aware that 
>actually are in operation work this way, but I will freely admit that 
>my experience level here is pretty low.
>
>It seems to me that there are at least three ways that one might try to
>cope with this issue:
>
>1 - Generate the RDF on-the-fly (as I said, I'm personally dubious 
>about this one).
>
>2 - Make the RDF smaller somehow (maybe by making the URIs shorter, a 
>la tinyurl???)
>
>3 - Limit the amount of information that is actually put into RDF to 
>some sort of descriptive metadata and keep pointers to the real data, 
>which is in some other format.
>
>I think that the third approach is what I have seen done, but I get the
>impression that people may not be thinking in this way in this group.
>
>I've prefaced this [BioRDF] because there has already been some 
>discussion of scalability in that context and I believe that this issue
>has recently been upgraded in the deliverables of this subgroup.
>
>Incidentally, what happened to the BioRDF telcons on Monday?  I was on 
>vacation for a while and when I came back they didn't seem to be there.
>

Received on Thursday, 6 April 2006 17:43:49 UTC