RE: [BioRDF] Scalability from Miller, Michael D (Rosetta) on 2006-04-04 (public-semweb-lifesci@w3.org from April 2006)

From: Miller, Michael D (Rosetta) <Michael_Miller@Rosettabio.com>
Date: Tue, 4 Apr 2006 09:51:44 -0700
To: "Cutler, Roger (RogerCutler)" <RogerCutler@chevron.com>, public-semweb-lifesci@w3.org
Message-ID: <E1FQolB-0008Iv-3Q@lisa.w3.org>

Hi Roger,

I believe our experience with MAGE-ML can provide some comfort on the
scalability issue.

One thing that greatly alleviates the problem is to use compressing
writers/readers (Java provides nice ones).  Regularly formatted XML
can compress to 2-10% of its original size.
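
As a rough sketch of the compressing writer/reader idea, using the
standard java.util.zip GZIP streams.  This is not code from any MAGE-ML
toolkit: the class name, method names and the sample XML element names
are made up for illustration, and the compression ratio you get will
depend on your data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipXmlSketch {

    // Write XML text through a compressing writer and return the gzip bytes.
    static byte[] writeCompressed(String xml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer also finishes the gzip stream.
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            out.write(xml);
        }
        return bytes.toByteArray();
    }

    // Read the XML text back through a decompressing reader.
    static String readCompressed(byte[] gzipBytes) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipBytes)),
                StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    // Build regularly formatted XML; the element names are invented.
    static String sampleXml(int rows) {
        StringBuilder xml = new StringBuilder("<measurements>\n");
        for (int i = 0; i < rows; i++) {
            xml.append("  <measurement feature=\"F").append(i)
               .append("\" signal=\"").append(i * 37 % 1000)
               .append("\"/>\n");
        }
        return xml.append("</measurements>\n").toString();
    }
}
```

In real use the ByteArrayOutputStream would be replaced by a
FileOutputStream (and the ByteArrayInputStream by a FileInputStream),
so the file on disk is compressed while the code that produces and
parses the XML stays unchanged.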

> 3 - Limit the amount of information that is actually put into RDF to
> some sort of descriptive metadata and keep pointers to the real data,
> which is in some other format.

MAGE-ML has the ability to reference the external data from the
microarray feature extractor software and this has worked well.  Also,
the information can be split across several files, with references in
one file pointing to the actual definitions in another.
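
The general pattern (small descriptive record, pointer to the bulk
data) might look roughly like the following.  This is only a sketch
under assumptions: it does not reproduce MAGE-ML's actual elements or
API, and every class, field and method name here is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration of "metadata plus pointer to the real data".
final class ExternalDataSketch {

    // Small descriptive record; the bulk measurements stay in the
    // external file and are only read when someone asks for them.
    static final class DataSetDescription {
        final String identifier; // assumed identifier scheme
        final String format;     // e.g. "tab-delimited"
        final Path dataFile;     // pointer to the real data

        DataSetDescription(String identifier, String format, Path dataFile) {
            this.identifier = identifier;
            this.format = format;
            this.dataFile = dataFile;
        }

        // Load the external rows on demand rather than embedding them.
        List<String> loadRows() throws IOException {
            return Files.readAllLines(dataFile, StandardCharsets.UTF_8);
        }
    }
}
```

Only the description would need to be serialized (as XML or RDF); the
feature extractor output stays in its own format and is reached through
the stored path.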

But, surprisingly, even uncompressed large XML files have not been all
that much of an issue.

cheers,
Michael

> -----Original Message-----
> From: public-semweb-lifesci-request@w3.org 
> [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of 
> Cutler, Roger (RogerCutler)
> Sent: Tuesday, April 04, 2006 9:35 AM
> To: public-semweb-lifesci@w3.org
> Subject: [BioRDF] Scalability
> 
> 
> 
> Somewhere down near the bottom of the lengthy thread that started with
> a query about ontology editors, someone casually mentioned that 53 Mby
> of data that was "imported" -- from which I infer it was not binary,
> compressed data but in some sort of text format -- turned into over
> 800 Mby of RDF.  Frankly, a factor of 15 in size, possibly from a
> format that is fairly large to start out with, worries me.  There have
> since been some comments that sound like people think that they are
> going to deal with this by generating RDF only on-the-fly, as needed.
> It seems to me, given the networked nature of RDF, that this is likely
> to have its own problems.  None of the solutions of which I am aware
> that actually are in operation work this way, but I will freely admit
> that my experience level here is pretty low.
> 
> It seems to me that there are at least three ways that one might try
> to cope with this issue:
> 
> 1 - Generate the RDF on-the-fly (as I said, I'm personally dubious
> about this one).
> 
> 2 - Make the RDF smaller somehow (maybe by making the URIs shorter, a
> la tinyurl???)
> 
> 3 - Limit the amount of information that is actually put into RDF to
> some sort of descriptive metadata and keep pointers to the real data,
> which is in some other format.
> 
> I think that the third approach is what I have seen done, but I get
> the impression that people may not be thinking in this way in this
> group.
> 
> I've prefaced this [BioRDF] because there has already been some
> discussion of scalability in that context and I believe that this
> issue has recently been upgraded in the deliverables of this subgroup.
> 
> Incidentally, what happened to the BioRDF telcons on Monday?  I was on
> vacation for a while and when I came back it didn't seem to be there.
> 
> 
> 
>

Received on Tuesday, 4 April 2006 16:52:07 UTC