Re: RDF and its discontents from Paul Houle on 2010-07-02 (public-lod@w3.org from July 2010)

From: Paul Houle <ontology2@gmail.com>
Date: Fri, 2 Jul 2010 12:55:32 -0400
To: Henry Story <henry.story@gmail.com>
Cc: Linked Data community <public-lod@w3.org>
Message-ID: <AANLkTimWron5JhxEzj0L5FLEBj-POEuV0j0T9GPbJZP6@mail.gmail.com>

On Fri, Jul 2, 2010 at 11:20 AM, Henry Story <henry.story@gmail.com> wrote:

>
>
> So similarly with RDF stores. Is it not feasible that one may come up with
> just-in-time storage mechanisms, where the triple store could start
> analysing how the data was used in order then to optimise the layout of
> the data on disk?  Perhaps it could end up being a lot more efficient than
> what a human DB engineer could do in that case.
>
>
That's a nice research project, and it could be very valuable if it's
perfected.  Salesforce.com has a patent on something that's pretty similar:

http://www.faqs.org/patents/app/20090276395

I attended a talk at Dreamforce last year where they described how their
system works.

To a developer, Salesforce.com offers something that looks a lot like a
relational database.

Their customers are spread out over about 10 distinct Oracle 10g clusters;
each of these has a central "fact" table which is essentially a triple/quad
store.  "Rows" seen from the customer's perspective are actually atomized
into individual triples.  The core table, however, has additional tags
which identify each triple as belonging to a particular Salesforce.com
customer.  This way, 10,000-100,000 customers might share an 'instance' of
the Salesforce.com system.
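
To make the shared fact table concrete, here is a minimal sketch in Python
with SQLite.  The table name, column names and tenant names are all made up
for illustration; this is not Salesforce.com's actual schema, only the
general idea of atomizing rows into triples that each carry a customer tag.

```python
import sqlite3

# Hypothetical schema: one shared "fact" table, every triple tagged with a tenant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact (tenant_id TEXT, row_id TEXT, field TEXT, value TEXT)"
)

def insert_row(tenant_id, row_id, record):
    """Atomize one customer-visible row into one triple per field."""
    conn.executemany(
        "INSERT INTO fact VALUES (?, ?, ?, ?)",
        [(tenant_id, row_id, field, str(value)) for field, value in record.items()],
    )

def fetch_row(tenant_id, row_id):
    """Reassemble a row, seeing only the triples that belong to this tenant."""
    cur = conn.execute(
        "SELECT field, value FROM fact WHERE tenant_id = ? AND row_id = ?",
        (tenant_id, row_id),
    )
    return dict(cur.fetchall())

# Two tenants can reuse the same row id without seeing each other's data.
insert_row("acme", "contact-1", {"name": "Alice", "city": "Ithaca"})
insert_row("globex", "contact-1", {"name": "Bob", "city": "Boston"})
print(fetch_row("acme", "contact-1"))
```

Every read filters on the tenant tag, which is what lets many customers
share one physical table.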

Now, to supplement this, Salesforce.com creates additional relational
tables in Oracle that speed up particular queries.  It uses automatic
profiling to decide when it's going to create these tables, create indexes,
etc.
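
A toy version of profile-driven optimization might look like the sketch
below.  It is only a guess at the general shape: the threshold, the
`by_<field>` side tables and the function names are assumptions of mine,
not anything Salesforce.com has described, and the side table is built once
and never kept in sync with later inserts.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (tenant_id TEXT, row_id TEXT, field TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("acme", "c1", "city", "Ithaca"),
        ("acme", "c2", "city", "Boston"),
        ("globex", "c1", "city", "Ithaca"),
    ],
)

THRESHOLD = 3              # hypothetical: optimize a field after 3 lookups
lookup_counts = Counter()  # the "profile": how often each field is filtered on
materialized = set()       # fields that already have an indexed side table

def materialize(field):
    """Copy one field's triples into a narrow, indexed side table."""
    if not field.isidentifier():  # the name is interpolated into SQL below
        raise ValueError(f"unsafe field name: {field!r}")
    table = f"by_{field}"
    conn.execute(f"CREATE TABLE {table} (tenant_id TEXT, row_id TEXT, value TEXT)")
    conn.execute(
        f"INSERT INTO {table} SELECT tenant_id, row_id, value FROM fact WHERE field = ?",
        (field,),
    )
    conn.execute(f"CREATE INDEX idx_{table} ON {table} (tenant_id, value)")
    materialized.add(field)

def find_rows(tenant_id, field, value):
    """Look up row ids, recording the lookup and optimizing hot fields."""
    lookup_counts[field] += 1
    if field not in materialized and lookup_counts[field] >= THRESHOLD:
        materialize(field)
    if field in materialized:
        sql = f"SELECT row_id FROM by_{field} WHERE tenant_id = ? AND value = ?"
        params = (tenant_id, value)
    else:
        sql = "SELECT row_id FROM fact WHERE tenant_id = ? AND field = ? AND value = ?"
        params = (tenant_id, field, value)
    return [row[0] for row in conn.execute(sql, params)]
```

The first two calls to `find_rows` for a field scan the generic fact table;
from the third call on, the same lookup is served from the indexed side
table.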

It's pretty amazing to watch.  I've built a system that communicates with
Salesforce.com via the API.  The first time I run it against a Salesforce
instance, one of the queries it runs times out.  If I run it again
immediately, it times out again.  If I come back in ten minutes, it works
O.K. because the system has analyzed my query and built the structures to
make the query efficient.
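
On the client side, the practical consequence is that an immediate retry is
pointless and a patient one works.  A rough sketch, assuming a hypothetical
`query_fn` callable that raises Python's built-in `TimeoutError` (a real API
client would raise its own exception type):

```python
import time

def run_with_patience(query_fn, attempts=3, wait_seconds=600, sleep=time.sleep):
    """Call query_fn; on a timeout, wait a long while before trying again.

    query_fn is a hypothetical zero-argument callable that runs the query.
    The default wait of 600 seconds mirrors the "come back in ten minutes"
    experience; it is not a documented figure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return query_fn()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the timeout
            sleep(wait_seconds)
```

The `sleep` parameter exists only so the wait can be replaced in tests.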

That said, Salesforce.com is designed for OLTP applications and sucks for
analytical work.  You're only allowed to get information in limited-size
chunks; until very recently there wasn't anything like GROUP BY.  More to
the point, Salesforce.com charges about $1500 per GB of storage per month.
This is affordable for OLTP work, but the semantic work I do involves so
much data that I couldn't possibly afford that.
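
Without GROUP BY, aggregation has to happen on the client, one chunk at a
time.  A small sketch of that pattern follows; `fetch_in_chunks` is a
made-up stand-in for a paged API, not a real Salesforce.com call.

```python
from collections import Counter

def fetch_in_chunks(records, chunk_size):
    """Hypothetical paged source: yields lists of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]

def count_by(chunks, key):
    """Client-side stand-in for SELECT key, COUNT(*) ... GROUP BY key."""
    totals = Counter()
    for chunk in chunks:
        totals.update(record[key] for record in chunk)
    return totals

contacts = [
    {"name": "Alice", "city": "Ithaca"},
    {"name": "Bob", "city": "Boston"},
    {"name": "Carol", "city": "Ithaca"},
]
print(count_by(fetch_in_chunks(contacts, chunk_size=2), "city"))
```

Only the running totals stay in memory, but every record still has to cross
the wire, which is why this is painful for analytical workloads.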

Received on Friday, 2 July 2010 16:56:05 UTC