Re: rdf.rb/spira bulk read question from Ben Lavender on 2011-03-02 (public-rdf-ruby@w3.org from March 2011)

From: Ben Lavender <blavender@gmail.com>
Date: Wed, 2 Mar 2011 10:08:58 -0600
To: Greg Lappen <greg@lapcominc.com>
Cc: Gabor Ratky <gabor@secretsaucepartners.com>, public-rdf-ruby@w3.org
Message-ID: <AANLkTikVnbffok8okCPW1VAG7UocuqbXADt4-iUaYDA0@mail.gmail.com>
On Wed, Mar 2, 2011 at 9:57 AM, Greg Lappen <greg@lapcominc.com> wrote:
> Hmm, I will have to look at Query#execute more closely and think if there's
> something that can be done in CouchDB to make graph queries more efficient.
>  But I was surprised by the iteration, because in my mind, graph queries were
> like a UNION - each pattern could be one query, and the union of resulting
> statements would provide the solutions, resulting in far fewer queries...not
> sure if that's realistic or not though without a closer look at
> Query#execute.

The algorithm is more subtle than that. The patterns return bindings,
not statements, and you can't simply union the results of each pattern:
you need to constrain later patterns with the bindings from earlier
ones, or you'll end up running intermediate patterns that return the
entire repository. You could improve on RDF.rb's performance by having
each pattern take a list of existing bindings and checking them in
Couch, instead of having Ruby iterate over them and re-run the query
for each one, but at a Big O level, someone, somewhere, has to check
all of the existing bindings against later patterns.
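
A minimal sketch of the point above, in plain Ruby over an in-memory
array rather than RDF.rb's actual code. Var, match_pattern and execute
are hypothetical names made up for this illustration, not RDF.rb API.
It shows each later pattern being re-run once per existing binding, and
what the same pattern returns when nothing constrains it:

```ruby
# Toy store: each statement is a [subject, predicate, object] array.
TRIPLES = [
  [:alice, :type, :Person],
  [:alice, :name, "Alice"],
  [:bob,   :type, :Person],
  [:bob,   :name, "Bob"],
  [:acme,  :type, :Company],
  [:acme,  :name, "Acme"]
].freeze

# A query variable; any other term in a pattern is a constant.
Var = Struct.new(:name)

# One "query" against the store: every statement matching the pattern.
def match_pattern(triples, pattern)
  triples.select do |triple|
    pattern.zip(triple).all? { |term, value| term.is_a?(Var) || term == value }
  end
end

# Naive evaluation: for each pattern, substitute every existing binding
# and run one query per binding. Returns [solutions, query_count].
def execute(triples, patterns)
  queries = 0
  solutions = [{}]
  patterns.each do |pattern|
    solutions = solutions.flat_map do |binding|
      # Constrain this pattern with variables bound by earlier patterns.
      bound = pattern.map { |t| t.is_a?(Var) && binding.key?(t.name) ? binding[t.name] : t }
      queries += 1
      match_pattern(triples, bound).map do |triple|
        extended = binding.dup
        bound.zip(triple).each { |t, v| extended[t.name] = v if t.is_a?(Var) }
        extended
      end
    end
  end
  [solutions, queries]
end

patterns = [
  [Var.new(:s), :type, :Person],
  [Var.new(:s), :name, Var.new(:name)]
]
solutions, queries = execute(TRIPLES, patterns)
# solutions => [{ s: :alice, name: "Alice" }, { s: :bob, name: "Bob" }]
# queries   => 3 (one for the first pattern, one per binding for the second)

# Run on its own, the second pattern also returns Acme's name:
unconstrained = match_pattern(TRIPLES, patterns[1])
# unconstrained.size => 3
```

Pushing the list of bindings down into the store would reduce the
number of round trips, but the store would still have to test each
binding against the pattern.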

> RE: Sesame vs. CouchDB, I have been actively investigating other storage
> options, but CouchDB is the only one where the replication is a user-level,
> runtime function. MySQL and MongoDB support master-slave replication, but
> it's a static configuration.  Tokyo Tyrant supports master-master
> replication, but again, at configuration time.
> If we did want a separate SPARQL server, 4store seems to be more scalable,
> although it is self-advertised that way and I haven't verified it.
>
> On Wed, Mar 2, 2011 at 10:49 AM, Ben Lavender <blavender@gmail.com> wrote:
>>
>> Currently, that is a correct understanding. I think we'd be willing to
>> accept a patch that checks if the given queryable has its own
>> implementation of Query#execute and uses that if found, and the
>> default if not. That should maybe even be the default, making
>> Query#execute call out to some method on Queryable that holds the
>> current BGP logic, which implementations can override.
>>
>> OTOH most implementations won't be able to do much better than the
>> default algorithm. It is what it is.
>>
>> If replication is your main goal, I'd suggest that several stores,
>> e.g. Sesame, can quite effectively use MySQL as a backend and you
>> could use that replication.
>>
>> Ben
>>
>> On Wed, Mar 2, 2011 at 9:37 AM, Greg Lappen <greg@lapcominc.com> wrote:
>> > Yes, not only am I using ipublic/rdf-couchdb, I WROTE it!  I'm
>> > pleasantly surprised to find that someone else has tried to use it, ha!
>> > I'd love input on how to make the implementation less naive...I have
>> > implemented the query_pattern method to use couchdb views instead of
>> > iterating over the entire repo, but is there more to it?  I think the
>> > looping behavior on the graph queries is a consequence of the graph
>> > query implementation, not the backend, right?
>> >
>> > On Wed, Mar 2, 2011 at 10:31 AM, Gabor Ratky
>> > <gabor@secretsaucepartners.com> wrote:
>> >>
>> >> Are you using Dan Thomas' rdf-couchdb project?
>> >> (https://github.com/ipublic/rdf-couchdb) I've found the project a naive
>> >> RDF::Repository implementation on top of CouchDB in many ways. Great
>> >> proof of concept with rdf-spec tests passing, but definitely needs
>> >> work, especially in the 'efficient querying' space, IMHO.
>> >> Are you taking a hard dependency on CouchDB in other parts of your
>> >> architecture (like us), or just chose it as an RDF repository?
>> >> Gabor
>> >> On Mar 2, 2011, at 3:20 PM, Greg Lappen wrote:
>> >>
>> >> Hi all,
>> >> We are making good progress with our project, and I've gotten to the
>> >> point where I am storing datasets in our rdf repository (rdf.rb based,
>> >> implemented on couchdb).  Now I'm building a page that allows the data
>> >> to be exported in various formats (xml, csv, etc), but when I iterate
>> >> over all of the data, it is extremely slow.  I see Spira querying the
>> >> repository once for each instance when I iterate using the model's
>> >> "each" method.  I understand why, I'm just wondering if there's a
>> >> faster way to query all of the instances of a Spira class.  One
>> >> thought we had was to use a graph query instead, which would pull out
>> >> all the properties in N queries (where N is the number of properties
>> >> in the class).  In the example I'm trying, this would be 23 queries,
>> >> which is better than hundreds or thousands of queries. Is this as good
>> >> as it gets?  I'm accustomed to working with RDBMS and ActiveRecord, so
>> >> I may just have to shift my expectations a bit, but thought I would
>> >> ask the group if there's something I'm missing....thanks as always,
>> >> Greg
>> >
>> >
>
>
Received on Wednesday, 2 March 2011 16:09:55 UTC