Re: SPARQL performance for ORDER BY on large datasets

> Sure, it's understandable (AFAIK) that this specific case is much
> easier to use an SQL DB for (or something like CouchDB). But I was

CouchDB 'cheats' by defining queries in advance, updating result sets as data is added/removed from DB

this is a far cry from 'ad-hoc SPARQL' but more in-line with typical Web use-cases, by far (imo)

> surprised that it was *this* terrible. It seems unfeasible to build
> e.g. chronological feeds from larger RDF stores using SPARQL with
> current tools.

i found chronological sorts key to typical web use cases,

and solved the problem by 'fractal decomposition' of URIs into sorted filesystem tries (i use btrfs and reiser4, 256 bytes or so per node, virtuoso ues ~128 per triple),

alongside URI-fication of literals,

eg index/po/.../org/purl/dc/1.1/modified/.../u/2009/08/26/.../org/myblog/posts/hey_this_is_cool
eg index/po/.../org/purl/dc/1.1/modified/.../u/2009/08/25/.../com/lustre/2009/08/treehouses_for_sale

so a simple depth-first (with optional range offset) traversal gets me the posts i need:

  def take s=Infinity,t=true,e=true,v=:desc,o=nil # size, leafs only?, use terminators?, direction, startpoint
    i=to_s.size;l=false;r=[];v,m={asc: [:id,:>=], desc: [:reverse,:<=]}[v]
    a=->n{s=s-1;r.push n}
    g=->b{b.sort_by(&:to_s).send(v).each{|n|
        return if 0 >= s
        (l || !o || n.to_s[i..i+o.size-1].send(m,o[0..(n.to_s.size - i - 1)])) && # inside
        (if !(c=n.c(e)).empty?
          g.(c) # children
          a.(n) unless t # branch
        else
          a.(n); l=true unless l #leaf
        end)}}
    g.(c(e)); r
  end

 (the rest of the code @ http://repo.or.cz/w/element.git

the 'planet' exampe fetches intertwingly,haskell,mozilla and other blogs, and can page thru 100s of posts at a time no prob including HTML rendering (using my 64 GB samsung SSD anywyas)

adding datetime to resource URI (w3 /2009/08/26 suggestion, adding hh:mm:ss: when necessary) enables filtering of results within a 'join' 

eg index/po/.../org/w3/1999/rdfs/type/.../info/chatterbox/tweet/.../org/myname/tweets/2009/08/

of course POST handlers or Couch style precalculation can go even crazier..


basiclaly, most of the semweb zealots wlll tell you 'URIs are Opaque'. and thats fine, theres plenty of stores that will treat them as so.

but as soon as you ewant to find the 'directory children' or have to ru a REGEX(str()) on every URI in the system, you might end up in my situation (sounds like you did)



i found Virtuoso satisfact




> 
> 
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
> 
> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> service builders in general opt out to indexing in something else
> entirely in these cases?
> 
> 
> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)
> 
> Best regards,
> Niklas Lindström

Received on Wednesday, 26 August 2009 20:44:08 UTC