Re: Performance comparison: direct select or regex filter ? from Paul Gearon on 2010-04-29 (public-sparql-dev@w3.org from April to June 2010)

From: Paul Gearon <gearon@ieee.org>
Date: Thu, 29 Apr 2010 12:47:44 -0400
To: Parsa Ghaffari <parsa.ghaffari@gmail.com>
Cc: public-sparql-dev@w3.org
Message-ID: <x2za25ac1f1004290947taaf68707l35ec6caabb0b1db1@mail.gmail.com>

On Thu, Apr 29, 2010 at 12:22 PM, Parsa Ghaffari
<parsa.ghaffari@gmail.com> wrote:
> Hi there,
> How different are these queries in terms of performance:
> 1) SELECT * WHERE { ?x foaf:name "John Smith"@en }
>
> 2) SELECT * WHERE { ?x foaf:name ?y. FILTER regex(?y,"^John Smith") }

This will depend on the exact implementation, but 1 will be faster.

Most databases will index by object. This means that (1) will usually
be a simple index lookup. For a standard system with tree-based
indexes, this will have logarithmic complexity (O(log(N)). For
in-memory systems with hashed indexes, this will be constant (O(1)).

(2) will require a lookup (as in (1)), followed by a filter operation.
This will require N operations (linear or O(N)) multiplied by the
complexity of the regex (which can be ignored for complexity since it
is simple, but will have some real cost). The overall complexity is
linear: O(N). This is much worse than for (1).

It *is* possible for a clever implementation to see that the filter is
a very simple operation that can be duplicated with a lookup in an
index that sorts lexically. So if your store is the right kind, and
you have a clever optimizer, it can be just as efficient as for (1).
In reality, I don't believe any optimizers are quite this clever
(someone, please tell me if I'm wrong here!) so you won't that level
of performance.

There are databases that offer alternatives to (2), which let you
specify that you're searching lexically through an index, thereby
doing an equivalent operation to (2), while offering the performance
of (1). However, that is highly implementation dependent, and the
mechanism for saying what you want in the query is not standardized.

Regards,
Paul Gearon

Received on Thursday, 29 April 2010 16:48:18 UTC