Re: Performance comparison: direct select or regex filter ? from Paul Gearon on 2010-04-29 (public-sparql-dev@w3.org from April to June 2010)

From: Paul Gearon <gearon@ieee.org>
Date: Thu, 29 Apr 2010 14:41:15 -0400
To: Richard Newman <rnewman@franz.com>
Cc: Parsa Ghaffari <parsa.ghaffari@gmail.com>, public-sparql-dev@w3.org
Message-ID: <s2wa25ac1f1004291141p97b16635g42b71968c6a8d320@mail.gmail.com>

On Thu, Apr 29, 2010 at 2:13 PM, Richard Newman <rnewman@franz.com> wrote:
>> It *is* possible for a clever implementation to see that the filter is
>> a very simple operation that can be duplicated with a lookup in an
>> index that sorts lexically. So if your store is the right kind, and
>> you have a clever optimizer, it can be just as efficient as for (1).
>> In reality, I don't believe any optimizers are quite this clever
>> (someone, please tell me if I'm wrong here!) so you won't that level
>> of performance.
>
> Those queries aren't actually equivalent:
>
> * The regex doesn't anchor to the end of the string, so "John Smith Bar"
> would match the regex but not the direct match
> * The regex (in a strict implementation) will only match a simple literal;
> the direct match will match against a lang-tagged literal
>
> ... so it's not possible for an implementation to run the latter exactly as
> quickly as the former, if only because it'll produce different results!

Sorry, I thought it was clear that these were not identical. While the
results will certainly be different (the direct match is looking for a
single literal with an "en" language, while the regex looks for any
literal that starts that way), the performance can be the same.

The reason I say that is from experience with Mulgara. It stores it's
literals lexically, and internally it has operations to search by the
start of the string in a literal. This operation is almost identical
to searching by the entire contents of string literals (it returns
true if the literal being searched for matches and is the same length
or longer. It also needs 2 searches to find the first and last
matching literal). There is no optimization for seeing that the output
of this regex could be found as a lookup in the index, however there
is an extension that lets you ask for all literals that start with a
given string, which is what that particular regex is doing. So if the
query were rewritten to use this proprietary operation, then it would
do what the regex does, and it would do it with the same complexity as
the query that used direct matching.

(yes, it would take *slightly* longer because of the need to search
the strings twice, but the nodes of the tree will all be in cache, and
doing something twice doesn't change the complexity)

> To your point, though (the reason I started writing this email): there's not
> too much motivation for implementors to try to detect and optimize this kind
> of edge case, because nobody should be writing a query in the second form :)

There are reasons for wanting to see all strings that start a
particular way (for instance, you might store identifiers with a form
like "SKU:123456789", and you want to select everything that starts
with "SKU:"), and that regex would let you do it. In Mulgara's case,
we didn't want the expense of a filter for this kind of operation when
the index had what we needed to do it faster, which is why there's the
query extension that lets you do it.

(Actually, the reason it's in Mulgara is not for searching on strings,
but for searching on namespaces in URIs. Since URIs are a kind of
string, the operation was extended to be used on strings as well)

> One thing I've experimented with is detecting some kinds of regexes and
> compiling them into free-text index operations. That's likely a middle
> ground.

I've considered the same, but so far I've been of the opinion that
there are a lot of other areas I need to improve performance in first.
:-)

Paul

Received on Thursday, 29 April 2010 18:41:50 UTC