Re: [TF-LIB] IN operator from Ivan Mikhailov on 2010-02-08 (public-rdf-dawg@w3.org from January to March 2010)

From: Ivan Mikhailov <imikhailov@openlinksw.com>
Date: Mon, 08 Feb 2010 17:00:00 +0600
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <1265626800.17678.6404.camel@octo.iv.dev.null>

> IN is a operator with the same precedence as EQ etc.
> 
> Syntax:
>      expr IN ( expr1, expr2, ....)
>      expr NOT IN ( expr1, expr2, ....)

We've implemented IN because it's very convenient for keeping SQL happy,
and have not implemented the NOT IN syntax simply because I was too lazy.
Moreover, our lod.openlinksw.com/sparql endpoint regularly gets queries
with filters like
((?var=value1) || (?var=value2) ||... || (?var=value300)). The OR
operator is boolean in SQL, so the SQL optimizer would die on such a
filter; I had to recognize such subexpressions and rewrite them to the
IN operator first. As a result, the code for the IN operator is in the
SPARQL optimizer even without visible IN syntax in queries.
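
The rewrite described above can be sketched in Python as follows. This
is only an illustration under assumptions: the classes Eq, Or and In and
the function or_chain_to_in are hypothetical names, not the actual
Virtuoso code, and only equalities of one variable against constants
are handled.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical, minimal expression tree (not the real optimizer's types).
@dataclass(frozen=True)
class Eq:
    var: str        # variable name, e.g. "?var"
    value: object   # constant the variable is compared with


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class In:
    var: str
    values: tuple


Expr = Union[Eq, Or, In]


def flatten_or(expr):
    """Return the leaves of a nested OR tree, left to right."""
    if isinstance(expr, Or):
        return flatten_or(expr.left) + flatten_or(expr.right)
    return [expr]


def or_chain_to_in(expr):
    """Rewrite (?v = a) || (?v = b) || ... into ?v IN (a, b, ...).

    The expression is returned unchanged unless every leaf is an
    equality on one and the same variable.
    """
    leaves = flatten_or(expr)
    if len(leaves) < 2 or not all(isinstance(leaf, Eq) for leaf in leaves):
        return expr
    if len({leaf.var for leaf in leaves}) != 1:
        return expr
    return In(leaves[0].var, tuple(leaf.value for leaf in leaves))
```

For example, Or(Or(Eq("?var", 1), Eq("?var", 2)), Eq("?var", 3)) becomes
In("?var", (1, 2, 3)), while an OR over two different variables is left
alone.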

> Semantics:
> 
> Evaluation is equivalent to writing out in long form:
> 
> IN ==>
>   expr =  expr1 || expr = expr2 || ...
>
> NOT IN ==>
>   expr != expr1 && expr != expr2 && ...

+1

>    9 IN (1, 2, 1/0) is error
>    9 NOT IN (1, 2, 1/0) is error

I'd be happy if 9 IN (1, 2, 1/0) is an error or false, and 9 NOT IN
(1, 2, 1/0) is an error or true, depending on the implementation and/or
a roll of the dice.
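
For reference, the strict long-form reading of the quoted examples can
be sketched in Python. It assumes the usual SPARQL treatment of errors
in || and && (true || error is true, false || error is error). The
ERROR sentinel, the use of lambdas to delay evaluation of list items,
and the function names are assumptions made for this illustration only.

```python
ERROR = object()  # sentinel standing in for a SPARQL evaluation error


def eval_item(thunk):
    """Evaluate one list item, mapping division by zero to ERROR."""
    try:
        return thunk()
    except ZeroDivisionError:
        return ERROR


def sparql_in(value, thunks):
    """value = item1 || value = item2 || ... with error propagation."""
    saw_error = False
    for thunk in thunks:
        item = eval_item(thunk)
        if item is ERROR:
            saw_error = True
        elif item == value:
            return True  # a true disjunct wins over any error
    return ERROR if saw_error else False


def sparql_not_in(value, thunks):
    """value != item1 && value != item2 && ... with error propagation."""
    result = sparql_in(value, thunks)
    return ERROR if result is ERROR else not result
```

With items = [lambda: 1, lambda: 2, lambda: 1 / 0], both
sparql_in(9, items) and sparql_not_in(9, items) give ERROR, while
sparql_in(2, items) is true because a matching item outweighs the
error.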

The reason for implementation-specific behavior is that the optimizer
may calculate a constant value of the expression at compile time.
Consider
FILTER (?v1 IN (1, 2, ?v2/?v3))
in a context such that the optimizer has proven that ?v1 is an IRI
and ?v2, ?v3 are numbers. It would be nice to replace the whole
expression with false and wipe out a whole group pattern. Another case
is rewriting IN into an OR of equalities and then rewriting a group
pattern with an OR filter into a UNION of patterns. If the query has a
LIMIT, then the bad branch may stay undetected.
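
A minimal Python sketch of the first case, folding the filter to false
from proven types alone. The type labels and the function
fold_in_by_type are hypothetical; a real optimizer would carry much
richer type information.

```python
# Hypothetical labels for types the optimizer has proven at compile time.
NUMERIC_TYPES = {"integer", "decimal", "double"}


def fold_in_by_type(lhs_type, item_types):
    """Return False if the proven types rule out any match, else None.

    None means "cannot fold, evaluate the IN at run time". The list
    items are never evaluated here, so a division by zero hidden in
    one of them goes unnoticed.
    """
    if lhs_type == "IRI" and all(t in NUMERIC_TYPES for t in item_types):
        return False
    return None
```

For the filter above, fold_in_by_type("IRI", ["integer", "integer",
"decimal"]) returns False without ever computing ?v2/?v3, which is
exactly why the result can be false where a strict evaluation would
report an error.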

The reason for the roll of the dice is that the compiler may decide to
sort the list of variants, either to replace a sequence of comparisons
with a binary search (if a result set is filtered with IN) or to get
better table lookup locality (say, if ?s ?p ?o . FILTER (?p IN (values))
drives a sequence of PSO index lookups).
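
The binary-search variant can be sketched in Python as below, assuming
the list holds mutually comparable constants. make_in_filter is a
hypothetical name. Because the list is sorted once up front, items are
no longer examined in the order the query author wrote them.

```python
from bisect import bisect_left


def make_in_filter(values):
    """Build a membership test that uses binary search on a sorted copy."""
    sorted_values = sorted(values)  # reorders the list of variants

    def contains(x):
        # Find the leftmost position where x could be inserted, then
        # check whether the element already there equals x.
        i = bisect_left(sorted_values, x)
        return i < len(sorted_values) and sorted_values[i] == x

    return contains
```

With f = make_in_filter([30, 10, 20]), f(20) is true and f(25) is
false, and each call costs O(log n) comparisons instead of a linear
scan.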

IN should be based on equality. For IN based on SAMETERM, a special
function might be introduced.

If both scalar subqueries and IN are supported then
?expn IN (SELECT...)
should also be supported, of course.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

Received on Monday, 8 February 2010 11:00:41 UTC