RE: Feature request: IN operator from Orri Erling on 2009-03-05 (public-rdf-dawg-comments@w3.org from March 2009)

From: Orri Erling <erling@xs4all.nl>
Date: Thu, 5 Mar 2009 10:10:09 +0100
To: "'Bob MacGregor'" <bob.macgregor@gmail.com>, "'Simon Gibbs'" <simon.gibbs@cantorva.com>
Cc: "'Toby Inkster'" <tai@g5n.co.uk>, <public-rdf-dawg-comments@w3.org>
Message-Id: <200903050910.n259AQQq028052@smtp-vbr8.xs4all.nl>
 

 

Hi

 

The IN predicate is generally a good thing.  Filter should definitely not be  taken to mean filter after generating results.  Filter is just conditions, it says nothing about evaluation order.

 

As a matter of fact,  select ?x where {filter (?x = 42)} does evaluate to 42 with Virtuoso.

 

With SQL, we do all the expected tricks with IN whether with subquery or scalar expression list.  This includes iterating over the IN values before joining if the left side of IN is an indexable column.  Otherwise, we get the values and test for the IN condition after the fact.  

 

In the scalar exp list variant of IN, we allow an expression to return an array and then treat the elements of the array as if they were top level values of the right hand side of IN.  With SPARQL, you could write:

 

select ?syn where { <xxx> ?p ?o . filter (bif:one_of_these (?p, ?interesting)) . ?syn ?p ?o}

 

Syn is anything that shares at least one interesting property value with <xx>.  ?interesting is a parameter bound in an outer context.  bif:one_of_these is just an internal form of IN, as we do not have IN as a special case in the SPARQL parser.  Might as well, for that matter.

 

In many cases, subproperty or subclass inference  will substitute for this.  Being able to pass a variable number of values for IN is good in special cases if one knows exactly what one wants and does not want to materialize these things into some graph.  But this would require importing the XQuery sequence type.  

 

Personally I think this is all fine but it may be that this is too complicated to make it through the committee process.  Can always try, though.  

 

 

 

But if we have IN with scalars we should have IN with subquery also.

 

 

 

The subquery variant of IN resolves to an existence subquery.

 

I.e.  select count (*)from customer where c_id in (select o_c_id from orders where o_date > xxx)

 

is the set  of customers with at least one order newer than xx.

 

This can be said as 

 

select c_id from customer where exists (select * from orders where o_c_id = c_id and o_date > xxx)

 

We optimize all these things by first making these into existences and then recognizing different special cases.

 

In SPARQL, I suppose this would be

 

select ?c where { ?c a <customer> . filter )!unsaid { ?c has_order ?o . ?o has_date ?d . filter (?d > xxx)})}

 

With Virtuoso we do this SQL style with exists and a subquery

 

select ?c where {?c a customer . filter (bif:exists ((select (1) where {?c has_order ?o . ?o has_date ?d . filter (?d >  xxx) })) }

 

But the !unsaid is maybe nicer, more SPARQL like.  But it does not cut it for the scalar subquery.

 

select ?c ((select count (*) where {?c has_order ?o . ?o has_date ?d . filter (?d > xxx)})) where { ?c a customer} order by desc 2;

 

This is the set of customers with new orders, the one with the most new orders first.

 

 

 

Again, the filter says nothing about execution order.  If the cardinality of the recent orders is determined to be less than the cardinality of customers, the new orders becomes the outer loop.

 

 

 

The proposal would then be:

 

variable IN (<expression_list>)

 

or 

 

variable in <scalar subquery>

 

where the latter is a syntax sugar for exists.

The expression klist would be expressions, of which some could evaluate to sequences.  

 

Parameters have already been proposed earlier. 

 

 

 

Orri

 

 

 

 

 

 

  _____  

From: public-rdf-dawg-comments-request@w3.org [mailto:public-rdf-dawg-comments-request@w3.org] On Behalf Of Bob MacGregor
Sent: Wednesday, March 04, 2009 7:13 PM
To: Simon Gibbs
Cc: Toby Inkster; public-rdf-dawg-comments@w3.org
Subject: Re: Feature request: IN operator

 

I want to add my vote for this feature, but with a caveat.  I use it a lot already, because
I'm using a query language more powerful than SPARQL.  The power of the construct
comes from the ability GENERATE a set of variable bindings from a list, not from
FILTERING.  SPARQL loses a great deal of expressive power because of the fact
that FILTER is not supposed to bind variables.  I often find I am forced to make several
queries and then union the results in program code to overcome that limitation.

We have a similar situation here.  The optimal query ordering is usually to first
bind from the IN list, and then use that binding when joining to other clauses.
A counter argument says that FILTER is not really a filter after all; instead, a
query planner should be able to take clauses from within FILTER and apply
them in any order.  If that really were the case (that FILTER as a "filter" is a fiction) then
the following query would work:
    select ?x where { filter (?x = 42) }

But it does not work, and we are all worse off because it does not.  My concern
is that IN will be hamstrung in the same manner as "=".

Cheers, Bob

On Wed, Mar 4, 2009 at 8:31 AM, Simon Gibbs <simon.gibbs@cantorva.com> wrote:

I was about to suggest that inference could handle that particular
example, as far as I can tell the case is handled by sub properties
(happy to be corrected, but with the variable in another position this
feature would be desirable for batch operations involving a list of
subjects or objects of interest.  It is used very frequently in SQL.

Possible use cases would be augmenting, say a few hundred thousand
records about musicians with information about the genre of music they
play (something I am likely to do in the near future).

Another use case might involve retrieving additional columns of data in
a tabular UI, with batch sizes equal to the number of records in the
viewport and the content of the IN group taken from a column bound to an
inverse functional property. This would allow extra columns to be
requested and populated on the fly without performing a fresh query for
the whole table.

Having an explicit syntax would allow implementors to infer the
possibility of a longer list than || might suggest and therefore they
might be able to trigger appropriate optimizations. This might in turn
yield a higher maximum batch size and better overall performance.

FWIW I would prioritize aggregation functions (MIN, MAX, COUNT, AVERAGE,
GROUP BY etc) above this feature as the distribution of graphs allows a
work around for the missing IN (see e.g. Talis' augmentation
functionality) as does  the || operator.


Toby Inkster wrote:
> Analogous to the SQL operator of the same name.
>
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema# <http://www.w3.org/2000/01/rdf-schema> >
> SELECT ?thing ?name
> WHERE {
>   ?thing ?p ?name .
>   FILTER (?p IN (foaf:name foaf:nick rdfs:label))
> }
>
> This query can already be written as
>
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema# <http://www.w3.org/2000/01/rdf-schema> >
> SELECT ?thing ?name
> WHERE {
>   ?thing ?p ?name .
>   FILTER (?p = foaf:name || ?p = foaf:nick || ?p = rdfs:label)
> }
>
> But I hope people agree that the former syntax is more legible.
>
> Formally, IN would be an infix operator taking a term as its first
> argument and an rdf:List as its second argument.
>
>






-- 
=====================================
Robert MacGregor
bob.macgregor@gmail.com
Mobile: 310-469-2810
=====================================
Received on Thursday, 5 March 2009 09:11:11 UTC