- From: Orri Erling <erling@xs4all.nl>
- Date: Thu, 5 Mar 2009 10:10:09 +0100
- To: "'Bob MacGregor'" <bob.macgregor@gmail.com>, "'Simon Gibbs'" <simon.gibbs@cantorva.com>
- Cc: "'Toby Inkster'" <tai@g5n.co.uk>, <public-rdf-dawg-comments@w3.org>
- Message-Id: <200903050910.n259AQQq028052@smtp-vbr8.xs4all.nl>
Hi The IN predicate is generally a good thing. Filter should definitely not be taken to mean filter after generating results. Filter is just conditions, it says nothing about evaluation order. As a matter of fact, select ?x where {filter (?x = 42)} does evaluate to 42 with Virtuoso. With SQL, we do all the expected tricks with IN whether with subquery or scalar expression list. This includes iterating over the IN values before joining if the left side of IN is an indexable column. Otherwise, we get the values and test for the IN condition after the fact. In the scalar exp list variant of IN, we allow an expression to return an array and then treat the elements of the array as if they were top level values of the right hand side of IN. With SPARQL, you could write: select ?syn where { <xxx> ?p ?o . filter (bif:one_of_these (?p, ?interesting)) . ?syn ?p ?o} Syn is anything that shares at least one interesting property value with <xx>. ?interesting is a parameter bound in an outer context. bif:one_of_these is just an internal form of IN, as we do not have IN as a special case in the SPARQL parser. Might as well, for that matter. In many cases, subproperty or subclass inference will substitute for this. Being able to pass a variable number of values for IN is good in special cases if one knows exactly what one wants and does not want to materialize these things into some graph. But this would require importing the XQuery sequence type. Personally I think this is all fine but it may be that this is too complicated to make it through the committee process. Can always try, though. But if we have IN with scalars we should have IN with subquery also. The subquery variant of IN resolves to an existence subquery. I.e. select count (*)from customer where c_id in (select o_c_id from orders where o_date > xxx) is the set of customers with at least one order newer than xx. This can be said as select c_id from customer where exists (select * from orders where o_c_id = c_id and o_date > xxx) We optimize all these things by first making these into existences and then recognizing different special cases. In SPARQL, I suppose this would be select ?c where { ?c a <customer> . filter )!unsaid { ?c has_order ?o . ?o has_date ?d . filter (?d > xxx)})} With Virtuoso we do this SQL style with exists and a subquery select ?c where {?c a customer . filter (bif:exists ((select (1) where {?c has_order ?o . ?o has_date ?d . filter (?d > xxx) })) } But the !unsaid is maybe nicer, more SPARQL like. But it does not cut it for the scalar subquery. select ?c ((select count (*) where {?c has_order ?o . ?o has_date ?d . filter (?d > xxx)})) where { ?c a customer} order by desc 2; This is the set of customers with new orders, the one with the most new orders first. Again, the filter says nothing about execution order. If the cardinality of the recent orders is determined to be less than the cardinality of customers, the new orders becomes the outer loop. The proposal would then be: variable IN (<expression_list>) or variable in <scalar subquery> where the latter is a syntax sugar for exists. The expression klist would be expressions, of which some could evaluate to sequences. Parameters have already been proposed earlier. Orri _____ From: public-rdf-dawg-comments-request@w3.org [mailto:public-rdf-dawg-comments-request@w3.org] On Behalf Of Bob MacGregor Sent: Wednesday, March 04, 2009 7:13 PM To: Simon Gibbs Cc: Toby Inkster; public-rdf-dawg-comments@w3.org Subject: Re: Feature request: IN operator I want to add my vote for this feature, but with a caveat. I use it a lot already, because I'm using a query language more powerful than SPARQL. The power of the construct comes from the ability GENERATE a set of variable bindings from a list, not from FILTERING. SPARQL loses a great deal of expressive power because of the fact that FILTER is not supposed to bind variables. I often find I am forced to make several queries and then union the results in program code to overcome that limitation. We have a similar situation here. The optimal query ordering is usually to first bind from the IN list, and then use that binding when joining to other clauses. A counter argument says that FILTER is not really a filter after all; instead, a query planner should be able to take clauses from within FILTER and apply them in any order. If that really were the case (that FILTER as a "filter" is a fiction) then the following query would work: select ?x where { filter (?x = 42) } But it does not work, and we are all worse off because it does not. My concern is that IN will be hamstrung in the same manner as "=". Cheers, Bob On Wed, Mar 4, 2009 at 8:31 AM, Simon Gibbs <simon.gibbs@cantorva.com> wrote: I was about to suggest that inference could handle that particular example, as far as I can tell the case is handled by sub properties (happy to be corrected, but with the variable in another position this feature would be desirable for batch operations involving a list of subjects or objects of interest. It is used very frequently in SQL. Possible use cases would be augmenting, say a few hundred thousand records about musicians with information about the genre of music they play (something I am likely to do in the near future). Another use case might involve retrieving additional columns of data in a tabular UI, with batch sizes equal to the number of records in the viewport and the content of the IN group taken from a column bound to an inverse functional property. This would allow extra columns to be requested and populated on the fly without performing a fresh query for the whole table. Having an explicit syntax would allow implementors to infer the possibility of a longer list than || might suggest and therefore they might be able to trigger appropriate optimizations. This might in turn yield a higher maximum batch size and better overall performance. FWIW I would prioritize aggregation functions (MIN, MAX, COUNT, AVERAGE, GROUP BY etc) above this feature as the distribution of graphs allows a work around for the missing IN (see e.g. Talis' augmentation functionality) as does the || operator. Toby Inkster wrote: > Analogous to the SQL operator of the same name. > > PREFIX foaf: <http://xmlns.com/foaf/0.1/> > PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema# <http://www.w3.org/2000/01/rdf-schema> > > SELECT ?thing ?name > WHERE { > ?thing ?p ?name . > FILTER (?p IN (foaf:name foaf:nick rdfs:label)) > } > > This query can already be written as > > PREFIX foaf: <http://xmlns.com/foaf/0.1/> > PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema# <http://www.w3.org/2000/01/rdf-schema> > > SELECT ?thing ?name > WHERE { > ?thing ?p ?name . > FILTER (?p = foaf:name || ?p = foaf:nick || ?p = rdfs:label) > } > > But I hope people agree that the former syntax is more legible. > > Formally, IN would be an infix operator taking a term as its first > argument and an rdf:List as its second argument. > > -- ===================================== Robert MacGregor bob.macgregor@gmail.com Mobile: 310-469-2810 =====================================
Received on Thursday, 5 March 2009 09:11:11 UTC