syntax for the algebra - or "shortcuts" for subselect from Axel Polleres on 2010-08-11 (public-rdf-dawg@w3.org from July to September 2010)

From: Axel Polleres <axel.polleres@deri.org>
Date: Wed, 11 Aug 2010 10:50:59 +0100
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <C095C35B-BBB8-4AB6-87E2-0FA869515D2E@deri.org>
(sorry, previous message was unfinished)

Had this in my mind for a while... but didn't have a chance to write it down yet:
Following the discussions around BIND, I have been wondering why we would decouple only project expressions, 
and not also the other operators in the algebra that are currently tied syntactically to (sub)SELECT, namely:

i) ORDER BY
ii) LIMIT
iii) Project expressions (also a recurring issue in the ongoing discussion about assignment, or BIND)
iv) aggregates

All of these have separate operators in the algebra, I think, but no stand-alone syntactic counterpart (i.e., one that can occur outside a (sub)SELECT).

I want to put a preliminary proposal on the table to add dedicated syntax for i)-iv), which: 
 - wouldn't really "add" syntax, but should rather be viewed as shortcuts for current subselect queries 
 - BTW also serves as a syntax proposal for BIND

Here we go:

(0) As a basis for defining the semantics of all this, it might make sense to base the whole evaluation semantics of patterns on solution sequences 
   rather than sets: the jumping back and forth between multisets and sequences (toList/toMultiset) IMO just complicates things. Why not go all 
   the way with sequences and just say that in some cases the order is not deterministic or, respectively, that order may be lost during joins? 

(1) propose to add a syntactic operator:
       Pattern ORDER BY <expr> 
   with the semantics of ordering the solution sequence of Pattern according to the ORDER BY.
   (I see no real reason why I should need a SELECT * around this just to do a subquery that only does ordering)
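
   A minimal Python sketch of the semantics intended for (1), for illustration only. It assumes a
   solution is modelled as a dict from variable name to value and a solution sequence as a list of
   such dicts; the name order_by and the sample data are hypothetical, not part of the proposal.

```python
# Hypothetical model: a solution is a dict {variable: value},
# a solution sequence is a plain list of such dicts.
def order_by(solutions, key_expr):
    """Return the solution sequence sorted by key_expr (a function of one solution)."""
    # sorted() is stable, so solutions with equal keys keep their relative order.
    return sorted(solutions, key=key_expr)


pattern_result = [
    {"doc": "d2", "title": "Beta"},
    {"doc": "d1", "title": "Alpha"},
    {"doc": "d3", "title": "Gamma"},
]
ordered = order_by(pattern_result, lambda s: s["title"])
print([s["doc"] for s in ordered])  # prints ['d1', 'd2', 'd3']
```

   The input sequence is left untouched, which mirrors the idea that the operator simply maps one
   solution sequence to another.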

(2) propose to add a new operator 
       Pattern LIMIT number 
   with the semantics of just limiting the solution sequence of Pattern to its first <number> elements.
   The ordering of the solution sequence of Pattern is preserved.
   (I see no real reason why I should need a SELECT * around this just to do a subquery that only does limiting)
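
   Under the same assumed model (a solution sequence as a list of dicts), (2) could be sketched as
   follows. The function name limit is hypothetical, and rejecting negative numbers is my own
   assumption rather than something stated above.

```python
def limit(solutions, number):
    """Keep the first `number` solutions; the order of the input is preserved."""
    if number < 0:
        # Assumption of this sketch: a negative LIMIT is treated as an error.
        raise ValueError("LIMIT must be non-negative")
    return solutions[:number]


seq = [{"x": 3}, {"x": 1}, {"x": 2}]
first_two = limit(seq, 2)
print(first_two)  # prints [{'x': 3}, {'x': 1}]
```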

(3) propose to add a new operator
     Pattern BIND var AS expr
   with the semantics of extending the solutions in the solution sequence of Pattern by the binding created in the assignment.
   The ordering of the solution sequence of Pattern is preserved.
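
   A possible sketch of (3) in Python, again modelling a solution sequence as a list of dicts. The
   name bind is hypothetical, and raising an error when the variable is already bound is an
   assumption of the sketch; the proposal above does not say what should happen in that case.

```python
def bind(solutions, var, expr):
    """Extend each solution with var -> expr(solution), keeping the sequence order."""
    extended = []
    for s in solutions:
        if var in s:
            # Assumption of this sketch: rebinding an existing variable is rejected.
            raise ValueError(f"variable {var!r} is already bound")
        new = dict(s)  # copy, so the input sequence is not modified
        new[var] = expr(s)
        extended.append(new)
    return extended


people = [
    {"a": "p1", "fullName": "Alice Smith"},
    {"a": "p2", "fullName": "Bob Jones"},
]
# Roughly substring-before(?fullName, " ")
with_first = bind(people, "firstName", lambda s: s["fullName"].partition(" ")[0])
print([s["firstName"] for s in with_first])  # prints ['Alice', 'Bob']
```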

(4) { Pattern } [GROUP BY vars] Agg(expr) AS var
   where Agg is an aggregate function, with the semantics of grouping the solution sequence of Pattern according to 
   the (optional) GROUP BY clause, and extending the solutions in the resulting grouped solution sequence by the binding created by the aggregation. 
   The bindings for the grouped variables are lost/projected away in this. 
   The ordering of the solution sequence of Pattern is lost.
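
   One way to read (4), sketched in Python under the same list-of-dicts model. The name
   group_aggregate is hypothetical. The sketch assumes that the variables listed in GROUP BY
   survive in the result and that all other bindings are projected away, since example A below
   needs ?y to remain available for the join; that reading is my assumption.

```python
def group_aggregate(solutions, group_vars, agg, expr, out_var):
    """Group by group_vars, then bind out_var to agg over expr within each group."""
    groups = {}
    for s in solutions:
        key = tuple(s.get(v) for v in group_vars)  # unbound variables group as None
        groups.setdefault(key, []).append(expr(s))
    result = []
    for key, values in groups.items():
        row = dict(zip(group_vars, key))  # only the GROUP BY variables are kept
        row[out_var] = agg(values)
        result.append(row)
    # Callers should not rely on the order of `result`: ordering is lost by design.
    return result


names = [
    {"y": "bob", "name": "Robert"},
    {"y": "carol", "name": "Carol"},
    {"y": "bob", "name": "Bob"},
]
# Roughly: { ?y :name ?name } GROUP BY ?y MIN(?name) AS ?minName
result = group_aggregate(names, ["y"], min, lambda s: s["name"], "minName")
print(sorted(result, key=lambda r: r["y"]))
# prints [{'y': 'bob', 'minName': 'Bob'}, {'y': 'carol', 'minName': 'Carol'}]
```

   With an empty list of grouping variables, all solutions fall into a single group, which
   corresponds to leaving out the optional GROUP BY clause.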

(5) Of course we'd also leave 
      SELECT vars [WHERE] Pattern 
    for projection.
    The ordering of the solution sequence of Pattern is preserved (or may be lost; I'm not really sure what makes most sense here).

I think that the components for all of 1)-5) are there in the algebra, but at the moment we have to tie each of them to a full subSELECT.

Here are some examples where this IMO could help:

A) from the current draft:

PREFIX : <http://people.example/>
SELECT ?y ?minName
WHERE {
  :alice :knows ?y .
  {
    SELECT ?y (MIN(?name) AS ?minName)
    WHERE {
      ?y :name ?name .
    } GROUP BY ?y
  }
}

could be written:

PREFIX : <http://people.example/>
SELECT ?y ?minName
WHERE {
  :alice :knows ?y .
  { { ?y :name ?name . } GROUP BY ?y MIN(?name) AS ?minName }
}


B) from the test cases:

SELECT ?x ?max WHERE {
{SELECT (max(?y) AS ?max) WHERE {?x ex:p ?y} } 
?x ex:p ?max
}

could be written:

SELECT ?x ?max WHERE {
{ {?x ex:p ?y} max(?y) AS ?max } 
?x ex:p ?max
}

C) "give me the publication titles for the top 3 among people with the most DBLP entries"    

   SELECT ?author ?title ?doc
   FROM <dblp>
   { { SELECT ?author (COUNT(?doc) AS ?count) WHERE { ?doc dc:creator ?author } GROUP BY ?author 
       ORDER BY DESC(?count) LIMIT 3 }
     ?doc dc:creator ?author; dc:title ?title 
   }

could be written:

   SELECT ?author ?title ?doc
   FROM <dblp>
   { { {{ ?doc dc:creator ?author } GROUP BY ?author COUNT(?doc) AS ?count} 
       ORDER BY DESC(?count) LIMIT 3 }
     ?doc dc:creator ?author; dc:title ?title 
   }
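
 To illustrate how the operators in example C compose on solution sequences, here is a rough
 Python sketch. It assumes that "top 3" means the authors with the largest counts (hence a
 descending sort), and it models the data as (doc, author) pairs plus a dict of titles; all
 names and sample data are hypothetical.

```python
from collections import Counter


def top_authors(creator_pairs, n):
    """GROUP BY author with COUNT(doc), then order by count (descending) and limit to n."""
    counts = Counter(author for _doc, author in creator_pairs)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{"author": a, "count": c} for a, c in ranked[:n]]


def join_titles(top, creator_pairs, titles):
    """Join the top authors with { ?doc dc:creator ?author ; dc:title ?title }."""
    result = []
    for sol in top:
        for doc, author in creator_pairs:
            if author == sol["author"] and doc in titles:
                result.append({"author": author, "title": titles[doc], "doc": doc})
    return result


creators = [("d1", "ann"), ("d2", "ann"), ("d3", "bob"),
            ("d4", "ann"), ("d5", "bob"), ("d6", "cat")]
titles = {doc: "Title of " + doc for doc, _author in creators}

top = top_authors(creators, 2)
print([s["author"] for s in top])  # prints ['ann', 'bob']
print([s["doc"] for s in join_titles(top, creators, titles)])
# prints ['d1', 'd2', 'd4', 'd3', 'd5']
```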
 

D) Holger Knublauch's example query from 
   http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2009Nov/0000.html

SELECT ?eMail ?image
WHERE {
  { { ?a a:email ?eMail .
      ?a e:fullName ?fullName }
    BIND ?fullNameSpaceNormalized AS normalize-space(?fullName)
    BIND ?firstName AS substring-before(?fullNameSpaceNormalized, " ")
    BIND ?lastName AS substring-after(?fullNameSpaceNormalized, " ") }
  { { ?b b:firstName ?firstName .
      ?b b:lastName ?lastName .
      ?b b:lastName ?altLastName . } 
    BIND ?altName AS concat(?firstName, " ", ?altLastName )  }
  { { ?c c:fullName ?altName .
      ?c c:studyYears ?lengthOfCourse .
      ?c c:matriculationDate ?matriculate . }
    BIND ?endDate AS year-from-date(add-yearMonthDuration-to-date(?matriculate, ?lengthOfCourse)) }
  { { ?d d:year ?endDate .
      ?d d:fileName ?imageFile . }
    BIND ?image AS xs:anyURI(concat("http://www.example.org/photos", ?imageFile, ".jpg")) }
}

(I don't know whether that would make Holger/Jeremy happy, but it looks pretty close to the assign version.)

Opinions/comments welcome. Even though I won't fight for it, I wanted to bring this up before we close down completely for LC. 
In particular, I'd be interested in the query editors' opinions on whether they think it would require much effort, 
mainly because I think that (0) could potentially simplify the definition of the algebra, but could also mean considerable effort to implement.
Let me also emphasise that 3) could probably be viewed independently of adopting 0), 1), 2), and 4) anyway... 

Axel

Received on Wednesday, 11 August 2010 09:51:31 UTC