Re: [Fwd: major technical: no subqueries] from Jeen Broekstra on 2006-01-13 (public-rdf-dawg@w3.org from January to March 2006)

From: Jeen Broekstra <jeen.broekstra@aduna.biz>
Date: Fri, 13 Jan 2006 11:46:35 +0100
To: Dan Connolly <connolly@w3.org>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <43C7850B.8090506@aduna.biz>

Dan Connolly wrote:

> This seems like a reasonably coherent argument for a new requirement,
> complete with rationale and use case.

First of all, I must say that although I am quite in favor of accepting 
forms of subquerying into the language, and even think that it will be 
necessary, the timing seems to be bad for this.

As a parallel, we added various forms of subqueries to the SeRQL 
language at a later point as well. We did recognize the use cases for 
it from the very beginning but decided to start out small and work from 
there. This has proven to be a happy decision for us.

As a matter of fact, subqueries in SeRQL were added by an intern who was 
doing his MSc project with us. It took him about 2-3 months to get it 
right (but of course, he had little prior experience). Our own estimate 
for an experienced programmer to add such features would be in the order 
of 2 person-weeks. Of course, this is only one data point in a single 
implementation of a query engine and does not take things like query 
optimization into account.

From a language design perspective, I see no great obstacles to 
allowing subqueries in SPARQL, but it must be recognized that the 
added implementation burden is significant. To me, the more logical 
route would be to recognize this as a useful feature and to postpone it, 
for now.

I would also like to point out that there are more forms of subquerying 
than are sketched in the user's comments (for example, things like ANY 
and ALL modifiers, or the IN set membership operator) and I feel that if 
we decide to put this on the critical path, we should take a good look 
at all of these. Which is another good reason to postpone for now, IMHO.

> From: Fred Zemke <fred.zemke@oracle.com>
>
> Section 10.3.2 "Accessing graphs in the RDF dataset"
> observes that it is possible to extract subgraphs of the
> input graphs using elementary CONSTRUCT queries.  Once a user does
> this, he may presumably direct the output to some storage medium,
> assign an IRI to it and then run a query against that extract.
> Or with the right operating system interface, he might be able to
> "pipe" the output of a CONSTRUCT into the FROM clause of another
> SPARQL query. It would be useful to avoid the need for explicitly
> storing or piping the result before performing further queries on
> it.  One way to do this would be to extend the FROM clause to permit
> a CONSTRUCT query as either the default graph or a named graph, for
> example SELECT * FROM ( CONSTRUCT ... ) ...
> 
> This is of course analogous to subqueries and in-line views in SQL. The 
> originators of SQL mistakenly believed that they did not need
> subqueries, so subqueries were not part of the original design.
>
> In the case of SPARQL, perhaps it is true that any query that could be
> written with a CONSTRUCT in the FROM clause could be rewritten to
> avoid it.  However, experience in SQL and other languages shows that
> it is still a good idea to permit composability wherever it makes
> sense semantically, and leave it to the implementation to find the
> optimization.

Our experience with Sesame/SeRQL indicates that even though subqueries 
were not part of the original design, adding them later on was no great 
burden from a design perspective. I expect that a similar path for 
SPARQL will not pose grave dangers.
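
For concreteness, the composition described in the quoted proposal 
(feeding a CONSTRUCT result straight into another query instead of 
storing it first) can be sketched in Python over a toy in-memory triple 
set. This is only an illustration under stated assumptions: the 
function names and the sample data are made up for the example and are 
not part of SPARQL, SeRQL or Sesame.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
# All names below are hypothetical; this is not a real query engine API.

def construct(graph, predicate):
    """Like an elementary CONSTRUCT: keep only triples with this predicate."""
    return {t for t in graph if t[1] == predicate}

def select_objects(graph, subject, predicate):
    """Like a one-pattern SELECT: objects matching (subject, predicate, ?o)."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)

data = {
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:knows", "ex:c"),
    ("ex:a", "dc:date", "2006-01-13"),
}

# Two-step version: materialize the extract, then query it.
extract = construct(data, "foaf:knows")
two_step = select_objects(extract, "ex:a", "foaf:knows")

# Composed version: the inner result feeds the outer query directly,
# which is the role a CONSTRUCT inside FROM would play.
composed = select_objects(construct(data, "foaf:knows"), "ex:a", "foaf:knows")
```

Both versions give the same answer; the composed form simply avoids 
naming and storing the intermediate graph.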

While the use case is compelling and I am quite convinced in general 
that subqueries are useful, perhaps even necessary, I think that at this 
stage we should restrict ourselves to a simple language to encourage 
early adoption rather than aiming for an all-singing-all-dancing spec 
that is significantly harder to write a conforming processor for.

[snip]

> I also advocate another kind of subquery: allow an ASK as a boolean
> expression.  This will provide an alternative way to formulate
> non-existence queries.  For example, the query to find people with no
> dc:date in section 11.2.3.1, currently written as:
> 
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> PREFIX dc: <http://purl.org/dc/elements/1.1/>
> SELECT ?name
> WHERE { ?x foaf:givenName ?name .
>        OPTIONAL { ?x dc:date ?date } .
>        FILTER (!bound(?date)) }
> 
> could be expressed:
> 
> SELECT ?name
> WHERE { ?x foaf:givenName ?name .
>        FILTER ( ! ( ASK ?date WHERE { ?x dc:date ?date } ) ) }
> 
> I think that some might find the formulation using ASK more intuitive.
> (I know, some might disagree.)

I would like to point out that this is actually equivalent to having an 
EXISTS() operator in the language. In SeRQL this would be expressed 
like so:

  SELECT name
  FROM {x} foaf:givenName {name}
  WHERE NOT EXISTS (SELECT date FROM {x} dc:date {date})

I have previously heard voices against having such functions, on the 
argument that they require a closed-world assumption to function. 
Personally I've never found that argument very compelling (I'm no 
logician but IMHO the semantics can be easily rewritten to allow a 
K-like operator - heck, just call it KNOWN instead of EXISTS), but it is 
the same thing.
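
As a rough illustration of why the two formulations coincide, here is a 
small Python sketch over a toy in-memory triple set. It is a sketch 
under assumptions, not how any engine evaluates these queries: the 
function names and sample data are invented for the example, and the 
OPTIONAL handling covers only this one pattern.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
# Function names and data are hypothetical, for illustration only.

def names_optional_not_bound(graph):
    """Mimics: OPTIONAL { ?x dc:date ?date } . FILTER (!bound(?date))"""
    results = []
    for x, p, name in graph:
        if p != "foaf:givenName":
            continue
        dates = [o for s, p2, o in graph if s == x and p2 == "dc:date"]
        # OPTIONAL yields one solution per date, or one with ?date unbound.
        solutions = dates or [None]
        for date in solutions:
            if date is None:  # the !bound(?date) filter
                results.append(name)
    return sorted(results)

def names_not_exists(graph):
    """Mimics: WHERE NOT EXISTS (SELECT date FROM {x} dc:date {date})"""
    return sorted(
        name
        for x, p, name in graph
        if p == "foaf:givenName"
        and not any(s == x and p2 == "dc:date" for s, p2, _ in graph)
    )

people = {
    ("ex:alice", "foaf:givenName", "Alice"),
    ("ex:alice", "dc:date", "2005-04-04"),
    ("ex:bob", "foaf:givenName", "Bob"),
}

via_optional = names_optional_not_bound(people)
via_not_exists = names_not_exists(people)
```

On this data both functions return only the name of the person with no 
dc:date, which is the sense in which the ASK form, the EXISTS form and 
the OPTIONAL/!bound form are the same thing.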

Jeen
-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877
Received on Friday, 13 January 2006 10:47:22 UTC