[ISSUE-8] subqueries and datasets from Lee Feigenbaum on 2009-05-08 (public-rdf-dawg@w3.org from April to June 2009)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Fri, 08 May 2009 00:40:38 -0400
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4A03B7C6.4070104@thefigtrees.net>
(Not sure if the email tag identifying the issue is useful. What do you 
think?)

This email discharges my action 
http://www.w3.org/2009/sparql/track/actions/21

SPARQL queries are executed against an RDF dataset, which contains a 
default graph (possibly constructed from the RDF merge of multiple 
graphs) and zero or more named graphs.

In SPARQL/Query 1.0 (nee SPARQL?), a query's RDF dataset is determined 
by the first of these that applies:

1) The dataset specified in the protocol (via default-graph-uri and 
named-graph-uri)

2) The dataset specified in the query (via FROM and FROM NAMED)

3) An implementation-defined dataset
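
A minimal sketch of that precedence rule, in Python. The Dataset type, 
the resolve_dataset function, and the graph names in the usage lines are 
hypothetical and only illustrate the ordering; they are not taken from 
any spec or implementation.

```python
from typing import NamedTuple, Optional, Tuple


class Dataset(NamedTuple):
    """Hypothetical stand-in for an RDF dataset description."""
    default_graphs: Tuple[str, ...] = ()
    named_graphs: Tuple[str, ...] = ()


def resolve_dataset(protocol: Optional[Dataset],
                    query: Optional[Dataset],
                    implementation_default: Dataset) -> Dataset:
    """Return the first dataset that applies, in the order listed above."""
    # 1) default-graph-uri / named-graph-uri supplied via the protocol
    if protocol is not None:
        return protocol
    # 2) FROM / FROM NAMED supplied in the query
    if query is not None:
        return query
    # 3) whatever the implementation chooses
    return implementation_default


# Usage: the protocol wins over the query's own FROM clause.
from_protocol = Dataset(default_graphs=("g3",))
from_query = Dataset(default_graphs=("g1",))
fallback = Dataset(default_graphs=("store-default",))
chosen = resolve_dataset(from_protocol, from_query, fallback)
```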

At the F2F, we heard 2 different designs for determining what RDF 
dataset a subquery should be executed against.

ARQ - subqueries always use the same dataset used for executing the 
parent (container) query.

Virtuoso - subqueries can specify their own dataset. I wasn't totally 
clear on what the priority of that is with respect to the protocol.

Ignoring the protocol for a second, I think there are two possibilities 
for a query like this one:

# Select details of all recent posts, given a graph which enumerates
# which posts those are. (This is a poor example, it could be done with
# regular GRAPH clauses and doesn't need a subquery.)
SELECT *
FROM ex:all_posts
{
   ?post dc:title ?title .
   {
     SELECT ?post FROM ex:recent_posts { ?post a ex:Post }
   }
}

Option 1: The subquery executes against ex:recent_posts - The rule would 
be that a subquery can specify its own dataset which trumps the parent's 
dataset

Option 2: subqueries can't specify their own dataset - in this case, I'd 
suggest this should be an explicit error
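
The two options can be sketched as a small Python function. This is an 
illustration only: subquery_dataset, SubqueryDatasetError, and the use of 
plain strings for datasets are assumptions made for the sketch, not a 
description of how ARQ or Virtuoso behave.

```python
from typing import Optional


class SubqueryDatasetError(ValueError):
    """Hypothetical error for Option 2: a subquery named its own dataset."""


def subquery_dataset(parent: str, sub_from: Optional[str], option: int) -> str:
    """Pick the dataset a subquery runs against, ignoring the protocol.

    parent   -- dataset of the containing query
    sub_from -- dataset named in the subquery's FROM clause, or None
    option   -- 1 or 2, matching the options described above
    """
    if sub_from is None:
        # Under both options, a subquery without FROM inherits the parent's.
        return parent
    if option == 1:
        # Option 1: the subquery's own dataset trumps the parent's.
        return sub_from
    if option == 2:
        # Option 2: specifying a dataset in a subquery is an explicit error.
        raise SubqueryDatasetError("subqueries cannot specify a dataset")
    raise ValueError("option must be 1 or 2")


# Usage with the example query above.
option1_result = subquery_dataset("ex:all_posts", "ex:recent_posts", 1)
```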

With the protocol, it's a little murkier. The reason the protocol trumps 
the query is so that queries can be easily re-targeted against other 
graphs, without having to parse out any dataset given in the query.

If we go with Option 2 above then this is still easy, since the subquery 
can't specify a dataset.

If we allow subqueries to specify a dataset, and the protocol also 
specifies a dataset, it's unclear what should happen:

Option A: Protocol trumps dataset. This seems inconsistent since we're 
allowing subqueries to have different datasets than their parent, but 
now all of a sudden the protocol forces both parent & child to share the 
same dataset. It's hard to imagine a situation where

   # something useful by sub-querying from g2 instead of g1
   SELECT * FROM :g1 { ... { SELECT * FROM :g2 { ... } } }

all of a sudden makes sense when the protocol forces both parts to be 
issued against g3. That is, there's no way (right now) for the protocol 
to say, override g1 with g3 and g2 with g4.

Option B: Protocol trumps dataset for main query, but datasets 
explicitly in subqueries trump all. This is inconsistent because now the 
protocol can partially retarget a query but can't touch subqueries. 
That's weird.

Option C: Explicitly prohibit the case where the protocol supplies a 
dataset for a query that contains a subquery that explicitly specifies 
its dataset. This works around the problem but is sort of a strange 
prohibition.
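
To make the three options concrete, here is a rough Python sketch of what 
each one would yield for the g1/g2/g3 example. The resolve function, the 
ProtocolConflictError class, and the string dataset names are 
hypothetical; the implementation-defined fallback is left out to keep the 
sketch short.

```python
from typing import Optional, Tuple


class ProtocolConflictError(ValueError):
    """Hypothetical error for Option C."""


def resolve(option: str,
            protocol: Optional[str],
            parent_from: Optional[str],
            sub_from: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (parent dataset, subquery dataset) under option A, B, or C."""
    if option not in ("A", "B", "C"):
        raise ValueError("option must be 'A', 'B', or 'C'")
    if option == "C" and protocol is not None and sub_from is not None:
        # Option C: protocol dataset plus a subquery FROM is prohibited.
        raise ProtocolConflictError(
            "protocol dataset given for a query whose subquery has FROM")
    # In every option the protocol trumps the main query's FROM.
    parent = protocol if protocol is not None else parent_from
    if option == "A" and protocol is not None:
        # Option A: the protocol overrides the subquery's FROM as well.
        return parent, parent
    # Option B (and the non-conflicting cases of A and C): a FROM in the
    # subquery wins; otherwise the subquery inherits the parent's dataset.
    sub = sub_from if sub_from is not None else parent
    return parent, sub


# Usage: SELECT * FROM :g1 { ... { SELECT * FROM :g2 { ... } } }
# issued with the protocol specifying g3.
result_a = resolve("A", "g3", "g1", "g2")
result_b = resolve("B", "g3", "g1", "g2")
```

Under these assumptions Option A runs both parts against g3, Option B 
runs the parent against g3 and the subquery against g2, and Option C 
rejects the request.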


It seems to me that ARQ's behavior is simple and avoids this problem, 
but I'm not sure at what cost. My natural inclination is that it's 
valuable for queries & their subqueries to be able to target different 
graphs.

Current recommendation? Unsure.

Suggested next steps? Determine whether we have reasonable use cases to 
require that subqueries can target different datasets from parent queries.


Lee
Received on Friday, 8 May 2009 04:41:25 UTC