Re: Streaming version of CONSTRUCT from Bob MacGregor on 2005-02-21 (public-rdf-dawg@w3.org from January to March 2005)

From: Bob MacGregor <bmacgregor@siderean.com>
Date: Mon, 21 Feb 2005 08:10:41 -0800
To: andy.seaborne@hp.com
CC: public-rdf-dawg@w3.org
Message-ID: <421A0801.3000004@siderean.com>
Andy,

Seaborne, Andy wrote:

> Bob MacGregor wrote:
>
>> We have an application that is based around, among other things, 
>> efficient
>> retrieval of tree-shaped RDF graphs.  We tried programming this using
>> SELECT combined with OPTIONAL clauses, but the resultant cartesian
>> products made that  approach completely infeasible for large datasets.
>>
>> Our solution was to return an iterator that generates a tree-shaped 
>> graph
>> at each iteration.  The closest notion to this in SPARQL is the
>> CONSTRUCT clause -- using SPARQL, we could specify a tree-shaped 
>> template,
>> and return back a graph containing our trees.  However, this would
>> also have very bad performance, because in our applications, there
>> may be 10,000 trees, but we only want to see the first few (10-20).
>> And then maybe a few more.
>>
>> The problem with CONSTRUCT as currently defined is that it
>> only returns a single graph.  If it were redefined to return a
>> stream of graphs (one per template), then we could get exactly the 
>> efficiency we are
>> looking for.
>>
>> Also, it would be nice if LIMIT were defined to apply to CONSTRUCT
>> as well as select.  Syntactially, it appears to be legal, but the 
>> spec appears
>> not to define what it does.   In this case, LIMIT should specify a 
>> bound on how
>> many times the template can be instantiated.
>> Cheers, Bob
>> -- 
>>
>> Bob MacGregor
>> Chief Scientist
>
>
> Bob,
>
> We had a similar situation in a system we built a couple of years 
> ago.  The tree we wanted to handle coudl not be expressed in CONSTRUCT 
> as there were of variable depths.
>
> It might be that DESCRIBE is a better choice for this particular 
> problem - it does not do what you want directly but it can be used to 
> do it.

Our trees all follow the same "template".  However, I hadn't looked 
closely at the template notion within CONSTRUCT.
Now that I have, I seen that it is practically useless for serious 
applications, due to the fact that it disallows unbound
variables.  The SPARQL assumption, which is simply wrong for the general 
case, is that missing values are somehow erroneous.
In RDF, unlike relational databases, diversity of data should be 
presumed to be the norm, and the language should be
constructed to handle such diversity.  Much better would be a semantics 
for CONSTRUCT that omitted triples that contained
unbound variables, but retained triples that are fully bound, i.e., 
partially instantiated templates would be legal.  This
represents a strict increase in expressive power, since the WHERE clause 
has full control of which parts should be optional,
and which required.

Note: This is an issue where a smart implementor will simply ignore the 
restriction on unbound variables in CONSTRUCT
templates.  By doing so, she is increasing the usability of her product, 
but still obeying the spec for all legal SPARQL
queries.  That gives her and edge, at the expense of a loss in 
consistency of SPARQL implementations.  In general,
any time the DAWG members opt away from orthogonality and generality, 
they run the risk of diluting the uniformity
of the final product.  For example, our implementation will have 
negation, ORDER BY, etc. (essentially, anything in SQL)
whether or not its defined in SPARQL, simply because real applications 
benefit by these constructs.

>
> First, issue a SELECT query to find the graph nodes (assumed to have 
> URIs) of items of interest then issue one DESCRIBE per URI (because 
> each result is a single graph).  Your server has to have processing to 
> calculate the DESCRIBE result as the required tree.  The shape of 
> these results is application-domain dependent.
>
> Depending on your connection technology to the SPARQL service this may 
> work as there are one DESCRIBE request per item unless either your 
> application can pick apart merged graphs (application dependent on the 
> data forms).
>
> - - - - - - -
>
> We should specify whether LIMIT applies to CONSTRUCT (and DESCRIBE) or 
> not.  I thought we did decide that it did at the last F2F (argument: 
> it makes no sense to exclude) but it has not made the document yet.  
> mea culpa.
>
> - - - - - - -
>
> The other issue is the general whether CONSTRUCT should return 
> multiple graphs, not a merged one.  This is a case of there being 
> usages for both - mergeing can reduce the size of results where 
> templates create duplicate triples.
>
> It does seem to be a tension between features of the connection 
> technology where it can handle many small requests and/or package 
> multiple SPARQL requests into a single protocol unit vs a more 
> complicated query language.

There is indeed a tension.  However, the resolution is apparently not 
consistent.  Here, the spec has opted on the side
of small applications that can't scale.  When ORDER BY was disallowed, 
the spec has opted in favor of applications
so large (or lengthly) that sorting is impractical (and didn't give 
users the option to use their own judgement).

It would be rather easy to add a keyword that controls whether or not 
one or many graphs should be generated when
using CONSTRUCT.

>
>     Andy

Bob
Received on Monday, 21 February 2005 16:12:23 UTC