Re: Streaming version of CONSTRUCT from Seaborne, Andy on 2005-02-21 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 21 Feb 2005 17:31:35 +0000
To: Bob MacGregor <bmacgregor@siderean.com>
CC: public-rdf-dawg@w3.org
Message-ID: <421A1AF7.7070304@hp.com>
Bob MacGregor wrote:
> Andy,
> 
> Seaborne, Andy wrote:
> 
> 
>>Bob MacGregor wrote:
>>
>>
>>>We have an application that is based around, among other things, 
>>>efficient
>>>retrieval of tree-shaped RDF graphs.  We tried programming this using
>>>SELECT combined with OPTIONAL clauses, but the resultant cartesian
>>>products made that  approach completely infeasible for large datasets.
>>>
>>>Our solution was to return an iterator that generates a tree-shaped 
>>>graph
>>>at each iteration.  The closest notion to this in SPARQL is the
>>>CONSTRUCT clause -- using SPARQL, we could specify a tree-shaped 
>>>template,
>>>and return back a graph containing our trees.  However, this would
>>>also have very bad performance, because in our applications, there
>>>may be 10,000 trees, but we only want to see the first few (10-20).
>>>And then maybe a few more.
>>>
>>>The problem with CONSTRUCT as currently defined is that it
>>>only returns a single graph.  If it were redefined to return a
>>>stream of graphs (one per template), then we could get exactly the 
>>>efficiency we are
>>>looking for.
>>>
>>>Also, it would be nice if LIMIT were defined to apply to CONSTRUCT
>>>as well as select.  Syntactially, it appears to be legal, but the 
>>>spec appears
>>>not to define what it does.   In this case, LIMIT should specify a 
>>>bound on how
>>>many times the template can be instantiated.
>>>Cheers, Bob
>>>-- 
>>>
>>>Bob MacGregor
>>>Chief Scientist
>>
>>
>>Bob,
>>
>>We had a similar situation in a system we built a couple of years 
>>ago.  The tree we wanted to handle coudl not be expressed in CONSTRUCT 
>>as there were of variable depths.
>>
>>It might be that DESCRIBE is a better choice for this particular 
>>problem - it does not do what you want directly but it can be used to 
>>do it.
> 
> 
> Our trees all follow the same "template".  However, I hadn't looked 
> closely at the template notion within CONSTRUCT.
> Now that I have, I seen that it is practically useless for serious 
> applications, due to the fact that it disallows unbound
> variables.  The SPARQL assumption, which is simply wrong for the general 
> case, is that missing values are somehow erroneous.
> In RDF, unlike relational databases, diversity of data should be 
> presumed to be the norm, and the language should be
> constructed to handle such diversity.  Much better would be a semantics 
> for CONSTRUCT that omitted triples that contained
> unbound variables, but retained triples that are fully bound, i.e., 
> partially instantiated templates would be legal.


To quote the current working draft:

"""
CONSTRUCT with a template
...
If any such instantiation produces a triple containing an unbound variable, or 
an illegal RDF construct (such as a literal in subject or predicate position) 
then that triple is not included in the RDF graph
"""

Only the triple involved, not the whole template, is skipped on unbound variables.

	Andy


   This
> represents a strict increase in expressive power, since the WHERE clause 
> has full control of which parts should be optional,
> and which required.
> 
> Note: This is an issue where a smart implementor will simply ignore the 
> restriction on unbound variables in CONSTRUCT
> templates.  By doing so, she is increasing the usability of her product, 
> but still obeying the spec for all legal SPARQL
> queries.  That gives her and edge, at the expense of a loss in 
> consistency of SPARQL implementations.  In general,
> any time the DAWG members opt away from orthogonality and generality, 
> they run the risk of diluting the uniformity
> of the final product.  For example, our implementation will have 
> negation, ORDER BY, etc. (essentially, anything in SQL)
> whether or not its defined in SPARQL, simply because real applications 
> benefit by these constructs.
> 
> 
>>First, issue a SELECT query to find the graph nodes (assumed to have 
>>URIs) of items of interest then issue one DESCRIBE per URI (because 
>>each result is a single graph).  Your server has to have processing to 
>>calculate the DESCRIBE result as the required tree.  The shape of 
>>these results is application-domain dependent.
>>
>>Depending on your connection technology to the SPARQL service this may 
>>work as there are one DESCRIBE request per item unless either your 
>>application can pick apart merged graphs (application dependent on the 
>>data forms).
>>
>>- - - - - - -
>>
>>We should specify whether LIMIT applies to CONSTRUCT (and DESCRIBE) or 
>>not.  I thought we did decide that it did at the last F2F (argument: 
>>it makes no sense to exclude) but it has not made the document yet.  
>>mea culpa.
>>
>>- - - - - - -
>>
>>The other issue is the general whether CONSTRUCT should return 
>>multiple graphs, not a merged one.  This is a case of there being 
>>usages for both - mergeing can reduce the size of results where 
>>templates create duplicate triples.
>>
>>It does seem to be a tension between features of the connection 
>>technology where it can handle many small requests and/or package 
>>multiple SPARQL requests into a single protocol unit vs a more 
>>complicated query language.
> 
> 
> There is indeed a tension.  However, the resolution is apparently not 
> consistent.  Here, the spec has opted on the side
> of small applications that can't scale.  When ORDER BY was disallowed, 
> the spec has opted in favor of applications
> so large (or lengthly) that sorting is impractical (and didn't give 
> users the option to use their own judgement).
> 
> It would be rather easy to add a keyword that controls whether or not 
> one or many graphs should be generated when
> using CONSTRUCT.
> 
> 
>>    Andy
> 
> 
> Bob
>
Received on Monday, 21 February 2005 17:32:03 UTC