Re: comments on SPARQL Query Language for RDF from Bob MacGregor on 2007-05-31 (public-rdf-dawg-comments@w3.org from May 2007)

From: Bob MacGregor <bmacgregor@siderean.com>
Date: Thu, 31 May 2007 11:07:49 -0700
To: Jeen Broekstra <jeen.broekstra@aduna-software.com>
Cc: Pat Hayes <phayes@ihmc.us>, public-rdf-dawg-comments@w3.org, Eric Prud'hommeaux <eric@w3.org>, Richard Newman <rnewman@franz.com>
Message-Id: <260A8F86-45CA-456E-AF67-FE564A1E24FA@siderean.com>
I think that the fundamental problem relates to the fact that the SPARQL
language is already obsolete even before it has been finished.  This is
because current RDF, and the graph-based notion that it promotes, is also
obsolete.

If one is thinking in terms of quads (e.g., as in Sesame and some major
vendor products) then the notion of blank nodes in context position makes
perfect sense.  However, if you confine your thinking to triples, as Pat
has done (correctly in the context of RDF/SPARQL), then I guess that graph
names may be necessary.

Why is RDF obsolete?  I can point to three serious drawbacks.  The most
immediate is that RDF does not provide a practical means for storing
models containing large numbers of graphs.  The most common way to
serialize/store RDF/XML is as a number of individual graphs, e.g.,
thousands of graphs.  Much better would be an N4 or NQuads syntax, or the
addition of a ":context" attribute to RDF (a sibling to the ":resource"
attribute).  Right now, there is no acceptable standard (that I'm aware
of) for transmitting models containing large numbers of graphs.
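
To make the line-per-quad idea concrete, here is a minimal sketch in
Python of what such a serialization might look like.  The Quad type, the
term() helper, and the example.org IRIs are assumptions made up for
illustration; literals are not handled, and the output is not claimed to
match any standardized syntax.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # graph/context term; may be a blank node label such as "_:c1"


def term(value: str) -> str:
    """Render a term: blank node labels pass through, anything else is treated as an IRI."""
    return value if value.startswith("_:") else f"<{value}>"


def to_quad_lines(quads) -> str:
    """Serialize quads one per line as: subject predicate object context ."""
    return "\n".join(
        f"{term(q.subject)} {term(q.predicate)} {term(q.obj)} {term(q.context)} ."
        for q in quads
    )


# Hypothetical data: one extracted statement plus one provenance statement
# whose subject is the same blank node that serves as the context.
quads = [
    Quad("http://example.org/doc1#alice", "http://example.org/knows",
         "http://example.org/doc1#bob", "_:c1"),
    Quad("_:c1", "http://example.org/extractedFrom",
         "http://example.org/doc1", "_:c1"),
]
print(to_quad_lines(quads))
```

With a format along these lines, thousands of graphs can travel in a
single stream, since the fourth column says which context each statement
belongs to.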

The second drawback is at this point more oblique.  Somewhat over a year
ago, we implemented a quad compression scheme that not only saves
significant space in the presence of large numbers of graphs, but also
resulted in an order-of-magnitude performance improvement on models with a
few million quads.  Analysis showed that the performance differential was
roughly linear in the number of graphs (one graph per document), so for
larger applications, there would have been several orders of magnitude
difference in performance.  We have now embedded the compression into the
quad store, i.e., we can't turn it off anymore.  The compression is
lossless except that it does not preserve graph names (since we use blank
nodes for contexts/graphs, for us it's not a loss).  While I expect it may
take a while for the compression scheme to become widespread, performance
always wins out.

The third drawback is the difference in mindset.  Once you have quads,
combined with aggressive use of multiple dimensions of provenance, the
notion of graphs introduces a dissonance that makes it harder to visualize
what is going on.  Take the notion of the "default" graph containing all
of the triples from all of the graphs.  If we attach security information
to each graph (which we often do), then the only time the "all triples"
notion makes sense is when you run at system high; for all normal cases,
queries only see a subset of the triples belonging to the union of the
graphs.  More preposterous is the FROM NAMED construct.  This makes sense
only if you have a very small number of graphs, and if the names of the
graphs are actually meaningful (not normally the case when you are
seriously into provenance).

Richard Newman's suggestion of FROM NAMED * provides a solution, except
that the right syntax for that would be to eliminate FROM NAMED entirely
and assume the star holds by default.  And we will use GRAPH ?cxt to
reference contexts, except that our own product will permit blank nodes to
bind to the ?cxt argument.
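
As a rough illustration of that behavior, here is a minimal in-memory
sketch in Python.  It is not our implementation; the function names, the
ex: terms, the "_:meta" context holding provenance, and the "visible"
parameter (a stand-in for per-graph security) are all assumptions made up
for the example.

```python
# Quads are (subject, predicate, object, context) tuples of plain strings.
QUADS = [
    ("ex:alice", "ex:knows", "ex:bob", "_:c1"),
    ("ex:carol", "ex:knows", "ex:dave", "_:c2"),
    # Provenance about each context, with the blank node in subject position.
    ("_:c1", "ex:source", "ex:doc1", "_:meta"),
    ("_:c2", "ex:source", "ex:doc2", "_:meta"),
]


def graph_pattern(quads, visible=None):
    """Yield (cxt, s, p, o) rows, roughly GRAPH ?cxt { ?s ?p ?o }.

    Every context is a candidate by default (no FROM NAMED list), and ?cxt
    may bind to a blank node. If `visible` is given, only quads whose
    context is in that set are seen.
    """
    for s, p, o, c in quads:
        if visible is None or c in visible:
            yield (c, s, p, o)


def contexts_with(quads, predicate, value):
    """Return the terms ?cxt for which some quad asserts (?cxt predicate value)."""
    return {s for s, p, o, _ in quads if p == predicate and o == value}


# Select contexts by their provenance rather than by name, then query them.
selected = contexts_with(QUADS, "ex:source", "ex:doc1")
rows = list(graph_pattern(QUADS, visible=selected))
```

The contexts are never addressed by name here; they are picked out by the
provenance statements made about them, which is why blank nodes suffice.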

Cheers, Bob

On May 31, 2007, at 0050, Jeen Broekstra wrote:

> Pat Hayes wrote:
>>
>>> Hi Pat,
>>>
>>> On May 29, 2007, at 1954, Pat Hayes wrote:
>>>
>>>> <snip>
>>>>
>>>
>>>> However, I am at a loss to understand how you refer to these 150,000
>>>> graphs if you have no way to name them. How do you even know how many
>>>> you have?
>>>>
>>>
>>> Each of the graphs consists of triples extracted from a different
>>> document.  The document might be identified by a file name, or a
>>> message ID,
>>> a documentum identifier, or whatever.  The quads for that document
>>> share a common context argument; a blank node.  The same
>>> blank node appears in subject position to record provenance assertions
>>> about the graph (which document, which extractor used,
>>> time of extraction, etc).
>>
>> That works as long as everything is inside the intended scope of the
>> blank node identifier, which is usually a document. But a query is not
>> usually inside the same scope as the graph(s) being queried, so to use
>> the blank node as an identifier in the query is (usually) impossible.
>
> Allow me to jump in at this point with my personal POV.
>
> I think you overlook the fact that you can address blank nodes
> 'existentially' from a query, e.g. "give me the triples from the graph
> identified with the source property ex:foo and value ex:bar" :
>
>  SELECT ?x ?y ?z
>  WHERE {
>     ?g ex:foo ex:bar .
>     GRAPH ?g { ?x ?y ?z . }
>  }
>
> Surely in this kind of pattern ?g could well be allowed to be bound to a
> blank node. However, this is currently not possible in SPARQL because it
> explicitly requires that a graph name is a URI.
>
> FWIW we have implementation experience with allowing blank nodes here,
> because that is exactly what Sesame does; we call the mechanism
> 'context' rather than 'named graph', by the way since the notion of
> 'naming' indeed tends to suggest that it is an actual *name*.
>
> I don't think it's a matter of scope, because the scope of the blank
> node is still the original dataset, it is not directly addressed from
> the query.
>
>
> Cheers,
>
> Jeen
> -- 
> Aduna - Guided Exploration
> www.aduna-software.com
>
> Prinses Julianaplein 14-b
> 3817 CS Amersfoort
> The Netherlands
> +31-33-4659987 (office)

Bob MacGregor
Chief Scientist
Siderean Software, Inc.
310.647.5690
bmacgregor@siderean.com
Received on Thursday, 31 May 2007 18:08:25 UTC