Re: comments on SPARQL Query Language for RDF from Richard Newman on 2007-05-28 (public-rdf-dawg-comments@w3.org from May 2007)

From: Richard Newman <rnewman@franz.com>
Date: Sun, 27 May 2007 23:54:58 -0700
To: Bob MacGregor <bmacgregor@siderean.com>
Cc: public-rdf-dawg-comments@w3.org, Eric Prud'hommeaux <eric@w3.org>
Message-Id: <9828455C-1642-4216-A853-001658669229@franz.com>

Bob, DAWG, folks,

I'm going to weigh in here, because I have implementation experience,  
and I'm practically mentioned in a FROM NAMED clause. My apologies in  
advance to the DAWG for stepping on toes.

On 24 May 2007, at 7:29 AM, Bob MacGregor wrote:

> At one point in SPARQL's evolution, the language introduced a SOURCE
> operator that allowed for a context argument that could be either a
> variable or a constant. The SOURCE construct effectively treats
> contexts as first-class entities. The currently-adopted named graphs
> notion treats contexts as second-class objects. The SOURCE operator
> is consistent with a fully-functional quad implementation; the named
> graph notion is much more limited. The principal advantage of the
> named graph notion is that it is only a small extension beyond the
> traditional RDF spec.

In what way is GRAPH limited? It's merely a syntactic extension of  
Turtle to allow a fourth field to be specified:

GRAPH ?foo {
   ?x ?y ?z .
   GRAPH x:y {
     ?a ?b ?c .
   }
}

is fine. (Indeed, in AllegroGraph we expand that into quads internally:

?foo ?x ?y ?z .
x:y  ?a ?b ?c .)
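
To make that expansion concrete, here is a minimal Python sketch of flattening nested GRAPH blocks into quad patterns. It is not AllegroGraph's actual code: the Triple and Graph classes, the to_quads function, and the DEFAULT marker for patterns outside any GRAPH block are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical marker for patterns that appear outside any GRAPH block.
DEFAULT = "DEFAULT"


@dataclass
class Triple:
    s: str
    p: str
    o: str


@dataclass
class Graph:
    name: str  # a variable such as "?foo" or a constant such as "x:y"
    body: List[Union["Graph", Triple]] = field(default_factory=list)


Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)


def to_quads(patterns, graph=DEFAULT) -> List[Quad]:
    """Flatten nested GRAPH blocks; the innermost GRAPH supplies the fourth field."""
    quads = []
    for item in patterns:
        if isinstance(item, Graph):
            # Recurse with the inner graph name replacing the outer one.
            quads.extend(to_quads(item.body, item.name))
        else:
            quads.append((graph, item.s, item.p, item.o))
    return quads


# The nested pattern from the message above.
query = [
    Graph("?foo", [
        Triple("?x", "?y", "?z"),
        Graph("x:y", [Triple("?a", "?b", "?c")]),
    ])
]

quads = to_quads(query)
for quad in quads:
    print(" ".join(quad))
```

The two quads it produces correspond to the two expanded lines shown above, with the graph term in the first position.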

If your implementation allows you to use an unrestricted dataset  
(i.e., you don't have to enumerate your graphs/sources using FROM  
NAMED), I can't even see a problem there... and the dataset issue  
applies equally to SOURCE.

SOURCE heavily restricts a SPARQL implementation, forcing it to track  
provenance (whither programmatically generated triples?), or fail  
queries that try to use SOURCE. GRAPH provides instead a generic  
fourth field; the particular endpoint can choose what that field is  
used for.

I'd choose flexibility over specificity.  GRAPH > SOURCE.

> However, major commercial vendors are implementing full support for
> quads. Franz's AllegroGraph has a quad implementation (actually, they
> mentioned quints, but the fifth argument is internal), Kowari/Tucana
> implements full quads, and Siderean's Seamark Navigator (my own
> company) has full quads. The reason for this is that full quads
> enable performant implementations of provenance information and named
> graphs do not.

I should point out that, in AllegroGraph, the fourth field of the  
quad is used to implement named graphs (though it can be used for  
other things, too), and the AllegroGraph SPARQL interface uses GRAPH  
to query the fourth field: quad-fourth-fields and named graphs *are  
the same thing*.

If you want to use the graph field to track provenance, you can: when  
you're querying through SPARQL on AllegroGraph, and tracking  
provenance in the graph argument, GRAPH acts exactly like SOURCE --  
but you can use it for other things, too, if you'd prefer to use it  
for access control, or geocoding, or inference.
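
As a rough illustration of GRAPH behaving like SOURCE when the graph field holds provenance, here is a toy in-memory quad matcher in Python. Everything in it is an assumption made for the example: the STORE data, the src: and ex: names, and the match and is_var helpers are invented, and a real store would use indices rather than a linear scan.

```python
from typing import Dict, Iterator, List, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)

# Made-up data: here the graph field records where each triple came from.
STORE: List[Quad] = [
    ("src:feed-a", "ex:alice", "foaf:knows", "ex:bob"),
    ("src:feed-b", "ex:alice", "foaf:knows", "ex:carol"),
    ("src:feed-a", "ex:bob", "foaf:name", '"Bob"'),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match(pattern: Quad, store: List[Quad]) -> Iterator[Dict[str, str]]:
    """Yield one binding dict per stored quad that matches the quad pattern."""
    for quad in store:
        bindings: Dict[str, str] = {}
        for want, have in zip(pattern, quad):
            if is_var(want):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield bindings


# GRAPH ?src { ex:alice foaf:knows ?who }: where did each fact come from?
by_source = list(match(("?src", "ex:alice", "foaf:knows", "?who"), STORE))

# GRAPH src:feed-a { ?s ?p ?o }: what did this one source assert?
from_feed_a = list(match(("src:feed-a", "?s", "?p", "?o"), STORE))
```

Nothing in the matcher knows that the first field means "source"; swap the src: values for access-control labels and the same two queries answer access-control questions instead.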

I have personally implemented a system to do full access control and  
provenance using the named graph support in AllegroGraph. I don't see  
any way in which "full quads" are different to having a graph slot in  
a 'triple': both of them give an additional field in which to store  
information. "Named graphs" is merely a suggestion about how you
might want to use the fourth field: to cluster triples together
"under" some URI. SOURCE, on the other hand, is a *requirement* that  
an implementation track provenance in a fourth (or fifth) field.

I suspect that you are blinkered by one possible approach to named  
graphs: having a separate model per graph, with performance penalties  
when crossing between models, or using many models. One could just as  
easily build an RDF store that has a separate model for each  
property: that doesn't mean that the design of SPARQL is wrong, only  
that that particular implementation does not adequately support the  
use case you are envisioning.

> What we have here is a case where the serious commercial vendors, who
> care about performance, have chosen a direction different than the
> one adopted by SPARQL. My suggestion is to resurrect the SOURCE
> construct in SPARQL.

We added flexible named graphs in AllegroGraph 2.0 because customers  
wanted them. AllegroGraph's design made it easy to do so, and the  
graph field is fully indexed, just like s/p/o. Some customers want to  
use the graph field for other purposes, and we facilitate that, but  
"graph" is a good default interpretation of the fourth field of a  
triple.
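
For readers who want a picture of what "indexed just like s/p/o" can mean, here is a small Python sketch that gives the graph field the same kind of lookup table as the other three fields. This is purely illustrative and says nothing about AllegroGraph's real index structures; the QuadIndex class, its add and lookup methods, and the g/s/p/o field names are assumptions of the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)
FIELDS = ("g", "s", "p", "o")


class QuadIndex:
    """Toy store keeping one value-to-rows index per field, graph included."""

    def __init__(self) -> None:
        self.quads: List[Quad] = []
        self.index: Dict[str, Dict[str, Set[int]]] = {
            f: defaultdict(set) for f in FIELDS
        }

    def add(self, g: str, s: str, p: str, o: str) -> None:
        row = len(self.quads)
        self.quads.append((g, s, p, o))
        # The graph field is indexed in exactly the same way as s, p and o.
        for f, value in zip(FIELDS, (g, s, p, o)):
            self.index[f][value].add(row)

    def lookup(self, **constants: str) -> List[Quad]:
        """Return quads whose named fields equal the given constants."""
        rows: Optional[Set[int]] = None
        for f, value in constants.items():
            hits = self.index[f].get(value, set())
            rows = hits if rows is None else rows & hits
        if rows is None:
            # No constraints given: every quad qualifies.
            return list(self.quads)
        return [self.quads[i] for i in sorted(rows)]


store = QuadIndex()
store.add("x:g1", "ex:a", "ex:p", "ex:b")
store.add("x:g2", "ex:a", "ex:p", "ex:c")
store.add("x:g1", "ex:c", "ex:q", "ex:a")

in_g1 = store.lookup(g="x:g1")
a_in_g2 = store.lookup(g="x:g2", s="ex:a")
```

A lookup constrained only on g costs the same as one constrained only on s, which is the sense in which the fourth field is not second-class in this sketch.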

Can you give a use case or two that SOURCE allows, but GRAPH does  
not? I believe that that is a motivating factor for the WG. I'd also  
love to hear ways in which AllegroGraph -- one of your mentioned  
"serious commercial" products -- is moving away from the conceptual  
direction of SPARQL, because I put a fair amount of effort into  
ensuring that it does not.

> In choosing named graphs, it has chosen an impoverished solution that
> satisfies only one aspect of provenance, while major vendors are
> taking a more enlightened approach, full quads, that supports all
> manner of provenance information. In the long run, performance always
> wins out; quads are going to make named graphs a footnote.

Unless I'm misunderstanding you, I think you're arguing across  
yourself. Named graphs are not necessarily different to quads: in  
AllegroGraph, for instance, they are exactly the same. Think of named  
graphs as merely a suggested application of quads, and your objection  
goes away.

I still fail to see how SOURCE is more "enlightened" or performant  
than GRAPH. I look forward to your explanation.

Regards,

-Richard
Received on Monday, 28 May 2007 06:55:19 UTC