Re: comments on SPARQL Query Language for RDF from Bob MacGregor on 2007-05-28 (public-rdf-dawg-comments@w3.org from May 2007)

From: Bob MacGregor <bmacgregor@siderean.com>
Date: Mon, 28 May 2007 13:39:35 -0700
To: Richard Newman <rnewman@franz.com>
Cc: public-rdf-dawg-comments@w3.org, Eric Prud'hommeaux <eric@w3.org>
Message-Id: <2012C221-748F-4F43-BA56-7D08F60B8DF1@siderean.com>

Hi Richard,

Point 1:  I admit to a small mistake.  What I liked about the erstwhile
SOURCE construct was that it allowed the fourth (context) argument to be
a variable.  The supposition was that the value of that variable would
be an indicator of the "source" of the matching statements, but there
was no machinery that would have prevented us from attaching arbitrary
provenance to the resources bound to that variable.  So, our intent was
to use SOURCE as a springboard for full four-valued statements.  My
mistake was in not mentioning that we were championing the syntax, not
the rather restricted usage that was thought to be associated with it.

Point 2:  Any system that assumes that contexts have to have names is
ultimately non-scalable.  We have empirical data on this (which I have
mentioned in prior conversations with Franz's Jans Aasman), but I'm not
going to dwell on it here.  The basic principle is that the semantics
lies in the provenance attached to the contexts, not in the names of
the contexts.  That means that (1) the quad store must admit blank nodes
as contexts as well as URIs (I assume that AllegroGraph is fine with
that), and (2) the query language must allow contexts to be variables
(which is where I believe SPARQL falls on its face).
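
The two requirements can be illustrated with a minimal sketch, assuming a
toy in-memory quad store.  The Quad tuple, the match helper, and the ex:
and _: identifiers are all hypothetical; none of this is the API of
Seamark, AllegroGraph, or any SPARQL engine.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # a URI or a blank node label such as "_:c1"


def match(quads: Iterable[Quad],
          s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None, c: Optional[str] = None) -> Iterator[Quad]:
    """Yield quads matching the pattern; None acts as a variable in any position."""
    for q in quads:
        if ((s is None or q.subject == s)
                and (p is None or q.predicate == p)
                and (o is None or q.obj == o)
                and (c is None or q.context == c)):
            yield q


store = [
    Quad("ex:alice", "ex:knows", "ex:bob", "_:c1"),         # blank node context
    Quad("ex:alice", "ex:knows", "ex:carol", "ex:graph2"),  # URI context
    Quad("_:c1", "dc:source", "ex:crawler", "ex:meta"),     # provenance about a context
]

# The context position is left unbound, so the match itself supplies its value.
contexts = sorted(q.context for q in match(store, s="ex:alice", p="ex:knows"))
```

Here the context is treated like any other position in the pattern, and
the blank node context carries its meaning through the dc:source
statement made about it rather than through its label.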

Point 3:  All examples I've seen in SPARQL show finite enumerations of
named graphs in FROM NAMED clauses.  Some of our applications work with
tens of thousands of different contexts (and we are just warming up).
If the GRAPH construct remedies that, then that would be good news.
However, there is no example in
     http://www.w3.org/TR/rdf-sparql-query
or anywhere else that I've happened upon that illustrates the use of the
context argument as a variable, bound to provenance restrictions (e.g.,
using a dc:source or dc:date property).  If you can show me an example,
especially one that I can run through SPARQLer, I would appreciate it.
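
The computation being asked for can be sketched outside SPARQL.  The
following is a minimal Python illustration, assuming a toy store with
10,000 anonymous contexts and a dc:date recorded about each context.
The data, the statements_since helper, and the ex: identifiers are all
hypothetical and stand in for no product's actual interface.

```python
from datetime import date

# Hypothetical data: 10,000 anonymous contexts, each holding one statement,
# with a dc:date value recorded about the context itself.
quads = []          # (subject, predicate, object, context)
context_date = {}   # context -> dc:date value
for i in range(10_000):
    ctx = f"_:c{i}"
    quads.append((f"ex:doc{i}", "ex:mentions", "ex:sparql", ctx))
    context_date[ctx] = date(2007, 1 + i % 12, 1)


def statements_since(quads, context_date, cutoff):
    """Bind the context as a variable, then keep bindings whose dc:date >= cutoff."""
    return [
        (s, p, o, c)
        for (s, p, o, c) in quads
        if c in context_date and context_date[c] >= cutoff
    ]


# No context is enumerated by name; the provenance restriction selects them.
recent = statements_since(quads, context_date, date(2007, 12, 1))
```

The point of the sketch is that the set of contexts is never listed; it
falls out of the restriction on the provenance attached to them.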

Cheers, Bob

On May 27, 2007, at 2354, Richard Newman wrote:

> Bob, DAWG, folks,
>
> I'm going to weigh in here, because I have implementation  
> experience, and I'm practically mentioned in a FROM NAMED clause.  
> My apologies in advance to the DAWG for stepping on toes.
>
> On  24 May 2007, at 7:29 AM, Bob MacGregor wrote:
>
>> At one point in SPARQL's evolution, the language introduced a  
>> SOURCE operator that allowed for a
>> context argument that could be either a variable or a constant.   
>> The SOURCE construct effectively
>> treats contexts as first-class entities.  The currently-adopted  
>> named graphs notion treats contexts
>> as second-class objects.  The SOURCE operator is consistent with a  
>> fully-functional quad
>> implementation; the named graph notion is much more limited.  The  
>> principal advantage of the
>> named graph notion is that it is only a small extension beyond the  
>> traditional RDF spec.
>
> In what way is GRAPH limited? It's merely a syntactic extension of  
> Turtle to allow a fourth field to be specified:
>
> GRAPH ?foo {
>   ?x ?y ?z .
>   GRAPH x:y {
>     ?a ?b ?c .
>   }
> }
>
> is fine. (Indeed, in AllegroGraph we expand that into quads  
> internally:
>
> ?foo ?x ?y ?z .
> x:y  ?a ?b ?c .)
>
> If your implementation allows you to use an unrestricted dataset  
> (i.e., you don't have to enumerate your graphs/sources using FROM  
> NAMED), I can't even see a problem there... and the dataset issue  
> applies equally to SOURCE.
>
> SOURCE heavily restricts a SPARQL implementation, forcing it to  
> track provenance (whither programmatically generated triples?), or  
> fail queries that try to use SOURCE. GRAPH provides instead a  
> generic fourth field; the particular endpoint can choose what that  
> field is used for.
>
> I'd choose flexibility over specificity.  GRAPH > SOURCE.
>
>> However, major commercial vendors are implementing full support  
>> for quads.  Franz's AllegroGraph has
>> a quad implementation (actually, they mentioned quints, but the  
>> fifth argument is internal),
>> Kowari/Tucana implements full quads, and Siderean's Seamark Navigator
>> (my own company) has full quads.  The reason for this is that full  
>> quads enable performant implementations of
>> provenance information and named graphs do not.
>
> I should point out that, in AllegroGraph, the fourth field of the  
> quad is used to implement named graphs (though it can be used for  
> other things, too), and the AllegroGraph SPARQL interface uses  
> GRAPH to query the fourth field: quad-fourth-fields and named  
> graphs *are the same thing*.
>
> If you want to use the graph field to track provenance, you can:  
> when you're querying through SPARQL on AllegroGraph, and tracking  
> provenance in the graph argument, GRAPH acts exactly like SOURCE --  
> but you can use it for other things, too, if you'd prefer to use it  
> for access control, or geocoding, or inference.
>
> I have personally implemented a system to do full access control  
> and provenance using the named graph support in AllegroGraph. I  
> don't see any way in which "full quads" are different to having a  
> graph slot in a 'triple': both of them give an additional field in  
> which to store information. "Named graphs" is merely a suggestion
> about how you might want to use the fourth field: to cluster  
> triples together "under" some URI. SOURCE, on the other hand, is a  
> *requirement* that an implementation track provenance in a fourth  
> (or fifth) field.
>
> I suspect that you are blinkered by one possible approach to named  
> graphs: having a separate model per graph, with performance  
> penalties when crossing between models, or using many models. One  
> could just as easily build an RDF store that has a separate model  
> for each property: that doesn't mean that the design of SPARQL is  
> wrong, only that that particular implementation does not adequately  
> support the use case you are envisioning.
>
>> What we have here is a case where the serious commercial vendors,
>> who care about performance,
>> have chosen a direction different than the one adopted by
>> SPARQL.  My suggestion is to resurrect
>> the SOURCE construct in SPARQL.
>
> We added flexible named graphs in AllegroGraph 2.0 because  
> customers wanted them. AllegroGraph's design made it easy to do so,  
> and the graph field is fully indexed, just like s/p/o. Some  
> customers want to use the graph field for other purposes, and we  
> facilitate that, but "graph" is a good default interpretation of  
> the fourth field of a triple.
>
> Can you give a use case or two that SOURCE allows, but GRAPH does  
> not? I believe that that is a motivating factor for the WG. I'd  
> also love to hear ways in which AllegroGraph -- one of your  
> mentioned "serious commercial" products -- is moving away from the  
> conceptual direction of SPARQL, because I put a fair amount of  
> effort into ensuring that it does not.
>
>> In choosing named graphs, it has chosen
>> an impoverished solution that satisfies only one aspect of  
>> provenance,  while major vendors are
>> taking a more enlightened approach, full quads, that supports all  
>> manner of provenance information.
>> In the long run, performance always wins out; quads are going to  
>> make named graphs a footnote.
>
> Unless I'm misunderstanding you, I think you're arguing across  
> yourself. Named graphs are not necessarily different to quads: in  
> AllegroGraph, for instance, they are exactly the same. Think of  
> named graphs as merely a suggested application of quads, and your  
> objection goes away.
>
> I still fail to see how SOURCE is more "enlightened" or performant  
> than GRAPH. I look forward to your explanation.
>
> Regards,
>
> -Richard
>
>

Bob MacGregor
Chief Scientist
Siderean Software, Inc.
310.647.5690
bmacgregor@siderean.com
Received on Monday, 28 May 2007 20:39:45 UTC