Re: comments on SPARQL Query Language for RDF

PFPS has recently made an assessment of the SPARQL specification, and  
found it wanting.  Additionally,
I chatted with Pat Hayes and he confirmed some of my own assessments  
about the language.  So I will
take this occasion to reiterate my major objections to the SPARQL  
language.

The most serious are:

- There is no model-theoretic definition of SPARQL semantics.  This  
is not because there could not be one,
but because the committee has chosen not to produce one.

- The notion of named graphs does not support a performant  
implementation of security.  In contrast, a quad
approach that treats the context "argument" as a first-class entity does
admit an efficient implementation.

- The SPARQL UNBOUND operator is strictly less expressive than a  
negation as failure operator (contrary to what has
been asserted by one of the prominent SPARQL proponents).  Also,  
unlike the rest of the language, UNBOUND
may not admit of a model-theoretic semantics (I don't claim to know
whether it does or does not).

Below I will discuss each of these in turn:

MODEL-THEORETIC SEMANTICS

The current SPARQL semantics apparently derives from an algebraic  
specification that says what you
get when you run a SPARQL query, rather than what the answer actually  
"means".  The SPARQL spec
is procedural rather than declarative, analogous to a programming  
language spec (like a definition of the
Java semantics).  If it were not possible to craft a declarative
semantics, then a procedural semantics would
be acceptable.  But a declarative semantics is possible.

For most of the SPARQL language, a declarative semantics can be  
derived in straightforward fashion.  The two
constructs that pose the most difficulty are OPTIONAL and UNBOUND.   
The semantics of OPTIONAL has
been compared to that of the SQL outer join.  This analogy is not  
well-founded.  The SQL outer join
is, as far as I can determine, a procedural operator.  What I mean by  
that is that the order of
evaluation of SQL outer joins matters; if you permute two of them,  
the answer may change.  This is
not true of OPTIONAL, i.e., OPTIONAL is better-behaved.    I produced  
an outline for a
declarative semantics for OPTIONAL quite a while back (during the  
time when OPTIONAL semantics
were being debated on the SPARQL emails).  I cite this not because I  
claim that my semantics should
be adopted, but because it demonstrates that a declarative semantics  
for OPTIONAL is feasible.
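
To make the order-of-evaluation point concrete, here is a minimal Python
sketch. It is not the committee's algebra and not my earlier outline; it
simply assumes that OPTIONAL can be read as a left join over sets of variable
bindings. The helper names (compatible, left_join, canon) and the toy data are
made up for illustration. It shows only that two independent OPTIONAL clauses
give the same answer in either order on this data; it proves nothing about the
general case.

```python
def compatible(a, b):
    """Two binding sets are compatible if they agree on every shared variable."""
    return all(b[k] == v for k, v in a.items() if k in b)


def left_join(left, right):
    """Extend each left row with every compatible right row; keep it as-is if none match."""
    out = []
    for l in left:
        matches = [{**l, **r} for r in right if compatible(l, r)]
        out.extend(matches if matches else [l])
    return out


def canon(rows):
    """Order-insensitive form of a result, so two results can be compared."""
    return sorted(tuple(sorted(r.items())) for r in rows)


# Hypothetical bindings: the required pattern binds ?x, each OPTIONAL adds one variable.
people = [{"x": "alice"}, {"x": "bob"}]
emails = [{"x": "alice", "mail": "a@example.org"}]
phones = [{"x": "bob", "phone": "555-0100"}]

mail_then_phone = left_join(left_join(people, emails), phones)
phone_then_mail = left_join(left_join(people, phones), emails)

print(canon(mail_then_phone) == canon(phone_then_mail))
```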

The UNBOUND operator may or may not admit of a declarative  
semantics.  However, while OPTIONAL is
a well-crafted operator, UNBOUND is a hack that the language would be  
better off without (see discussion
below).  Thus, my suggestion would be to drop it from the language,  
especially if a declarative semantics for it
isn't possible.

QUADS AND SECURITY

At the (not-quite-concluded) Semantic Technology conference, Eric  
Monk and Kevin Smith presented
an assessment of 5 different approaches to implementing security in  
an RDF system.  I suggested a
sixth at the conclusion of the talk.  Two of the six are consistent  
with SPARQL; the others are not.   One
of these two relies on triple reification; the other relies on named  
graphs.  These two each incur heavy penalties
in terms of both space and performance.  In other words, among the  
six, they are the least desirable.

At one point in SPARQL's evolution, the language introduced a SOURCE  
operator that allowed for a
context argument that could be either a variable or a constant.  The  
SOURCE construct effectively
treats contexts as first-class entities.  The currently-adopted named  
graphs notion treats contexts
as second-class objects.  The SOURCE operator is consistent with a  
fully-functional quad
implementation; the named graph notion is much more limited.  The  
principal advantage of the
named graph notion is that it is only a small extension beyond the  
traditional RDF spec.
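
As a rough illustration of what "first-class context" means, the following
Python sketch treats the context as a fourth position that is matched exactly
like subject, predicate, and object, so it can be a constant or a variable.
This is an assumption-laden toy: it is not the syntax of SOURCE, not any
vendor's implementation, and the function name match and the ex: identifiers
are invented for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple

# (subject, predicate, object, context)
Quad = Tuple[str, str, str, str]


def match(quads: Iterable[Quad],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None,
          c: Optional[str] = None) -> Iterator[Quad]:
    """Yield quads matching the pattern; None means 'variable', anything else 'constant'."""
    for q in quads:
        if all(want is None or want == got for want, got in zip((s, p, o, c), q)):
            yield q


quads = [
    ("ex:doc1", "ex:author", "ex:alice", "ex:hr"),
    ("ex:doc1", "ex:salary", "100", "ex:payroll"),
    ("ex:doc2", "ex:author", "ex:bob", "ex:hr"),
]

# Context as a constant: everything asserted in one context.
in_hr = list(match(quads, c="ex:hr"))

# Context as a variable: which contexts say anything about ex:doc1?
contexts_for_doc1 = sorted({q[3] for q in match(quads, s="ex:doc1")})

print(len(in_hr), contexts_for_doc1)
```

A security check in this style is just one more constraint on the fourth
position, which is why the context needs to be usable both ways.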

However, major commercial vendors are implementing full support for  
quads.  Franz's AllegroGraph has
a quad implementation (actually, they mentioned quints, but the fifth  
argument is internal),
Kowari/Tucana implements full quads, and Siderean's Seamark Navigator
(my own company) has full quads.  The reason for this is that full  
quads enable performant implementations of
provenance information and named graphs do not.  Security is only one  
aspect of provenance; I cite
it because any serious implementation of a triple/quad store will  
include a performant security component.
AllegroGraph, Tucana, and Seamark all have security built-in.

What we have here is a case where the serious commercial vendors, who
care about performance,
have chosen a direction different than the one adopted by SPARQL.
My suggestion is to resurrect
the SOURCE construct in SPARQL.

UNBOUND

The introduction of a negation-as-failure construct was considered  
and rejected by the SPARQL committee.
Instead, they invented a hack, the UNBOUND operator.  This is a  
mistake.  UNBOUND is strictly
less expressive than UNSAID (or whatever you may call the
negation-as-failure operator).  In Seamark,
we implement a closed-world version of universal quantification using  
a double negation (e.g., there does
not exist a value that does not have type X).  This construct cannot be
emulated using UNBOUND (at the
conference, Pat verbally agreed with this claim).
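
The double-negation idiom can be sketched in a few lines of Python. This is
only an illustration of the logical shape under a closed-world reading of a
small triple set; it is not Seamark's query language, and the names unsaid,
all_contents_have_type, and the ex: data are hypothetical.

```python
# Toy closed-world data: anything not listed here is treated as false.
triples = {
    ("ex:box1", "ex:contains", "ex:a"),
    ("ex:box1", "ex:contains", "ex:b"),
    ("ex:box2", "ex:contains", "ex:a"),
    ("ex:box2", "ex:contains", "ex:c"),
    ("ex:a", "rdf:type", "ex:X"),
    ("ex:b", "rdf:type", "ex:X"),
}


def unsaid(pattern_holds: bool) -> bool:
    """Negation as failure: true exactly when the pattern found no match in the data."""
    return not pattern_holds


def all_contents_have_type(box: str, cls: str) -> bool:
    """'There does not exist a contained value that does not have type cls.'"""
    return unsaid(any(
        unsaid((value, "rdf:type", cls) in triples)   # inner negation
        for (s, p, value) in triples
        if s == box and p == "ex:contains"
    ))                                                # outer negation


print(all_contents_have_type("ex:box1", "ex:X"))
print(all_contents_have_type("ex:box2", "ex:X"))
```

The point of the sketch is the nesting: the inner negation sits inside the
scope of the outer one, which is the part that needs a real negation-as-failure
operator.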

A model-theoretic semantics can be crafted for UNSAID.  There may or  
may not be one for UNBOUND.  In
any case, UNSAID explicitly endorses negation as failure, while  
UNBOUND does so implicitly and inadequately.
There is a strong need for negation as failure in the language; the  
question is, should it be endorsed
openly and honestly, or should the language pretend not to have it,  
but then introduce a hack to allow for its partial
support?

CONCLUSIONS

I have other objections to the SPARQL language, but these three are the
most serious, and the lack of
a declarative semantics is the most serious of the three.  The  
solution is to produce a model-theoretic
semantics, and to modify any aspects of SPARQL that are inconsistent  
with a declarative semantics.

SPARQL has had a long and somewhat painful evolution, owing in part  
to the immaturity of RDF.  It's
hard to design a language with an incomplete set of use cases.  In
choosing named graphs, the committee has chosen
an impoverished solution that satisfies only one aspect of  
provenance,  while major vendors are
taking a more enlightened approach, full quads, that supports all  
manner of provenance information.
In the long run, performance always wins out; quads are going to make  
named graphs a footnote.
My suggestion is to reintroduce the SOURCE operator into SPARQL.   
Just as OWL has degrees
of adoption, the named-graph RDF stores could support SOURCE with
constant parameters, while the
full-quad stores could additionally support variable arguments to  
SOURCE.

The UNBOUND operator should be dropped from the language.  Vendors  
can instead implement it
as a "computed predicate", and it's reasonable to define a standard
namespace for it, e.g.,
'sparql:unbound'.   I personally endorse introducing UNSAID, but I  
don't imagine that actually happening
soon.  (Note: the Seamark query language includes not only UNSAID, but
also IN and GROUPBY -- we
consider SQL a use case for what SPARQL should be.)

Cheers, Bob

Received on Thursday, 24 May 2007 14:29:42 UTC