- From: Bob MacGregor <bmacgregor@siderean.com>
- Date: Thu, 24 May 2007 07:29:15 -0700
- To: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
- Cc: public-rdf-dawg-comments@w3.org, eric@w3.org
PFPS has recently made an assessment of the SPARQL specification, and found it wanting. Additionally, I chatted with Pat Hayes and he confirmed some of my own assessments about the language. So I will take this occasion to reiterate my major objections to the SPARQL language. The most serious are:

- There is no model-theoretic definition of SPARQL semantics. This is not because there could not be one, but because the committee has chosen not to produce one.
- The notion of named graphs does not support a performant implementation of security. In contrast, a quad approach that treats the context "argument" as a first-class entity does admit an efficient implementation.
- The SPARQL UNBOUND operator is strictly less expressive than a negation-as-failure operator (contrary to what has been asserted by one of the prominent SPARQL proponents). Also, unlike the rest of the language, UNBOUND may not admit of a model-theoretic semantics (I don't claim to know whether it does or does not).

Below I will discuss each of these in turn.

MODEL-THEORETIC SEMANTICS

The current SPARQL semantics apparently derives from an algebraic specification that says what you get when you run a SPARQL query, rather than what the answer actually "means". The SPARQL spec is procedural rather than declarative, analogous to a programming language spec (like a definition of the Java semantics). If it were not possible to craft a declarative semantics, then a procedural semantics would be acceptable. But a declarative semantics is possible.

For most of the SPARQL language, a declarative semantics can be derived in straightforward fashion. The two constructs that pose the most difficulty are OPTIONAL and UNBOUND.

The semantics of OPTIONAL has been compared to that of the SQL outer join. This analogy is not well-founded. The SQL outer join is, as far as I can determine, a procedural operator.
What I mean by that is that the order of evaluation of SQL outer joins matters; if you permute two of them, the answer may change. This is not true of OPTIONAL, i.e., OPTIONAL is better-behaved. I produced an outline for a declarative semantics for OPTIONAL quite a while back (during the time when OPTIONAL semantics were being debated on the SPARQL mailing list). I cite this not because I claim that my semantics should be adopted, but because it demonstrates that a declarative semantics for OPTIONAL is feasible.

The UNBOUND operator may or may not admit of a declarative semantics. However, while OPTIONAL is a well-crafted operator, UNBOUND is a hack that the language would be better off without (see discussion below). Thus, my suggestion would be to drop it from the language, especially if a declarative semantics for it isn't possible.

QUADS AND SECURITY

At the (not-quite-concluded) Semantic Technology conference, Eric Monk and Kevin Smith presented an assessment of five different approaches to implementing security in an RDF system. I suggested a sixth at the conclusion of the talk. Two of the six are consistent with SPARQL; the others are not. One of these two relies on triple reification; the other relies on named graphs. These two each incur heavy penalties in terms of both space and performance. In other words, among the six, they are the least desirable.

At one point in SPARQL's evolution, the language introduced a SOURCE operator that allowed for a context argument that could be either a variable or a constant. The SOURCE construct effectively treats contexts as first-class entities. The currently adopted named graphs notion treats contexts as second-class objects. The SOURCE operator is consistent with a fully functional quad implementation; the named graph notion is much more limited. The principal advantage of the named graph notion is that it is only a small extension beyond the traditional RDF spec.
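
To make the constant-versus-variable distinction concrete, here is a minimal Python sketch of a quad pattern match in which the context is simply a fourth position. It does not reproduce the historical SOURCE syntax or any vendor's implementation; the sample data, the `match` and `visible` functions, and the `readable` access set are all hypothetical.

```python
from typing import Iterator, List, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)

# Hypothetical quad store; the fourth element is the context.
QUADS: List[Quad] = [
    ("alice", "salary", "100", "hr"),
    ("alice", "email", "alice@example.org", "public"),
    ("bob", "email", "bob@example.org", "public"),
]


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None, c: Optional[str] = None) -> Iterator[Quad]:
    """Yield quads matching the pattern; None acts as an unbound variable."""
    pattern = (s, p, o, c)
    for quad in QUADS:
        if all(want is None or want == got for want, got in zip(pattern, quad)):
            yield quad


def visible(readable: Set[str], **pattern: Optional[str]) -> List[Quad]:
    """Keep only matches whose context is in the caller's readable set."""
    return [q for q in match(**pattern) if q[3] in readable]


# Context given as a constant: everything asserted in "public".
public_quads = list(match(c="public"))

# Context left as a variable: which contexts say anything about alice?
alice_contexts = sorted({c for (_, _, _, c) in match(s="alice")})

# A security check expressed as a filter on the context position.
alice_public = visible({"public"}, s="alice")
```

In this sketch the context is queried and filtered exactly like the other three positions, which is what treating it as a first-class entity amounts to.
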
However, major commercial vendors are implementing full support for quads. Franz's AllegroGraph has a quad implementation (actually, they mentioned quints, but the fifth argument is internal), Kowari/Tucana implements full quads, and Siderean's Seamark Navigator (my own company) has full quads. The reason for this is that full quads enable performant implementations of provenance information and named graphs do not. Security is only one aspect of provenance; I cite it because any serious implementation of a triple/quad store will include a performant security component. AllegroGraph, Tucana, and Seamark all have security built in.

What we have here is a case where the serious commercial vendors, who care about performance, have chosen a direction different than the one adopted by SPARQL. My suggestion is to resurrect the SOURCE construct in SPARQL.

UNBOUND

The introduction of a negation-as-failure construct was considered and rejected by the SPARQL committee. Instead, they invented a hack, the UNBOUND operator. This is a mistake. UNBOUND is strictly less expressive than UNSAID (or whatever you may call the negation-as-failure operator). In Seamark, we implement a closed-world version of universal quantification using a double negation (e.g., there does not exist a value that does not have type X). This construct cannot be emulated using UNBOUND (at the conference, Pat verbally agreed with this claim).

A model-theoretic semantics can be crafted for UNSAID. There may or may not be one for UNBOUND. In any case, UNSAID explicitly endorses negation as failure, while UNBOUND does so implicitly and inadequately. There is a strong need for negation as failure in the language; the question is, should it be endorsed openly and honestly, or should the language pretend not to have it, but then introduce a hack to allow for its partial support?

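
The double-negation pattern mentioned above can be sketched in a few lines of Python. This is only an illustration under a closed-world assumption, not Seamark's implementation; the sample triples, the `unsaid` helper, and the `all_values_have_type` function are hypothetical names chosen for the example.

```python
# Hypothetical closed-world triple set: (subject, predicate, object).
TRIPLES = {
    ("box1", "contains", "a"),
    ("box1", "contains", "b"),
    ("box2", "contains", "a"),
    ("box2", "contains", "c"),
    ("a", "type", "X"),
    ("b", "type", "X"),
}


def unsaid(triple):
    """Negation as failure: true when the triple is absent from the store."""
    return triple not in TRIPLES


def all_values_have_type(subject, type_name):
    """Closed-world 'for all', written as a double negation.

    NOT EXISTS v such that (subject contains v) AND UNSAID(v type type_name).
    """
    return not any(
        unsaid((v, "type", type_name))
        for (s, p, v) in TRIPLES
        if s == subject and p == "contains"
    )


print(all_values_have_type("box1", "X"))  # True: a and b are both typed X
print(all_values_have_type("box2", "X"))  # False: c has no type X assertion
```

The outer `not any(...)` is the "there does not exist" and the inner `unsaid(...)` is the "does not have type X"; both negations are evaluated against what the store actually contains.
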

CONCLUSIONS

I have other objections to the SPARQL language, but these three are the most serious, and the lack of a declarative semantics is the most serious of the three. The solution is to produce a model-theoretic semantics, and to modify any aspects of SPARQL that are inconsistent with a declarative semantics.

SPARQL has had a long and somewhat painful evolution, owing in part to the immaturity of RDF. It's hard to design a language with an incomplete set of use cases. In choosing named graphs, SPARQL has chosen an impoverished solution that satisfies only one aspect of provenance, while major vendors are taking a more enlightened approach, full quads, that supports all manner of provenance information. In the long run, performance always wins out; quads are going to make named graphs a footnote. My suggestion is to reintroduce the SOURCE operator into SPARQL. Just as OWL has degrees of adoption, the named-graph RDF stores could support SOURCE with constant parameters, while the full-quad stores could additionally support variable arguments to SOURCE.

The UNBOUND operator should be dropped from the language. Vendors can instead implement it as a "computed predicate", and it's reasonable to define a standard namespace for it, e.g., 'sparql:unbound'. I personally endorse introducing UNSAID, but I don't imagine that actually happening soon. (Note: The Seamark query language includes not only UNSAID, but also IN and GROUPBY -- we consider SQL a use case for what SPARQL should be.)

Cheers,
Bob

Received on Thursday, 24 May 2007 14:29:42 UTC