RE: Applying the relational model to SPARQL


I stopped commenting on SPARQL quite a while back, roughly around the time
when some newbie declared that a use case was needed to justify adding
sorting to SPARQL.  However, since I believe that SPARQL's faults lie
in not being close enough to SQL semantics, I thought I would lodge an
opposing view to the one espoused by Eric.

When I taught graduate courses in databases, I examined the SQL language,
and at the time criticised it in front of my students for some ambiguities,
mostly appearing in the linkage between GROUP-BY and HAVING clauses.  However,
since then, I've spent some time on language standards committees, notably the
KRLS and KIF committees, and learned more about what goes into designing
a good language.  The upshot is that both SQL and KIF (now Common Logic) are
remarkablyl well-designed languages, both in terms of syntax and semantics.

When the SQL2 spec came out with procedural outer-joins, I was surprised,
but when I investigated the alternatives, I discovered that there is no
declarative semantics for outer joins that makes any sense.  So the SQL
committee in fact did a good job once again; necessity dictated some kind
of outer join, while important commercial ones were inherently ambiguous.
So, they cleaned things up as best as could be.

That brings us to SPARQL.  SPARQL is a major disappointment.  The most
grievous error is the distinction between the WHERE and FILTER clauses.
The faceted navigation product that my company sells generates RDF
queries that cannot be expressed in SPARQL because they frequently
use an OR connective that includes both statements and filters within
the disjuncts.  In otherwords, the queries we routinely execute cannot
be processed by a SPARQL query engine.

Other faults:  

- Both SQL and KIF have recursive syntax; queries can nest
within queries.  SPARQL does not.

- SQL has closed world semantics, which has proven to be far more
useful than open-world semantics.  The PowerLoom language we designed
when I was at USC/ISI included closed-world operators that compensated
for the underlying open world semantics in the logic.

- The UNBOUND operator is inherently procedural in a fully-expressive
logic language.  In PowerLoom, we added a "Prolog-mode" (we didn't
call it that) when we used operators that couldn't be reordered by
the query optimizer.  The WHERE/FILTER blotch is inherently procedural,
which solves the problem, but in a bad way.

- Its been claimed that the (stunted) semantics of the SPARQL UNION operator
is not compatible with the OPTIONAL operator.  I haven't tried to verify that,
but I suspect its true.  A year or so ago I worked out a semantics that
provided a uniform semantics for OR (a non-stunted UNION), OPTIONAL, and
UNSAID.  I passed it by Pat Hayes, and he thought it looked consistent (but
I won't hold him to that).  The query processor in PowerLoom and the one 
I've written more recently both implement a semantics that is either
close or the same as that semantics.

- We found applications while using PowerLoom that benefitted with the
introduction of a null value.  Nulls are to be avoided in general, but
the combination of null and single-valuedness that routinely occurs in
SQL is vastly more efficient than the SPARQL OPTIONAL operator.

Other faults are sins of omission rather than commission:

- If my recollection SPARQL omits n-ary computed predicates (implementing
only n-ary functions).

- Its enormously useful to support a GROUP-BY operator.  Again, both
PowerLoom and our own SeamarkQL language have one, demonstrating that
its quite compatible with a logic language.

There are more omissions, but you get the point.

The primary movers of the SPARQL committee chose to believe that they
were inventing a brave new logic language for a universe of applications
markedly different than the relational database world that has been so
spectacularly successful for the last 25 years.  In fact the opposite is
true; it takes relatively few extensions to a true SQL dialect to achieve
a logic language much more appropriate and useful than SPARQL.

Bob MacGregor



-----Original Message-----
From: public-rdf-dawg-comments-request@w3.org [mailto:public-rdf-dawg-comments-request@w3.org] On Behalf Of Eric Prud'hommeaux
Sent: Thursday, November 09, 2006 10:38
To: Andrew Newman
Cc: public-rdf-dawg-comments@w3.org
Subject: Re: Applying the relational model to SPARQL


On Thu, Nov 09, 2006 at 05:52:25AM +1000, Andrew Newman wrote:
> 
> I'd like present to the DAWG public comments list my Honours thesis.
> It discusses a formal model, the relational model, to SPARQL.  It 
> builds on the work of by Cyganiak, Frasincar et al., Harris and 
> Shadbolt, Pérez et al. and others.  Hopefully, it's appropriate to 
> some of the current discussions.
> 
> It's available at (~500K):
> http://jrdf.sourceforge.net/RelationalBasedSPARQL.pdf


thank you very much for this work! it looks very useful to the DAWG and the community as a whole.

> In it I suggest that the current SPARQL specification is directly 
> influenced by implementation specifics such as SQL and not RDF.  It is 
> argued that the semantics of SQL is a poor match to the RDF data 
> model.  Examples of this mismatch include:
> * The existence of NULL (section 2.3).
> * UNION and other operations may or may not return duplicates (section 2.4).
> * Lack of Compositional Semantics (section 2.5).
> * Order dependent OPTIONAL (like SQL's left outer join) (section 2.6).


====Duplicates====
You note that
[[
SQL is often seen as an implementation of the relational model even though it has numerous incompatibilities with the relational model such as bag instead of set semantics, column ordering, duplicates and handling of nulls [6].
]]
and later state
[[
As both the RDF model and the relational model are both propositional and set-based it is likely that a compatible model for querying RDF can be provided.
]]

Does the relational model you propose preserve duplicates?
Specifically, are the ⋈(null-accepting join) and π (sorry, I may have the wrong codepoint there, using a greek lower case pi) defined to work on sets or bags?  Is this choice orthogonal to the rest of the calculus?

§2.4 Duplicates implies that you favor bags. While I don't know all of Date's arguments against duplicates, I suspect that the use cases for aggregates motivate him to have duplicates in the algebra, just not in the report (guessing here). A practical implication of closure is that one should be able to implement the aggregate downstream of some report {SELECT ?type WHERE { ?employee a ?type }} , without burdening the querier to project in something distinct, like the employee Id.
Since the DAWG has posponed aggregates [AG], users will need a bag semantics to be able to postprocess with their one aggregates.


====NULLs====
[[
Cyganiak says, “The SPARQL model does not, for example, distinguish between an OPTIONAL variable that is unbound in some solutions, and a variable that is not used in the query at all.” This result leads to confusion as to what an unbound result means.
]]
I think this would be analogous to an SQL that allowed me to select non-existent columns and responded with NULLs. Are there use cases where this confusion matters?


====DEE and DUM====
I interprated the SPARQL Optional table as specifying the results of row heading OPT column heading. Substituting in T, F and R:

 ┏━━━┯━━━━━━━━┯━━━━━━━━┯━━━━━━━━━┓
 ┃   │   T    │   F    │   R     ┃ T=DEE
 ┃ T │ T⟕T=T  │ T⟕F=T  │ T⟕Ro=T  ┃ F=DUM  ┃ F │ F⟕T=F  │ F⟕F=F  │ F⟕Ro=F  ┃ R=non-optional relation  ┃ R │ R⟕T=Rr │ R⟕F=Rr │ R⟕Ro=Rr ┃ Ro=option relation  ┗━━━┷━━━━━━━━┷━━━━━━━━┷━━━━━━━━━┛ Rr=resulting relation

Searching for corresponding queries, I get arrive at:

 ┏━━━┯━━━━━━━━━━━━━━━━━━━━━━┯━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃   │     T                │ F │           R                      ┃
 ┃ T │ { OPT {} }           │   │ { OPT { ?sO ?pO ?oO } }          ┃
 ┃ F │                      │   │                                  ┃
 ┃ R │ { ?s ?p ?o OPT { } } │   │ { ?s ?p ?o OPT { ?sO ?pO ?oO } } ┃
 ┗━━━┷━━━━━━━━━━━━━━━━━━━━━━┷━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Basically, I didn't figure out how to write queries that used F (DUM).


====Order Independent Joins====
If the motivation for the full outer join was the freedom to optimize by re-ordering, I think we also have to look at the performance of a re-orderable full outer join vs. the non-reorderable left join.

I am generally in favor of left joins over full outer joins for intuition and parallelism with SQL (which I suppose is the source of much of the "intuition"). I need to explore the subsumption scheme you outlined to see how it interacts with queris for the non-existance of a particular pattern (NAF is spelled OPTIONAL !BOUND in SPARQL).


> Outcomes presented include:
> * A way of mapping RDF and SPARQL operations to the relational model 
> (section 4).
> * Using tuple subsumption to implement UNION and OPTIONAL using 
> previous optimisation techniques (section 2.7) that is up to twice as 
> fast as an alternate implementation (using join, antjoin and union) 
> and up to 8 times faster than ARQ (section 4.5).
> * An order independent version of OPTIONAL using full outer join and 
> tuple subsumption (section 4.4).
> 
> Suggested future work includes:
> * Using SQL to implement tuple subsumption OPTIONAL and UNION  
> (section 5.3)
> * Alternative ways of implementing ASK and CONSTRUCT (section 5.1) 
> using the relational model as a basis.
> * Aggregate functions (section 5.1).
> * Other optimisation techniques if compositional semantics are chosen 
> (section 5.2).
> 
> The current code is only available through SF subversion:
> svn co https://svn.sourceforge.net/svnroot/jrdf jrdf

[AG] http://www.w3.org/2001/sw/DataAccess/issues#countAggregate

--
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
     +33.1.45.35.62.14
cell:       +33.6.73.84.87.26

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than email address distribution.

Received on Friday, 10 November 2006 16:58:52 UTC