Re: Applying the relational model to SPARQL from Andrew Newman on 2006-11-11 (public-rdf-dawg-comments@w3.org from November 2006)

From: Andrew Newman <andrewfnewman@gmail.com>
Date: Sat, 11 Nov 2006 22:26:23 +1000
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <2db5a5c40611110426h68154cecre429879837e662d4@mail.gmail.com>
On 11/10/06, Eric Prud'hommeaux <eric@w3.org> wrote:
> On Thu, Nov 09, 2006 at 05:52:25AM +1000, Andrew Newman wrote:
> ====Duplicates====
> You note that
> [[
> SQL is often seen as an implementation of the relational model even
> though it has numerous incompatibilities with the relational model
> such as bag instead of set semantics, column ordering, duplicates and
> handling of nulls [6].
> ]]
> and later state
> [[
> As both the RDF model and the relational model are both propositional
> and set-based it is likely that a compatible model for querying RDF
> can be provided.
> ]]
>
> Does the relational model you propose preserve duplicates?
> Specifically, are the ⋈(null-accepting join) and π (sorry, I may have
> the wrong codepoint there, using a greek lower case pi) defined to
> work on sets or bags?  Is this choice orthogonal to the rest of the
> calculus?
>

So the relational model inherently removes duplicates.  Relations are
sets of tuples and the equality of tuples is based on the
attribute/value pairs they contain.  So you can't have two tuples in a
relation (two rows in a table) with the same values.  They can differ
based on missing values though ({?v=f},  {?v=f, ?b=g} for example).

You can make the choice to have a model that uses bags rather than
sets.  From what I understand, commercial databases take SQL and
transform it into the relational constructs in order to do query
reordering and other optimisations.  They are limited to what they can
do because they have to keep track of duplicates.  It makes sense to
me to skip that step if you can and make use of as many optimisations
as possible.  Not worrying about duplicates makes it easier from both
a development and user perspective.

> §2.4 Duplicates implies that you favor bags. While I don't know all of
> Date's arguments against duplicates, I suspect that the use cases for
> aggregates motivate him to have duplicates in the algebra, just not in
> the report (guessing here). A practical implication of closure is that
> one should be able to implement the aggregate downstream of some
> report {SELECT ?type WHERE { ?employee a ?type }} , without burdening
> the querier to project in something distinct, like the employee Id.
> Since the DAWG has posponed aggregates [AG], users will need a bag
> semantics to be able to postprocess with their one aggregates.
>

I think the best thing the DAWG could do is remove DISTINCT and
implement aggregates without bag semantics.  Without binding the
result to something distinct, like an employee id (an inverse
functional property), I don't see how it makes sense to count
anything.

Date created the EXTEND and SUMMARIZE operators.  "Database In Depth"
page 95-99 provides some examples as how to implement aggregates
without bag semantics.

> ====NULLs====
> [[
> Cyganiak says, "The SPARQL model does not, for example, distinguish
> between an OPTIONAL variable that is unbound in some solutions, and a
> variable that is not used in the query at all." This result leads to
> confusion as to what an unbound result means.
> ]]
> I think this would be analogous to an SQL that allowed me to select
> non-existent columns and responded with NULLs. Are there use cases
> where this confusion matters?
>

The immediate problem I see is that you can't just look at the results
of the query and say "if ?v is found or ?v is not found".  The client
has to be aware what the original query was and then says "if ?v is
found or ?v is not found do or ?v in the query was not used".

> ====DEE and DUM====
> I interprated the SPARQL Optional table as specifying the results of
> row heading OPT column heading. Substituting in T, F and R:
>
>  ┏━━━┯━━━━━━━━┯━━━━━━━━┯━━━━━━━━━┓
>  ┃   │   T    │   F    │   R     ┃ T=DEE
>  ┃ T │ T⟕T=T  │ T⟕F=T  │ T⟕Ro=T  ┃ F=DUM
>  ┃ F │ F⟕T=F  │ F⟕F=F  │ F⟕Ro=F  ┃ R=non-optional relation
>  ┃ R │ R⟕T=Rr │ R⟕F=Rr │ R⟕Ro=Rr ┃ Ro=option relation
>  ┗━━━┷━━━━━━━━┷━━━━━━━━┷━━━━━━━━━┛ Rr=resulting relation
>
> Searching for corresponding queries, I get arrive at:
>
>  ┏━━━┯━━━━━━━━━━━━━━━━━━━━━━┯━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
>  ┃   │     T                │ F │           R                      ┃
>  ┃ T │ { OPT {} }           │   │ { OPT { ?sO ?pO ?oO } }          ┃
>  ┃ F │                      │   │                                  ┃
>  ┃ R │ { ?s ?p ?o OPT { } } │   │ { ?s ?p ?o OPT { ?sO ?pO ?oO } } ┃
>  ┗━━━┷━━━━━━━━━━━━━━━━━━━━━━┷━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
> Basically, I didn't figure out how to write queries that used F (DUM).
>

I'm not sure it is possible using SPARQL syntax of OPTIONAL.  In
SPARQL you could perhaps take the result of an ASK query and feed it
into another.  In iTQL we had something that was similar with nested
subqueries and an empty SELECT.

>
> ====Order Independent Joins====
> If the motivation for the full outer join was the freedom to optimize
> by re-ordering, I think we also have to look at the performance of a
> re-orderable full outer join vs. the non-reorderable left join.
>
> I am generally in favor of left joins over full outer joins for
> intuition and parallelism with SQL (which I suppose is the source of
> much of the "intuition"). I need to explore the subsumption scheme you
> outlined to see how it interacts with queris for the non-existance of
> a particular pattern (NAF is spelled OPTIONAL !BOUND in SPARQL).
>

I started off this year by trying to fix OPTIONAL.  Basically, I saw
the order dependent nature of it as a mistake.  I don't really feel
that a full outer join version of OPTIONAL is necessarily the right
answer either, I think it's better in some ways though.  My feeling is
that full outer join returns too many results but I don't really know
of a way to quantify this.

If you read some of Date he actually dislikes outer joins all
together.  One of the reasons he states is that it's hard to express
in a clear and easy to understand way.  His solution is to use EXTEND
and relation-valued attributes (much like the nested queries/results
in iTQL).  That would require a much bigger rethink and I doubt the
SPARQL group are likely to do this.
Received on Saturday, 11 November 2006 12:26:33 UTC