Re: OPTIONALs and shared variables from Eric Prud'hommeaux on 2006-06-12 (public-rdf-dawg@w3.org from April to June 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 12 Jun 2006 22:35:20 +0200
To: Steve Harris <steve.harris@garlik.com>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <20060612203520.GA16448@w3.org>
On Thu, Jun 08, 2006 at 11:12:40AM +0100, Steve Harris wrote:
> 
> In the telecon yesterday I mentioned that allowing shared variables  
> introduced in OPTIONAL (and UNION) blocks prevented you from  
> implementing it in a conventional relational algebra engine. Example:
> 
> :a dc10:title "thing" .
> :a dc10:creator "John Smith" .
> :a dc:11:creator _:jb .
> _:jb rdf:value "Joe Bloggs" .
> 
> SELECT ?name
> WHERE {
>   ?a dc10:title "thing" .
>   OPTIONAL { ?a dc10:creator ?name }
>   OPTIONAL { ?a dc11:creator ?c .
>              ?c rdf:value ?name }
> }

This evaluation order of consecutive OPTIONALs is ordered by 5.1:
[[
The OPTIONAL keyword is left-associative
]]
so any bindings for ?name in the first OPTIONAL constrain the
second. Thus, it's equivilent to two consecutive left outer joins
and yeilds what you would expect from SQL (adding a SELECT)
(sorry, not swapped in on relational algebra notation):

SELECT T1.obj AS "?name", S0.var_c AS "?c"
  FROM T as T0
   LEFT OUTER JOIN T as T1 ON T1.subj=T0.subj AND T1.pred="dc10:creator"
   LEFT OUTER JOIN (
      SELECT T2.subj AS var_a, T2.obj AS var_c, T3.obj AS var_name
        FROM T as T2
        JOIN T as T3 ON T3.subj=T2.obj AND T3.pred="rdf:value"
       WHERE T2.pred="dc11:creator"
   ) AS S0 ON S0.var_a=T0.subj AND T1.obj IS NULL OR S0.var_name=T1.obj
 WHERE T0.pred = "dc10:title" AND T0.obj = "thing"

|    ?name     |  ?c   |
+--------------+-------+
| "John Smith" |       |

However, if you simply insert another triple (pattern) in the mix:,

  :a dc10:creator "John Smith" .
+ :a cc:license "steal this book" .
  :a dc:11:creator _:jb .

  OPTIONAL { ?a dc10:creator ?name }
+ ?a cc:license ?license .
  OPTIONAL { ?a dc11:creator ?c .

the text in 4.3 becomes relevent:
[[
There is no implied order of graph patterns within a Group Graph
Pattern. Any solution for the group graph pattern that can satisfy all
the graph patterns in the group is valid, independently of the order
that may be implied by the lexical order of the graph patterns in the
group.
]]

(In fact, I see 5.1 as contradicting this, but it comes later in the
spec and is pretty clearly a more specific case.)

Now this parsers as two patterns with OPTIONALs in them. "Any solution
... satisfy all the graph patterns" means that we have to include
everything where
  OPTIONAL { ?a dc10:creator ?name }
constrains
  OPTIONAL { ?a dc11:creator ?c .
             ?c rdf:value ?name }
and visa versa:.
|    ?name     |  ?c   |
+--------------+-------+
| "John Smith" |       |
| "Joe Bloggs" | _:jb1 |

This can be replicated in SQL with a union of the queries with the two
left outer joins every possible order (two orders):

SELECT S3.var_name AS "?name", S3.var_c AS "?c"
  FROM (
      SELECT S1.var_c AS var_c, T1.obj AS var_name
        FROM T as T0
        JOIN T as T7 ON T7.subj=T0.subj AND T7.pred="cc:license"
        LEFT OUTER JOIN T as T1 ON T1.subj=T0.subj AND T1.pred="dc10:creator"
        LEFT OUTER JOIN (
           SELECT T2.subj AS var_a, T2.obj AS var_c, T3.obj AS var_name
             FROM T as T2
             JOIN T as T3 ON T3.subj=T2.obj AND T3.pred="rdf:value"
            WHERE T2.pred="dc11:creator"
        ) AS S1 ON S1.var_a=T1.subj AND S1.var_name=T1.obj
       WHERE T0.pred = "dc10:title" AND T0.obj = "thing"
    UNION
      SELECT S2.var_c AS var_c, S2.var_name AS var_name
        FROM T as T0
        JOIN T as T7 ON T7.subj=T0.subj AND T7.pred="cc:license"
        LEFT OUTER JOIN (
           SELECT T5.subj AS var_a, T5.obj AS var_c, T6.obj AS var_name
             FROM T as T5
             JOIN T as T6 ON T6.subj=T5.obj AND T6.pred="rdf:value"
            WHERE T5.pred="dc11:creator"
        ) AS S2 ON S2.var_a=T0.subj
        LEFT OUTER JOIN T as T4 ON S2.var_a=T4.subj AND T4.pred="dc10:creator" AND T4.obj=S2.var_name
       WHERE T0.pred = "dc10:title" AND T0.obj = "thing"
   ) AS S3

(I tried to factor out T0 and T7, but was unable to make the FROM be
optional without making the whole UNION be optional and widening the
results set and making it a bit complicated.)

This case isn't so bad because there are only two OPTIONALs that might
bind ?name. If there were three, the UNION would include all 3!=6
possible orderings. In general, one must execute all the possible
orderings of all the OPTIONAL patterns that introduce any of the same
variables.


ALTERNATIVES:
The above SQL is an interpretation by the current spec. I see
intuitive interpreations of the query:

DISJOINT: the old definition for the spec said that ?name could not
show up in both of those OPTIONALs unless it was already bound by a
non-optional (and, i guess, only by a UNION if it was bound in all
sides of the UNION). Thus, the variables bound by any OPTIONAL
patterns were disjoint. This isn't too rough to detect, and preserves
blind re-ordering of the graph pattern.


LEXICAL: SQL OUTER JOINs are defined by the LEXICAL order order. Query
engines or proxy medlers are free to re-order so long as they are
clever enough to know that the resulting order will give equivilent
results.


COMPREHENSIVE: The current spec gives the most complete answer set but
is hard to implement and can be exponentially (factorially, actually)
more expensive to execute. Also, the current spec has the odd effect
that inserting that triple pattern in our earlier query drastically
changed the semantics.


> if there are no shared variables you can map this to:
> (excuse syntax, ASCII RA is difficult and it's been a while)
> 
> 1)  T SELECT pred=dc10:title AND obj=thing RENAME subj,a
> 2)  LJOIN T a=subj RENAME obj,name
> 3)  LJOIN T a=subj RENAME obj,c
> 4)  LJOIN T c=subj AND obj=name
> 
> where T is the table of triples, but if you do that you get:
> 
> 1)  a    c     name
>     :a   NULL  NULL
> 
> 2)  a    c     name
>     :a   NULL  "John Smith"
> 
> 3)  a    c     name
>     :a   NULL  "John Smith"
>     :a   :_jb  NULL
> 
> 4)  a    c     name
>     :a   NULL  "John Smith"
>     # obj = "John Smith" and obj = NULL both fail
> 
> What you have to do (as Souri? pointed out) is namespace the values  
> according to where they're bound, which is tricky, steps outside  
> relational algebra and makes SPARQL significantly harder to implement  
> that necessary.
> 
> In this simple example it is probably possible to implement it with  
> sub-projections or variable mapping and using a big RENAME on the  
> end, but in the general case, it's not.
> 
> My impression is that this scoped shared variable thing is mostly a  
> hack to get round the fact that we don't have SQLs COALESCE() and  
> SELECT expressions, but I think it's a bad choice to make the core  
> language much more complex to get round it.
> 
> The previous form of:
> 
> SELECT ?name1 ?name2
> WHERE {
>   ?a dc10:title "thing" .
>   OPTIONAL { ?a dc10:creator ?name1 }
>   OPTIONAL { ?a dc11:creator ?c .
>              ?c rdf:value ?name2 }
> }
> 
> didn't have these problems of course, and as a bonus cures the order  
> dependency + exponential complexity problem Eric described. The only  
> downside being that the end user had to do the coalescing themselves.
> 
> - Steve

-- 
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
cell:        +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Monday, 12 June 2006 20:44:33 UTC