3.6 Optional Match

Consider the following use case: "List everyone in my address book and 
their email addresses".  For the moment assume everyone in my address 
book has a name.  I'm only trying to demonstrate optional match on the 
email address field for the moment.  Imagine my address book contains 
the following graph:

_:ann  :name  "Ann"
_:bob  :name  "Bob"
_:bob  :email <bob@work.com>
_:carl :name  "Carl"
_:carl :email <carl@work.com>
_:card :email <carl@school.edu>

The symbols _:ann, _:bob, and _:carl are bnodes.  Optional match is 
motivated by my desire to see Ann listed as part of my address book 
even though she has no email account.  In a Squish-style language, the 
naive query:

   SELECT ?name ?email
   WHERE ( ?person :name ?name )
         ( ?person :email ?email )

would not match Ann.  The approaches a query language could take 
include:

(1) Make more than one query.

This is a "do nothing" option which we could use if 3.6 Optional Match 
isn't part of our query language.  We leave it to the application to 
ask first for the names in the address book:

   SELECT ?name
   WHERE (?person :name ?name)

getting the answer ?name = "Ann" or "Bob" or "Carl", and then for each 
of these answers we query again for the email addresses:

   SELECT ?email
   WHERE (?person :name "Ann")
         (?person :email ?email)

   SELECT ?email
   WHERE (?person :name "Bob")
         (?person :email ?email)

   SELECT ?email
   WHERE (?person :name "Carl")
         (?person :email ?email)

If my address book has to be accessed over a network, the multiple 
queries will introduce latency problems that grow as my address book 
does.  (Parenthetically, I don't consider

   SELECT ?name ?email
   WHERE (?person :name ?name)
         (?person :email ?email)

to be equivalent to the three secondary queries.  I don't know Bob's 
email address after making this query; I'd have to search through the 
results looking for the ?name "Bob" myself to discover it.)


(2) Outer join

This is the approach taken by SeRQL[1] and BRQL[2], and well-documented 
because of its use in SQL.

   SELECT ?name ?email
   WHERE (?person :name ?name)
         OPTIONALLY (?person :email ?email)

As a logical variable binding expression, this would produce

   (?name = "Ann") or
   (?name = "Bob" and ?email = <bob@work.com>) or
   (?name = "Carl" and ?email = <carl@work.com>) or
   (?name = "Carl" and ?email = <carl@school.com)

The fact that the first disjunction term (?name = "Ann") is independent 
of the value of the ?email variable means that this expression doesn't 
fit perfectly into a tabular form, and must be padded with a 
metasyntactic "null" value.  In this case the null is quite 
well-defined to mean "independent of the column variable".

     ?name    ?email
   +--------+-------------------+
   | "Ann"  | (null)            |
   +--------+-------------------+
   | "Bob"  | <bob@work.com>    |
   +--------+-------------------+
   | "Carl" | <carl@work.com>   |
   +--------+-------------------+
   | "Carl" | <carl@school.edu> |
   +--------+-------------------+


(3) Embedded queries

This is basically the approach of (1), except with the secondary 
queries included with the primary query.  It's used by iTQL[3], and 
presuming that the XQuery "return" clause permits further queries to 
embedded within it, I believe it can be used by the XQuery-based 
languages REX and XsRQL.  (Rob, Howard, someone check me on this?)

   SELECT ?name ?emails
   WHERE (?person :name ?name)
   THEN ?emails = (SELECT ?email
                   WHERE (?person :email ?email))

This produces the logical variable binding

   (?name = "Ann") or
   (?name = "Bob" and ?email = <bob@work.com>) or
   (?name = "Carl" and (?email = <carl@work.com> or ?email = 
<carl@school.edu>))

and the tabular form

     ?name    ?emails
   +--------+-----------------------+
   | "Ann"  |   ?email              |
   |        | +--------+            |
   +--------+-----------------------+
   | "Bob"  |   ?email              |
   |        | +----------------+    |
   |        | | <bob@work.com> |    |
   |        | +----------------+    |
   +--------+-----------------------+
   | "Carl" |   ?email              |
   |        | +-------------------+ |
   |        | | <carl@work.com>   | |
   |        | +-------------------+ |
   |        | | <carl@school.edu> | |
   |        | +-------------------+ |
   +--------+-----------------------+

The ?emails column takes tabular rather than atomic values, and Ann's 
list of email addresses is a table with zero rows rather than a "null".


The variable binding expressions for the outer join and embedded query 
approaches are (as one would hope!) equivalent and can be converted 
from one to the other by the distributive law (A and (B or C)) <-> ((A 
and B) or (A and C)).  Substituting leftwards results in the embedded 
query form, reducing the number of terms so that, for example, "Carl" 
only appears once.  Substituting rightwards results in the outer join 
form, which is a simpler two-level sum of products (i.e. no embedded 
tables) but at the cost of duplicate fields (like "Carl") and "null" 
entries.


(And to remove the simplifying assumption, now imagine that I don't 
know Carl's name, and that Bob also goes by the name "Robert".  Now you 
*really* need optional match to make any sense of things.)


[1] http://www.openrdf.org/doc/users/ch05.html#d0e1031
[2] http://jena.hpl.hp.com/~afs/BRQL.html#Triple_Matching_Features
[3] http://www.kowari.org/193.htm

Received on Tuesday, 6 July 2004 03:10:23 UTC