RE: 3.6 Optional Match from Seaborne, Andy on 2004-07-06 (public-rdf-dawg@w3.org from July to September 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 6 Jul 2004 11:00:27 +0100
To: Simon Raboczi <raboczi@tucanatech.com>
Cc: public-rdf-dawg@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E8080361629C@0-mail-br1.hpl.hp.com>
> (And to remove the simplifying assumption, now imagine that I don't
> know Carl's name, and that Bob also goes by the name "Robert".  Now you
> *really* need optional match to make any sense of things.)

+1

RDF is semi-structured in the true missing data sense.

> (2) Outer join
> 
> This is the approach taken by SeRQL[1] and BRQL[2], and well-documented
> because of its use in SQL.
. . . .
> (3) Embedded queries
> 
> This is basically the approach of (1), except with the secondary
> queries included with the primary query.

Its not quite as you describe for BRQL, (and, I understand, SeRQL as well).
I'll concentrate on BRQL/Jena as I know that better.

There are two issues here: how results are presented and how the internal
query engine goes about executing the request.

Internally, BRQL/Jena executes something closer to your subqueries.  It
builds a datastructure which shares the earlier parts of the query
execution.

It presents the results in single (unnested) table with variable number of
columns per row.  There are no nulls.  Variables that are not bound are
returned as "not present" rather than a "null" value in the query execution
space.  (Although, in Java, this is the language 'null' as per convention.
Throwing an exception would be unhelpful.)


Executing the example query on the example data:

Printing the raw results object you can see the nesting.

( ?person = _:b0 ) ( ?name = "Bob" )
    ( ?email = <bob@work.com> )
( ?person = _:b1 ) ( ?name = "Carl" )
    ( ?email = <carl@work.com> )
    ( ?email = <carl@school.edu> )
( ?person = _:b2 ) ( ?name = "Ann" )

The XML result set is similar (but longer).

Presented as a flat table, it is the flatten and missing columns are marked
"<<unset>>".  This is done in the code that creates the text output, not the
query engine.

------------------------------
| name   | email             |
==============================
| "Ann"  | <<unset>>         |
| "Carl" | <carl@school.edu> |
| "Carl" | <carl@work.com>   |
| "Bob"  | <bob@work.com>    |
------------------------------

To me, the nested form is more in keeping with RDF as it records missing
information with missing bindings.

The fact that even on a relational database, BRQL would get this structure
is an artifact of the implementation not pushing all the query to the DB at
once but having a direct implementation of the query langauge building on
the existing graph matcher.

I think we should be aim to be neutral to implementation technique (nested
tables vs outer join) and make the DAWG-QL language declarative so it is not
forcing one implementation or the other.  It isn't trivial to turn a flat
table into the nested form but it can be done (helps somewhat if the table
is clustered) - turning the nested form into a flat table is trivial.

For an SQL database, using outer join would mean one JDBC call, making
consistency easier and giving the database server the most to optimize.  If
the required presented just needs this, the query processor should be able
to do it.

	Andy

For reference:

Query used:
----------------------
SELECT ?name ?email
WHERE
    (?person ns:name ?name)
OPTIONAL
    (?person ns:email ?email)
USING ns FOR <http://example.org/ns#>
----------------------

Data:
----------------------
@prefix  : <http://example.org/ns#> .

 _:ann  :name  "Ann" .
 _:bob  :name  "Bob" .
 _:bob  :email <bob@work.com> .
 _:carl :name  "Carl" .
 _:carl :email <carl@work.com> .
 _:carl :email <carl@school.edu> . # Fixed!
----------------------

Query system: BRQL-0.4.

-------- Original Message --------
> From: Simon Raboczi <>
> Date: 6 July 2004 08:10
> 
> Consider the following use case: "List everyone in my address book and
> their email addresses".  For the moment assume everyone in my address
> book has a name.  I'm only trying to demonstrate optional match on the
> email address field for the moment.  Imagine my address book contains
> the following graph:
> 
> _:ann  :name  "Ann"
> _:bob  :name  "Bob"
> _:bob  :email <bob@work.com>
> _:carl :name  "Carl"
> _:carl :email <carl@work.com>
> _:card :email <carl@school.edu>
> 
> The symbols _:ann, _:bob, and _:carl are bnodes.  Optional match is
> motivated by my desire to see Ann listed as part of my address book
> even though she has no email account.  In a Squish-style language, the
> naive query:
> 
>    SELECT ?name ?email
>    WHERE ( ?person :name ?name )
>          ( ?person :email ?email )
> 
> would not match Ann.  The approaches a query language could take
> include:
> 
> (1) Make more than one query.
> 
> This is a "do nothing" option which we could use if 3.6 Optional Match
> isn't part of our query language.  We leave it to the application to
> ask first for the names in the address book:
> 
>    SELECT ?name
>    WHERE (?person :name ?name)
> 
> getting the answer ?name = "Ann" or "Bob" or "Carl", and then for each
> of these answers we query again for the email addresses:
> 
>    SELECT ?email
>    WHERE (?person :name "Ann")
>          (?person :email ?email)
> 
>    SELECT ?email
>    WHERE (?person :name "Bob")
>          (?person :email ?email)
> 
>    SELECT ?email
>    WHERE (?person :name "Carl")
>          (?person :email ?email)
> 
> If my address book has to be accessed over a network, the multiple
> queries will introduce latency problems that grow as my address book
> does.  (Parenthetically, I don't consider
> 
>    SELECT ?name ?email
>    WHERE (?person :name ?name)
>          (?person :email ?email)
> 
> to be equivalent to the three secondary queries.  I don't know Bob's
> email address after making this query; I'd have to search through the
> results looking for the ?name "Bob" myself to discover it.)
> 
> 
> (2) Outer join
> 
> This is the approach taken by SeRQL[1] and BRQL[2], and well-documented
> because of its use in SQL.
> 
>    SELECT ?name ?email
>    WHERE (?person :name ?name)
>          OPTIONALLY (?person :email ?email)
> 
> As a logical variable binding expression, this would produce
> 
>    (?name = "Ann") or
>    (?name = "Bob" and ?email = <bob@work.com>) or
>    (?name = "Carl" and ?email = <carl@work.com>) or
>    (?name = "Carl" and ?email = <carl@school.com)
> 
> The fact that the first disjunction term (?name = "Ann") is independent
> of the value of the ?email variable means that this expression doesn't
> fit perfectly into a tabular form, and must be padded with a
> metasyntactic "null" value.  In this case the null is quite
> well-defined to mean "independent of the column variable".
> 
>      ?name    ?email
>    +--------+-------------------+
>    | "Ann"  | (null)            |
>    +--------+-------------------+
>    | "Bob"  | <bob@work.com>    |
>    +--------+-------------------+
>    | "Carl" | <carl@work.com>   |
>    +--------+-------------------+
>    | "Carl" | <carl@school.edu> |
>    +--------+-------------------+
> 
> 
> (3) Embedded queries
> 
> This is basically the approach of (1), except with the secondary
> queries included with the primary query.  It's used by iTQL[3], and
> presuming that the XQuery "return" clause permits further queries to
> embedded within it, I believe it can be used by the XQuery-based
> languages REX and XsRQL.  (Rob, Howard, someone check me on this?)
> 
>    SELECT ?name ?emails
>    WHERE (?person :name ?name)
>    THEN ?emails = (SELECT ?email
>                    WHERE (?person :email ?email))
> 
> This produces the logical variable binding
> 
>    (?name = "Ann") or
>    (?name = "Bob" and ?email = <bob@work.com>) or
>    (?name = "Carl" and (?email = <carl@work.com> or ?email =
> <carl@school.edu>))
> 
> and the tabular form
> 
>      ?name    ?emails
>    +--------+-----------------------+
>    | "Ann"  |   ?email              |
>    |        | +--------+            |
>    +--------+-----------------------+
>    | "Bob"  |   ?email              |
>    |        | +----------------+    |
>    |        | | <bob@work.com> |    |
>    |        | +----------------+    |
>    +--------+-----------------------+
>    | "Carl" |   ?email              |
>    |        | +-------------------+ |
>    |        | | <carl@work.com>   | |
>    |        | +-------------------+ |
>    |        | | <carl@school.edu> | |
>    |        | +-------------------+ |
>    +--------+-----------------------+
> 
> The ?emails column takes tabular rather than atomic values, and Ann's
> list of email addresses is a table with zero rows rather than a "null".
> 
> 
> The variable binding expressions for the outer join and embedded query
> approaches are (as one would hope!) equivalent and can be converted
> from one to the other by the distributive law (A and (B or C)) <-> ((A
> and B) or (A and C)).  Substituting leftwards results in the embedded
> query form, reducing the number of terms so that, for example, "Carl"
> only appears once.  Substituting rightwards results in the outer join
> form, which is a simpler two-level sum of products (i.e. no embedded
> tables) but at the cost of duplicate fields (like "Carl") and "null"
> entries.
> 
> 
> (And to remove the simplifying assumption, now imagine that I don't
> know Carl's name, and that Bob also goes by the name "Robert".  Now you
> *really* need optional match to make any sense of things.)
> 
> 
> [1] http://www.openrdf.org/doc/users/ch05.html#d0e1031
> [2] http://jena.hpl.hp.com/~afs/BRQL.html#Triple_Matching_Features
> [3] http://www.kowari.org/193.htm
Received on Tuesday, 6 July 2004 06:01:04 UTC