XQuery syntax for BRQL semantics

We've made quite a bit of progress on the kinds of features we'll have
in our query language, and we have a strawman that demonstrates how
they all come together.

While custom-designed languages are always very appealing because they
are tailored to your current use cases, they cause huge migration
problems. BRQL/RDQL certainly has its strengths, but the syntax is
completely incompatible with every query language with a substantial
user base. I feel that the superficial similarity to SQL is more of a
hindrance than a help, because the two languages are in fact totally
different and require radically different mindsets for users to be
able to understand queries effectively.

I still think that some form of XQuery compatibility would be invaluable
to further the real-world use of RDF. I propose that we adopt a syntax
which is fully compatible with XQuery, in that every DAWG query is in
fact a valid XQuery. Further, I propose that results of DAWG queries
should not violate the semantics of the XQuery constructs those queries
include.

This certainly wouldn't require any DAWG implementation to support
XQuery in general. We would be free to adopt a language as
limited and easy-to-implement as desired. All it would mean is that the
syntax we use is a subset of the XQuery syntax. It turns out that this
actually isn't so hard. XQuery is tremendously expressive in general (as
has been noted, it's Turing-equivalent), so it's certainly possible to
translate any BRQL/RDQL-style query into XQuery syntax. What's more, the
transformation is generally very straightforward.

Looking at it from the XQuery point of view, you somehow need to get
"RDF processing" functionality. XQuery already includes a feature for
connectivity to such extended processing via "external functions":
http://www.w3.org/TR/xquery/#FunctionDeclns. We could make these
functions as complex as we like (technically, a single "doQuery"
function which takes a string in BRQL syntax would fill the bill), but
good design dictates that functions be as simple as possible. The simple
language I suggest below adds only two simple external functions.

There's a full grammar for the particular fragment of XQuery I've chosen
at the end of this message, but it's probably easiest to show the
transformation from BRQL syntax to XQuery syntax by example, so let's go
through the BRQL spec (at version 1.52 as I write this;
http://www.w3.org/2001/sw/DataAccess/rq23/) substituting an
XQuery-compatible syntax for the existing syntax:

2.1

We start with simple SELECT...WHERE queries:

SELECT ?title
WHERE  { <http://example.org/book/book1>
<http://purl.org/dc/elements/1.1/title> ?title . }

result: ?title = "BRQL Tutorial"


This example actually already demonstrates one of the minor problems
with BRQL as it is: what's listed is NOT actually the result of the
query. What's listed is a string representing a data structure that
programmers must traverse to find the result.
The XQuery examples will use the W3C-endorsed XML syntax to encode the
structure of their result. That doesn't mean that the results need to
"be" XML (or "just" XML), only that where the result has structure I'll
be using XML to write it down instead of the kind of proprietary ad-hoc
syntax currently used in the BRQL spec to encode structure.

The above query would be written in the XQuery-compatible syntax as:

for $title in dawg:anything()
where dawg:related(http://example.org/book/book1,
http://purl.org/dc/elements/1.1/title, $title)
return {$title}

to return the result: "BRQL Tutorial"
(In this case, we didn't include any "structure", so it doesn't use any
XML tags.)


The BRQL spec then goes on to introduce namespace prefixes:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE  { <http://example.org/book/book1> dc:title ?title . }  


XQuery already has a well-defined system for working with such
namespaces: http://www.w3.org/TR/xquery/#id-namespace-decls. The above
query can be written:

declare namespace dc="http://purl.org/dc/elements/1.1/";
for $title in dawg:anything()
where dawg:related(http://example.org/book/book1, dc:title, $title)
return {$title}

BRQL offers yet more syntax for declaring an empty namespace prefix:

PREFIX  dc: <http://purl.org/dc/elements/1.1/>
PREFIX  : <http://example.org/book/>
SELECT ?title
WHERE   { :book1  dc:title  ?title . }

XQuery already includes the same functionality:

declare namespace dc="http://purl.org/dc/elements/1.1/";
declare default element namespace "http://example.org/book/";
for $title in dawg:anything()
where dawg:related(book1, dc:title, $title)
return {$title}

2.2

This section then attempts to explain "triple patterns" and the like.
Frankly, although I know what we're getting at, I think it's more an
artifact of our own roundabout path in our requirement-gathering that
we're even talking about things like that. Users understand iteration
and boolean predicates, and it would make a lot more sense to talk in
those terms. What BRQL calls a "triple pattern" is really just a call to
a function 'dawg:related' which takes three parameters (the elements of
an RDF triple) and returns true if that triple exists and false if it
does not.
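
To make the "iteration plus boolean predicate" reading concrete, here is a minimal sketch in Python rather than XQuery. It is an illustration only: the in-memory TRIPLES set is hypothetical, and anything() is assumed to enumerate just the terms that occur in the graph, which may be narrower than set A in the BRQL spec.

```python
# Hypothetical in-memory graph: a set of (subject, property, object) triples.
TRIPLES = {
    ("http://example.org/book/book1",
     "http://purl.org/dc/elements/1.1/title",
     "BRQL Tutorial"),
}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return sorted({term for triple in TRIPLES for term in triple})

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

# for $title in dawg:anything()
# where dawg:related(<book1>, <dc:title>, $title)
# return {$title}
titles = [
    title
    for title in anything()
    if related("http://example.org/book/book1",
               "http://purl.org/dc/elements/1.1/title",
               title)
]
print(titles)  # ['BRQL Tutorial']
```

The "for" clause becomes the loop, the "where" clause becomes the filter, and the "return" clause becomes the value collected for each binding.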

The sample query:

SELECT *
WHERE { ?x ?x ?v }

is written in XQuery syntax as:

for $x in dawg:anything(), $v in dawg:anything()
where dawg:related($x, $x, $v)
return {$x, $v}

2.3

Conjunction in BRQL is encoded using curly braces and dots:

SELECT ?mbox
PREFIX foaf:   <http://xmlns.com/foaf/0.1/> 
WHERE
  { ?x foaf:name "Johnny Lee Outlaw" .
    ?x foaf:mbox  ?mbox . }

While XQuery syntax uses the keyword "and":

declare namespace foaf="http://xmlns.com/foaf/0.1/";
for $x in dawg:anything(), $mbox in dawg:anything()
where dawg:related($x, foaf:name, "Johnny Lee Outlaw")
  and dawg:related($x, foaf:mbox, $mbox)
return {$mbox}

Again, talking about a conjunction of two boolean predicates makes a lot
more sense to me than making up new notions like "graph pattern". Only
RDF die-hards are interested in such concepts.
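
As a rough illustration of how mechanical the rewriting is, the sketch below (Python, not part of any spec) turns a conjunction of BRQL-style triple patterns into the for/where/return text used in this message. The function name to_xquery and its parameters are hypothetical, and the ";" after each declaration follows the Prolog rule in the grammar at the end of this message.

```python
def to_xquery(select, patterns, prefixes=None):
    """Render a conjunction of BRQL-style triple patterns as XQuery-style text.

    select   -- names of the variables to return, without the leading '?'
    patterns -- list of (s, p, o) strings; terms starting with '?' are variables
    prefixes -- optional mapping of prefix -> namespace URI
    """
    lines = [f'declare namespace {prefix}="{uri}";'
             for prefix, uri in (prefixes or {}).items()]

    # Collect every variable in order of first appearance.
    variables = []
    for pattern in patterns:
        for term in pattern:
            if term.startswith("?") and term[1:] not in variables:
                variables.append(term[1:])

    def render(term):
        # '?x' becomes '$x'; every other term is copied through unchanged.
        return "$" + term[1:] if term.startswith("?") else term

    lines.append("for " + ", ".join(f"${v} in dawg:anything()" for v in variables))
    calls = ["dawg:related(" + ", ".join(render(t) for t in p) + ")"
             for p in patterns]
    lines.append("where " + "\n  and ".join(calls))
    lines.append("return " + ", ".join("{$" + v + "}" for v in select))
    return "\n".join(lines)

query = to_xquery(
    select=["mbox"],
    patterns=[("?x", "foaf:name", '"Johnny Lee Outlaw"'),
              ("?x", "foaf:mbox", "?mbox")],
    prefixes={"foaf": "http://xmlns.com/foaf/0.1/"},
)
print(query)
```

Every variable, returned or not, is declared in the "for" clause, which matches the explicit-declaration style of the examples here.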

2.4

The result-formatting problem begins to become apparent in BRQL:

SELECT ?name, ?mbox
WHERE
  (?x foaf:name ?name)
  (?x foaf:mbox ?mbox)

?name = "Johnny Lee Outlaw" , ?mbox = <mailto:jlow@example.com>
?name = "Peter Goodguy"     , ?mbox = <mailto:peter@example.org>

XQuery syntax makes it quite straightforward to add as much, or as
little, structure as you like to a result:

for $x in dawg:anything(), $name in dawg:anything(),
    $mbox in dawg:anything()
where dawg:related($x, foaf:name, $name)
  and dawg:related($x, foaf:mbox, $mbox)
return <x><name>{$name}</name><mbox>{$mbox}</mbox></x>

<x><name>Johnny Lee Outlaw</name><mbox>mailto:jlow@example.com</mbox></x>
<x><name>Peter Goodguy</name><mbox>mailto:peter@example.org</mbox></x>

3

In my opinion, one of the biggest limitations of BRQL is that it
requires a completely new language and model for managing datatypes.
This is a very big deal since, unlike the homogeneous columns of
relational data, in RDF you never know just which datatypes a variable
may bind to:

SELECT  ?title ?price
PREFIX  dc:  <http://purl.org/dc/elements/1.1/>
PREFIX  ns:  <http://example.org/ns#> 
WHERE   { ?x dc:title ?title .  ?x ns:price ?price . ?price < 30 }

?title = "The Semantic Web"  ,  ?price = 23

The issues here are obvious. What if price isn't an integer? When will
the comparison return 'true'? When a user program needs to process this
data, how will it know that price is an integer? There are lots of
issues still to work out.

One of the most time-consuming aspects of XQuery standardization was
specifying exactly what all the semantics were when datatypes
interacted. Automatic casting and type conversions were defined, and a
standard library of datatype predicates developed. In XQuery syntax
there is no ambiguity, because the semantics are well-defined:

declare namespace dc="http://purl.org/dc/elements/1.1/";
declare namespace ns="http://example.org/ns#";
for $x in dawg:anything(), $title in dawg:anything(),
    $price in dawg:anything()
where dawg:related($x, dc:title, $title)
  and dawg:related($x, ns:price, $price)
  and $price < 30
return <book><title>{$title}</title><price>{$price}</price></book>

result: <book><title>The Semantic Web</title><price>23</price></book>

We certainly don't have to support *all* the datatype predicates
available in standard XQuery implementations, but we at least have the
freedom of choosing a subset of them without worrying about questionable
semantics.

4.1

Optionals are a tricky subject. I still don't like the way we're doing
it, particularly the fact that it's the triple (which sits in a WHERE
clause and thus seems like a predicate) which determines what needs to
be bound and what doesn't, instead of the variable declaration itself.

SELECT ?name ?mbox
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
WHERE  { ?x foaf:name  ?name . OPTIONAL { ?x  foaf:mbox  ?mbox } }

?name = "Alice" , ?mbox = <mailto:alice@work.example> 
?name = "Bob" 

This query also demonstrates the problem with result formatting: unlike
in SQL, results aren't simple tables; they've got a lot more structure.
Just getting programmers used to standard APIs for traversing
rectangular tables was hard enough.

It makes a lot more sense to me that we simply allow nesting of queries.
The semantics are much clearer:

for $x in dawg:anything(), $name in dawg:anything()
where dawg:related($x, foaf:name, $name)
return <person name="{$name}">
 {for $mbox in dawg:anything()
  where dawg:related($x, foaf:mbox, $mbox)
  return <mbox>{$mbox}</mbox>}
 </person>

result:
<person name="Alice"><mbox>mailto:alice@work.example</mbox></person>
<person name="Bob"></person>

With more than one OPTIONAL:

SELECT ?name ?mbox ?hpage
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
WHERE  { ?x foaf:name  ?name . 
         OPTIONAL { ?x  foaf:mbox      ?mbox } .
         OPTIONAL { ?x  foaf:homepage  ?hpage } }

?name = "Alice" , ?hpage = <http://work.example.org/alice/>
?name = "Bob"   , ?mbox = <mailto:bob@work.example>

becomes:

for $x in dawg:anything(), $name in dawg:anything()
where dawg:related($x, foaf:name, $name)
return <person name="{$name}">
 {for $mbox in dawg:anything()
  where dawg:related($x, foaf:mbox, $mbox)
  return <mbox>{$mbox}</mbox>}
 {for $hpage in dawg:anything()
  where dawg:related($x, foaf:homepage, $hpage)
  return <hpage>{$hpage}</hpage>}
 </person>

I still don't think it's ideal, but it's consistent and it's more like
an extension of the existing semantics than coming up with whole new
language features.
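
The nesting idea can be sketched outside XQuery as well. The Python below is only an illustration: the TRIPLES data mirrors the Alice/Bob example above, the blank-node labels "_:a" and "_:b" are made up, and anything() is assumed to enumerate just the terms present in the graph.

```python
# Hypothetical graph matching the Alice/Bob example; "_:a" and "_:b" are made up.
TRIPLES = {
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@work.example"),
    ("_:b", "foaf:name", "Bob"),
}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return sorted({term for triple in TRIPLES for term in triple})

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

# The outer query binds $x and $name; the inner query runs once per outer
# binding and may legitimately produce nothing, which is the "optional" part.
people = [
    {"name": name,
     "mbox": [m for m in anything() if related(x, "foaf:mbox", m)]}
    for x in anything()
    for name in anything()
    if related(x, "foaf:name", name)
]
print(people)
```

Bob still appears in the result, with an empty list of mailboxes, because the inner query is evaluated inside the "return" rather than being part of the outer "where" condition.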

5

This section of the BRQL spec just addresses some very weird syntax that
confuses the hell out of me. Targeting a language at the N3 community
seems like a great way to avoid general adoption...

6

Although I still don't think we necessarily need to address source
selection in this version of the language, there's a very simple
approach in the XQuery syntax. Each of our new external functions can
simply take an extra argument which identifies the RDF graph(s) to be
queried. For example, to get a list of all of Rob's girlfriends from his
little black book:

for $x in dawg:anything("http://v.cx/littleblackbook.rdf")
where dawg:related(Rob, hasGirlfriend, $x,
"http://v.cx/littleblackbook.rdf")
return $x

result: (empty)

Sequences are first-class types in XQuery, so it's quite trivial to extend
this to querying multiple RDF sources (and aggregating them) by passing
multiple sources in this argument.
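
A minimal sketch of the extra source argument, again in Python and purely illustrative: the graph URLs, the triples, and the helper triples_in are all hypothetical. Passing a list of sources is assumed to mean "query the merge of those graphs".

```python
# Hypothetical named graphs; URLs and triples are made up for illustration.
GRAPHS = {
    "http://example.org/work.rdf": {("ex:rob", "ex:worksFor", "ex:NI")},
    "http://example.org/friends.rdf": {("ex:rob", "ex:knows", "ex:alice")},
}

def triples_in(sources):
    """Merge the triples of one source (a string) or several (a list)."""
    if isinstance(sources, str):
        sources = [sources]
    merged = set()
    for source in sources:
        merged |= GRAPHS[source]
    return merged

def anything(sources):
    """Stand-in for dawg:anything(source): terms occurring in those graphs."""
    return sorted({term for triple in triples_in(sources) for term in triple})

def related(s, p, o, sources):
    """Stand-in for the four-argument dawg:related()."""
    return (s, p, o) in triples_in(sources)

def known_by_rob(sources):
    return [x for x in anything(sources)
            if related("ex:rob", "ex:knows", x, sources)]

one = known_by_rob("http://example.org/work.rdf")
both = known_by_rob(["http://example.org/work.rdf",
                     "http://example.org/friends.rdf"])
print(one, both)  # [] ['ex:alice']
```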

7

The BRQL 'or' example actually skips the simple case and jumps straight
to a slightly more complex one. A very simple example would be:

SELECT ?channel ?creator
PREFIX  rss:  <http://purl.org/rss/1.0/>
PREFIX  dc0:  <http://purl.org/dc/elements/1.0/>
PREFIX  dc1:  <http://purl.org/dc/elements/1.1/>
WHERE { ?channel rdf:type    rss:channel
        { ?channel dc0:creator ?creator } OR 
        { ?channel dc1:creator ?creator } }

The XQuery translation is obvious:

for $channel in dawg:anything(), $creator in dawg:anything()
where dawg:related($channel, rdf:type, rss:channel)
  and (dawg:related($channel, dc0:creator, $creator)
       or dawg:related($channel, dc1:creator, $creator))
return {$channel}, {$creator}

The actual example is more complex, but it's not (just) the disjunction
that causes the complexity:

SELECT ?channel ?creator
PREFIX  rss:  <http://purl.org/rss/1.0/>
PREFIX  dc0:  <http://purl.org/dc/elements/1.0/>
PREFIX  dc1:  <http://purl.org/dc/elements/1.1/>
PREFIX  pim:  <http://www.w3.org/2000/10/swap/pim/contact#>
WHERE { ?channel rdf:type    rss:channel
        { ?channel dc0:creator ?creator } OR 
        { ?channel dc1:creator ?x .
          ?x    pim:given   ?creator } OR 
        { ?channel dc1:creator ?creator } }

This comes back to another oddity in BRQL: new variables can be declared
in the WHERE clause. It's not entirely obvious what the meaning of this
is: do all possible Xs need to meet the condition, or just some? If two
different Xs satisfy the condition, should we return two answers? This
is a major departure from SQL, where the actual things that you're
binding (the rows of the tables) are explicitly declared.

I personally think explicit declaration is much easier to understand
(and I've written all my examples that way), but adding these "extra"
variables doesn't complicate the XQuery syntax much (we can use
"quantified expressions":
http://www.w3.org/TR/xquery/#id-quantified-expressions). What's more,
it's very clear in XQuery just what the semantics of such variables are,
because they're clearly scoped:

for $channel in dawg:anything(), $creator in dawg:anything()
where dawg:related($channel, rdf:type, rss:channel)
  and (dawg:related($channel, dc0:creator, $creator)
       or some $x in dawg:anything() satisfies
           (dawg:related($channel, dc1:creator, $x)
        and dawg:related($x, pim:given, $creator))
       or dawg:related($channel, dc1:creator, $creator))
return {$channel}, {$creator}

You can move all the unreturned variables in all the previous examples
to "some...satisfies" clauses. For now I've left them out of the
language grammar.
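
The scoping point can be illustrated with Python's any(), which plays the role of "some...satisfies" here. This is a sketch under assumptions: the graph data is invented, and anything() enumerates only the terms present in the graph. Two different intermediate nodes lead to the same given name, yet the (channel, creator) pair is reported once.

```python
# Hypothetical graph; two creator nodes of ch1 share the given name "Alice".
TRIPLES = {
    ("ch1", "rdf:type", "rss:channel"),
    ("ch1", "dc1:creator", "_:p"),
    ("_:p", "pim:given", "Alice"),
    ("ch1", "dc1:creator", "_:q"),
    ("_:q", "pim:given", "Alice"),
    ("ch2", "rdf:type", "rss:channel"),
    ("ch2", "dc0:creator", "Bob"),
}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return sorted({term for triple in TRIPLES for term in triple})

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

results = [
    (channel, creator)
    for channel in anything()
    for creator in anything()
    if related(channel, "rdf:type", "rss:channel")
    and (related(channel, "dc0:creator", creator)
         # some $x in dawg:anything() satisfies (...): $x is scoped to this test
         or any(related(channel, "dc1:creator", x)
                and related(x, "pim:given", creator)
                for x in anything())
         or related(channel, "dc1:creator", creator))
]
print(results)
```

Because $x is existentially quantified and never leaves the condition, the number of matching Xs does not multiply the answers.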

8

Negation is pretty straightforward in BRQL:

SELECT ?x
WHERE      (?x rdf:type foaf:Person)
       NOT (?x foaf:family_name "Smith")
           (?x foaf:first_name "John")

and similarly for XQuery (which uses a standard function to negate
booleans):

for $x in dawg:anything()
where dawg:related($x, rdf:type, foaf:Person)
  and fn:not(dawg:related($x, foaf:family_name, "Smith"))
  and dawg:related($x, foaf:first_name, "John")
return {$x}

I think the main lesson here is that it's a lot easier to explain the
language in terms of predicates returning true and false, instead of
vague explanations of semantics.
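
Read as plain boolean predicates, the negation example needs nothing beyond "not". A small Python sketch, with invented data and the same assumed stand-ins for the two dawg functions:

```python
# Hypothetical graph: two people named John, one of them a Smith.
TRIPLES = {
    ("_:js", "rdf:type", "foaf:Person"),
    ("_:js", "foaf:family_name", "Smith"),
    ("_:js", "foaf:first_name", "John"),
    ("_:jd", "rdf:type", "foaf:Person"),
    ("_:jd", "foaf:family_name", "Doe"),
    ("_:jd", "foaf:first_name", "John"),
}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return sorted({term for triple in TRIPLES for term in triple})

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

johns_not_smith = [
    x
    for x in anything()
    if related(x, "rdf:type", "foaf:Person")
    and not related(x, "foaf:family_name", "Smith")  # fn:not(...)
    and related(x, "foaf:first_name", "John")
]
print(johns_not_smith)  # ['_:jd']
```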

9

BRQL offers yet another special-purpose language construct to deal with
source-identification features:

SELECT ?creditor, ?amount, ?dept, ?actNo, ?date
WHERE SOURCE statement1.rdf ?s bank:debtor bank:act01347797 .
      SOURCE statement1.rdf ?s bank:creditor ?creditor .
      SOURCE statement1.rdf ?s bank:amount ?amount .
                            ?s bank:date ?date .
                            ?e ical:dtstart ?date .
                            ?e joco:dept ?dept .
                            ?e joco:actNo ?actNo .

I'm still not a fan of source-identification, but the four-argument
version of "related" allows quite a trivial implementation without
complicating the grammar. By specifying which graph you're looking in
for a triple, you can scope a predicate:

for $s in dawg:anything(), $e in dawg:anything(),
    $creditor in dawg:anything(), $amount in dawg:anything(),
    $dept in dawg:anything(), $actNo in dawg:anything(),
    $date in dawg:anything()
where dawg:related($s, bank:debtor, bank:act01347797, "statement1.rdf")
  and dawg:related($s, bank:creditor, $creditor, "statement1.rdf")
  and dawg:related($s, bank:amount, $amount, "statement1.rdf")
  and dawg:related($s, bank:date, $date)
  and dawg:related($e, ical:dtstart, $date)
  and dawg:related($e, joco:dept, $dept)
  and dawg:related($e, joco:actNo, $actNo)
return {$creditor}, {$amount}, {$dept}, {$actNo}, {$date}

11.2

BRQL resorts to still more special-purpose grammar to perform formatting
of output:

CONSTRUCT (?x rdf:type ns:Class5) WHERE (?x ns:prop 5)

XQuery allows you to define your output format any way you like
(including returning just a sequence of triples). A small addition to
the grammar allows outputting of fully-compliant RDF/XML documents:

<rdf:RDF> {
for $x in dawg:anything() where dawg:related($x, ns:prop, 5)
return <ns:Class5 rdf:about="{$x}"/>
}</rdf:RDF>
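
For the "just a sequence of triples" variant, a minimal Python sketch follows. It is an assumption-laden illustration: the graph is invented, and the output is modelled as a set of tuples rather than as RDF/XML.

```python
# Hypothetical graph: ex:a has ns:prop 5, ex:b has ns:prop 7.
TRIPLES = {
    ("ex:a", "ns:prop", 5),
    ("ex:b", "ns:prop", 7),
}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return {term for triple in TRIPLES for term in triple}

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

# CONSTRUCT (?x rdf:type ns:Class5) WHERE (?x ns:prop 5)
constructed = {
    (x, "rdf:type", "ns:Class5")
    for x in anything()
    if related(x, "ns:prop", 5)
}
print(constructed)  # {('ex:a', 'rdf:type', 'ns:Class5')}
```

The "return" expression alone decides the output shape, so no separate CONSTRUCT keyword is needed in this model.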

11.2 (again)

BRQL adds more keywords to deal with the implementation-defined
"DESCRIBE" functionality:

DESCRIBE ?x WHERE (?x rdf:type foaf:Person)
                  (?x foaf:mbox_sha1sum "ABCD1234")

If this functionality is really desired, it's much cleaner to just use a
new XQuery function which performs the requisite magic:

for $x in dawg:anything()
where dawg:related($x, rdf:type, foaf:Person)
  and dawg:related($x, foaf:mbox_sha1sum, "ABCD1234")
return dawg:describe($x)

11.3

There's no entirely obvious way to extend BRQL to ask yes-no questions.
(Among other things, you come back to the existential-universal
dilemma.)
In XQuery, the expression within the "where" clause is well-defined in
its own right, and it evaluates to a boolean. The following is a
perfectly valid query to see if Rob works for Network Inference:

dawg:related(rob, worksFor, NI)

which would return "true" as a boolean value.

All the conjunction, disjunction, and negation you like make perfect
sense here, as do "some...satisfies" variable introductions if we decide
they're worth the effort.
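
A grounded query of this kind is just a boolean expression. The Python sketch below is illustrative only; the single triple and the extra term "W3C" are assumptions made for the example.

```python
# Hypothetical graph containing the one fact used in the example above.
TRIPLES = {("rob", "worksFor", "NI")}

def anything():
    """Stand-in for dawg:anything(): every term occurring in the graph (assumption)."""
    return sorted({term for triple in TRIPLES for term in triple})

def related(s, p, o):
    """Stand-in for dawg:related(): True if the triple (s, p, o) exists."""
    return (s, p, o) in TRIPLES

# A bare predicate is already a complete yes/no query.
works_for_ni = related("rob", "worksFor", "NI")

# Conjunction and negation compose in the usual way.
only_ni = related("rob", "worksFor", "NI") and not related("rob", "worksFor", "W3C")

# "some ... satisfies" asks whether any employer exists at all.
has_employer = any(related("rob", "worksFor", e) for e in anything())

print(works_for_ni, only_ni, has_employer)  # True True True
```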



13

Basing our query language on XQuery means that it's already got
well-defined semantics and extension points. All we need to do is choose
exactly what subset of the language we want. This is presumably a
balance between expressiveness (full XQuery would offer us the most) and
ease of implementation (any implementation which already supports
BRQL/RDQL should be able to handle a small enough subset).

The grammar for the language I've used above is actually extremely
simple:

Query := Prolog QueryBody

Prolog := ((NamespaceDecl | DefaultNamespaceDecl) ";")*

NamespaceDecl := "declare" "namespace" NCName "=" StringLiteral

DefaultNamespaceDecl := "declare" "default" "element" "namespace"
                        StringLiteral

QueryBody := GroundedQuery | IterativeQuery

GroundedQuery := WherePred

IterativeQuery := ForClause+ WhereClause? "return" ElementConstructor

ForClause := "for" "$" VarName "in" VariableTypeDecl
             ("," "$" VarName "in" VariableTypeDecl)*

VarName := [http://www.w3.org/TR/REC-xml-names/#NT-QName]

VariableTypeDecl := [call to dawg:anything()]

WhereClause := "where" WherePred

WherePred := AndExpr ("or" AndExpr)*

AndExpr := NegatedOrGroupedPred ("and" NegatedOrGroupedPred)*

NegatedOrGroupedPred := AtomicPred | ( "(" WherePred ")" ) | NegatedPred

NegatedPred := [start of call to fn:not] WherePred
               [end of call to fn:not]

AtomicPred := DatavalRestriction | RelatedExpr

DatavalRestriction := [datatype predicate]

RelatedExpr := [call to dawg:related(node, prop, node) function]

ElementConstructor := (ElementChar | "{{" | "}}" | EnclosedExpr)*

EnclosedExpr := "{" "$" VarName "}" | "{" IterativeQuery "}"

ElementChar := Char - [{}]


Note that I've elided the exact grammar for function calls, as well as
the details of which datatype predicates we'd include. For function
calls, we could use a general grammar for "function call" and leave the
rest to semantics, we could require the literal text I've provided, or
something in between (it seems like at least basic namespace prefixing
would be worthwhile).

I think the only things that don't inherit well-defined semantics from
XQuery are our two new function calls:

dawg:anything() returns a sequence of the members of set A from section
2.2 of the BRQL spec.

dawg:related(s, p, o) returns 'true' if the triple {s, p, o} exists, and
'false' otherwise. (Note that there's no such thing as a "variable"
passed as a parameter; in the logical model all the variables are bound
all the time, just like all 'normal' languages.)

More to the point, I think this grammar is small enough that it
shouldn't be hard to move an existing BRQL implementation over to this
syntax.

There are lots of extensions worth talking about (like partitioning
'anything' to be able to address each of the U, L, V, B sets
individually if desired), but let's talk about the simple case first.

Received on Monday, 6 September 2004 04:18:24 UTC