- From: Rob Shearer <Rob.Shearer@networkinference.com>
- Date: Sun, 5 Sep 2004 21:15:23 -0700
- To: "RDF Data Access Working Group" <public-rdf-dawg@w3.org>
We've made quite a bit of progress on what kinds of features we'll have in our query language, and we have a strawman to demonstrate the form they take when they all come together. While custom-designed languages are always very appealing because they are tailored to your current use cases, they cause huge migration problems.

BRQL/RDQL certainly has its strengths, but the syntax is completely incompatible with every query language with a substantial user base. I feel that the superficial similarity to SQL is in fact more of a hindrance than a help, in that the two languages are totally different and require radically different mindsets for users to be able to understand queries effectively.

I still think that some form of XQuery compatibility would be invaluable to further the real-world use of RDF. I propose that we adopt a syntax which is fully compatible with XQuery, in that every DAWG query is in fact a valid XQuery. Further, I propose that results of DAWG queries should not violate the semantics of the XQuery constructs those queries include.

This certainly wouldn't require that any DAWG implementation support XQuery in general. We would be free to adopt a language as limited and easy-to-implement as desired. All it would mean is that the syntax we use is a subset of the XQuery syntax.

It turns out that this actually isn't so hard. XQuery is tremendously expressive in general (as has been noted, it's Turing-equivalent), so it's certainly possible to translate any BRQL/RDQL-style query into XQuery syntax. What's more, the transformation is generally very straightforward.

Looking at it from the XQuery point of view, you somehow need to get "RDF processing" functionality. XQuery already includes a feature for connectivity to such extended processing via "external functions": http://www.w3.org/TR/xquery/#FunctionDeclns.
We could make these functions as complex as we like (technically, a single "doQuery" function which takes a string in BRQL syntax would fill the bill), but good design dictates that functions be as simple as possible. The simple language I suggest below adds only two simple external functions.

There's a full grammar for the particular fragment of XQuery I've chosen at the end of this message, but it's probably easiest to show the transformation from BRQL syntax to XQuery syntax by example, so let's go through the BRQL spec (at version 1.52 as I write this; http://www.w3.org/2001/sw/DataAccess/rq23/) substituting an XQuery-compatible syntax for the existing syntax:

2.1 We start with simple SELECT...WHERE queries:

    SELECT ?title
    WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }

result:

    ?title = "BRQL Tutorial"

This example actually already demonstrates one of the minor problems with BRQL as it is: what's listed is NOT actually the result of the query. What's listed is a string representing a data structure that programmers must traverse to find the result. The XQuery examples will use the W3C-endorsed XML syntax to encode the structure of their result. That doesn't mean that the results need to "be" XML (or "just" XML), only that where the result has structure I'll be using XML to write it down instead of the kind of proprietary ad-hoc syntax currently used in the BRQL spec to encode structure.

The above query would be written in the XQuery-compatible syntax as:

    for $title in dawg:anything()
    where dawg:related(http://example.org/book/book1, http://purl.org/dc/elements/1.1/title, $title)
    return {$title}

to return the result:

    "BRQL Tutorial"

(In this case, we didn't include any "structure", so it doesn't use any XML tags.)

The BRQL spec then goes on to introduce namespace prefixes:

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title
    WHERE { <http://example.org/book/book1> dc:title ?title . }

XQuery already has a well-defined system for working with such namespaces: http://www.w3.org/TR/xquery/#id-namespace-decls. The above query can be written:

    declare namespace dc="http://purl.org/dc/elements/1.1/";
    for $title in dawg:anything()
    where dawg:related(http://example.org/book/book1, dc:title, $title)
    return {$title}

BRQL offers yet more syntax for declaring an empty namespace prefix:

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX : <http://example.org/book/>
    SELECT ?title
    WHERE { :book1 dc:title ?title . }

XQuery already includes the same functionality:

    declare namespace dc="http://purl.org/dc/elements/1.1/";
    declare default element namespace "http://example.org/book/";
    for $title in dawg:anything()
    where dawg:related(book1, dc:title, $title)
    return {$title}

2.2 This section then attempts to explain "triple patterns" and the like. Frankly, although I know what we're getting at, I think it's more an artifact of our own roundabout path in our requirement-gathering that we're even talking about things like that. Users understand iteration and boolean predicates, and it would make a lot more sense to talk in those terms. What BRQL calls a "triple pattern" is really just a call to a function 'dawg:related' which takes three parameters (the elements of an RDF triple) and returns true if that triple exists and false if it does not. The sample query:

    SELECT * WHERE { ?x ?x ?v }

is written in XQuery syntax as:

    for $x in dawg:anything(), $v in dawg:anything()
    where dawg:related($x, $x, $v)
    return {$x, $v}

2.3 Conjunction in BRQL is encoded using curly braces and dots:

    SELECT ?mbox
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    WHERE { ?x foaf:name "Johnny Lee Outlaw" .
            ?x foaf:mbox ?mbox . }

XQuery syntax instead uses the keyword "and":

    declare namespace foaf="http://xmlns.com/foaf/0.1/";
    for $x in dawg:anything(), $mbox in dawg:anything()
    where dawg:related($x, foaf:name, "Johnny Lee Outlaw")
      and dawg:related($x, foaf:mbox, $mbox)
    return {$mbox}

Again, talking about a conjunction of two boolean predicates makes a lot more sense to me than making up new notions like "graph pattern". Only RDF die-hards are interested in such concepts.

2.4 The result-formatting problem begins to become apparent in BRQL:

    SELECT ?name, ?mbox
    WHERE (?x foaf:name ?name) (?x foaf:box ?mbox)

    ?name = "Johnny Lee Outlaw" , ?mbox = <mailto:jlow@example.com>
    ?name = "Peter Goodguy" , ?mbox = <mailto:peter@example.org>

XQuery syntax makes it quite straightforward to add as much, or as little, structure as you like to a result:

    for $x in dawg:anything(), $name in dawg:anything(), $mbox in dawg:anything()
    where dawg:related($x, foaf:name, $name)
      and dawg:related($x, foaf:box, $mbox)
    return <x><name>{$name}</name><mbox>{$mbox}</mbox></x>

    <x><name>Johnny Lee Outlaw</name><mbox>mailto:jlow@example.com</mbox></x>
    <x><name>Peter Goodguy</name><mbox>mailto:peter@example.org</mbox></x>

3 In my opinion, one of the biggest limitations of BRQL is that it requires a completely new language and model for managing datatypes. This is a very, very big deal, since unlike the homogeneous data in relational tables, in RDF you never know just which datatypes a variable may bind to:

    SELECT ?title ?price
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX ns: <http://example.org/ns#>
    WHERE { ?x dc:title ?title .
            ?x ns:price ?price .
            ?price < 30 }

    ?title = "The Semantic Web" , ?price = 23

The issues here are obvious. What if price isn't an integer? When will the comparison return 'true'? When a user program needs to process this data, how will it know that price is an integer? There are lots of issues still to work out.
One of the most time-consuming aspects of XQuery standardization was specifying exactly what all the semantics were when datatypes interacted. Automatic casting and type conversions were defined, and a standard library of datatype predicates developed. In XQuery syntax there is no ambiguity, because the semantics are well-defined:

    declare namespace dc="http://purl.org/dc/elements/1.1/";
    declare namespace ns="http://example.org/ns#";
    for $x in dawg:anything(), $title in dawg:anything(), $price in dawg:anything()
    where dawg:related($x, dc:title, $title)
      and dawg:related($x, ns:price, $price)
      and $price < 30
    return <book><title>{$title}</title><price>{$price}</price></book>

result:

    <book><title>The Semantic Web</title><price>23</price></book>

We certainly don't have to support *all* the datatype predicates available in standard XQuery implementations, but we at least have the freedom of choosing a subset of them without worrying about questionable semantics.

4.1 Optionals are a tricky subject. I still don't like the way we're doing it, particularly the fact that it's the triple (which sits in a WHERE clause and thus seems like a predicate) which determines what needs to be bound and what doesn't, instead of the variable declaration itself.

    SELECT ?name ?mbox
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    WHERE { ?x foaf:name ?name .
            OPTIONAL { ?x foaf:mbox ?mbox } }

    ?name = "Alice" , ?mbox = <mailto:alice@work.example>
    ?name = "Bob"

This query also demonstrates the problem with result formatting: unlike in SQL, results aren't simple tables; they've got a lot more structure. Just getting programmers used to standard APIs for traversing rectangular tables was hard enough. It makes a lot more sense to me that we simply allow nesting of queries.
The semantics are much clearer:

    for $x in dawg:anything(), $name in dawg:anything()
    where dawg:related($x, foaf:name, $name)
    return <person name="{$name}">
      (for $mbox in dawg:anything()
       where dawg:related($x, foaf:mbox, $mbox)
       return <mbox>{$mbox}</mbox>)
    </person>

result:

    <person name="Alice"><mbox>mailto:alice@work.example</mbox></person>
    <person name="Bob"></person>

With more than one OPTIONAL:

    SELECT ?name ?mbox ?hpage
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    WHERE { ?x foaf:name ?name .
            OPTIONAL { ?x foaf:mbox ?mbox } .
            OPTIONAL { ?x foaf:homepage ?hpage } }

    ?name = "Alice" , ?hpage = <http://work.example.org/alice/>
    ?name = "Bob" , ?mbox = <mailto:bob@work.example>

becomes:

    for $x in dawg:anything(), $name in dawg:anything()
    where dawg:related($x, foaf:name, $name)
    return <person name="{$name}">
      (for $mbox in dawg:anything()
       where dawg:related($x, foaf:mbox, $mbox)
       return <mbox>{$mbox}</mbox>)
      (for $hpage in dawg:anything()
       where dawg:related($x, foaf:homepage, $hpage)
       return <hpage>{$hpage}</hpage>)
    </person>

I still don't think it's ideal, but it's consistent and it's more like an extension of the existing semantics than coming up with whole new language features.

5 This section of the BRQL spec just addresses some very weird syntax that confuses the hell out of me. Targeting a language at the N3 community seems like a great way to avoid general adoption...

6 Although I still don't think we necessarily need to address source selection in this version of the language, there's a very simple approach in the XQuery syntax. Each of our new external functions can simply take an extra argument which identifies the RDF graph(s) to be queried.
For example, to get a list of all of Rob's girlfriends from his little black book:

    for $x in dawg:anything("http://v.cx/littleblackbook.rdf")
    where dawg:related(Rob, hasGirlfriend, $x, "http://v.cx/littleblackbook.rdf")
    return $x

result:

    (empty)

Sequences are first-class types in XQuery, so it's quite trivial to extend this to querying multiple RDF sources (and aggregating them) by passing multiple sources in this argument.

7 The BRQL 'or' example actually skips the simple case and jumps straight to a slightly more complex one. A very simple example would be:

    SELECT ?channel ?creator
    PREFIX rss: <http://purl.org/rss/1.0/>
    PREFIX dc0: <http://purl.org/dc/elements/1.0/>
    PREFIX dc1: <http://purl.org/dc/elements/1.1/>
    WHERE { ?channel rdf:type rss:channel
            { ?channel dc0:creator ?creator }
            OR { ?channel dc1:creator ?creator } }

The XQuery translation is obvious:

    for $channel in dawg:anything(), $creator in dawg:anything()
    where dawg:related($channel, rdf:type, rss:channel)
      and (dawg:related($channel, dc0:creator, $creator)
           or dawg:related($channel, dc1:creator, $creator))
    return {$channel}, {$creator}

The actual example is more complex, but it's not (just) the disjunction that causes the complexity:

    SELECT ?channel ?creator
    PREFIX rss: <http://purl.org/rss/1.0/>
    PREFIX dc0: <http://purl.org/dc/elements/1.0/>
    PREFIX dc1: <http://purl.org/dc/elements/1.1/>
    PREFIX pim: <http://www.w3.org/2000/10/swap/pim/contact#>
    WHERE { ?channel rdf:type rss:channel
            { ?channel dc0:creator ?creator }
            OR { ?channel dc1:creator ?x . ?x pim:given ?creator }
            OR { ?channel dc1:creator ?creator } }

This comes back to another oddity in BRQL: new variables can be declared in the WHERE clause. It's not entirely obvious what the meaning of this is: do all possible Xs need to meet the condition, or just some? If two different Xs satisfy the condition, should we return two answers? This is a major departure from SQL, where the actual things that you're binding (the rows of the tables) are explicitly declared.
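As a rough illustration of the "two different Xs" question (not part of the proposal itself), here is a minimal Python sketch. The tiny triple set, the term names and the helper `related` are all assumptions made up for the example; `related` only mimics the three-argument dawg:related predicate described in this message.

```python
# Hypothetical data: one channel whose two creators share a given name.
TRIPLES = {
    ("chan", "dc1:creator", "p1"),
    ("chan", "dc1:creator", "p2"),
    ("p1", "pim:given", "Alice"),
    ("p2", "pim:given", "Alice"),
}

# Every term that occurs anywhere in the triples (a stand-in for the
# values a declared variable may range over).
TERMS = sorted({term for triple in TRIPLES for term in triple})


def related(s, p, o):
    """Mimics dawg:related(s, p, o): true iff the triple exists."""
    return (s, p, o) in TRIPLES


# Reading 1: ?x is iterated like any declared variable, so each
# satisfying x contributes its own answer.
iterated = [
    (c, g)
    for c in TERMS for g in TERMS for x in TERMS
    if related(c, "dc1:creator", x) and related(x, "pim:given", g)
]

# Reading 2: ?x is existentially quantified ("some $x ... satisfies"),
# so it only decides whether (channel, given name) qualifies at all.
existential = [
    (c, g)
    for c in TERMS for g in TERMS
    if any(
        related(c, "dc1:creator", x) and related(x, "pim:given", g)
        for x in TERMS
    )
]

print(iterated)
print(existential)
```

Under the first reading the pair for the channel and "Alice" appears once per matching x; under the second it appears once. Which of the two a query means is exactly what the scoping of the variable has to settle.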
I personally think explicit declaration is much easier to understand (and I've written all my examples that way), but adding these "extra" variables doesn't complicate the XQuery syntax much (we can use "quantified expressions": http://www.w3.org/TR/xquery/#id-quantified-expressions). What's more, it's very clear in XQuery just what the semantics of such variables are, because they're clearly scoped:

    for $channel in dawg:anything(), $creator in dawg:anything()
    where dawg:related($channel, rdf:type, rss:channel)
      and (dawg:related($channel, dc0:creator, $creator)
           or some $x in dawg:anything() satisfies
              (dawg:related($channel, dc1:creator, $x)
               and dawg:related($x, pim:given, $creator))
           or dawg:related($channel, dc1:creator, $creator))
    return {$channel}, {$creator}

You can move all the unreturned variables in all the previous examples to "some...satisfies" clauses. For now I've left them out of the language grammar.

8 Negation is pretty straightforward in BRQL:

    SELECT ?x
    WHERE (?x rdf:type foaf:Person)
          NOT (?x foaf:family_name "Smith")
          (?x foaf:first_name "John")

and similarly for XQuery (which uses a standard function to negate booleans):

    for $x in dawg:anything()
    where dawg:related($x, rdf:type, foaf:Person)
      and fn:not(dawg:related($x, foaf:family_name, "Smith"))
      and dawg:related($x, foaf:first_name, "John")
    return {$x}

I think the main lesson here is that it's a lot easier to explain the language in terms of predicates returning true and false, instead of vague explanations of semantics.

9 BRQL offers yet another special-purpose language construct to deal with source-identification features:

    SELECT ?creditor, ?amount, ?dept, ?actNo, ?date
    WHERE SOURCE statement1.rdf ?s bank:debtor bank:act01347797 .
          SOURCE statement1.rdf ?s bank:creditor ?creditor .
          SOURCE statement1.rdf ?s bank:amount ?amount .
          ?s bank:date ?date .
          ?e ical:dtstart ?date .
          ?e joco:dept ?dept .
          ?e joco:actNo ?actNo .
I'm still not a fan of source-identification, but the four-argument version of "related" allows quite a trivial implementation without complicating the grammar. By specifying which graph you're looking in for a triple, you can scope a predicate:

    for $s in dawg:anything(), $e in dawg:anything(),
        $creditor in dawg:anything(), $amount in dawg:anything(),
        $dept in dawg:anything(), $actNo in dawg:anything(),
        $date in dawg:anything()
    where dawg:related($s, bank:debtor, bank:act01347797, "statement1.rdf")
      and dawg:related($s, bank:creditor, $creditor, "statement1.rdf")
      and dawg:related($s, bank:amount, $amount, "statement1.rdf")
      and dawg:related($s, bank:date, $date)
      and dawg:related($e, ical:dtstart, $date)
      and dawg:related($e, joco:dept, $dept)
      and dawg:related($e, joco:actNo, $actNo)
    return {$creditor}, {$amount}, {$dept}, {$actNo}, {$date}

11.2 BRQL resorts to still more special-purpose grammar to perform formatting of output:

    CONSTRUCT (?x rdf:type ns:Class5)
    WHERE (?x ns:prop 5)

XQuery allows you to define your output format any way you like (including returning just a sequence of triples). A small addition to the grammar allows outputting of fully-compliant RDF/XML documents:

    <rdf:RDF>
    { for $x in dawg:anything()
      where dawg:related($x, ns:prop, 5)
      return <ns:Class5 rdf:about="{$x}"/> }
    </rdf:RDF>

11.2 (again) BRQL adds more keywords to deal with the implementation-defined "DESCRIBE" functionality:

    DESCRIBE ?x
    WHERE (?x rdf:type foaf:Person) (?x foaf:mbox_sha1sum "ABCD1234")

If this functionality is really desired, it's much cleaner to just use a new XQuery function which performs the requisite magic:

    for $x in dawg:anything()
    where dawg:related($x, rdf:type, foaf:Person)
      and dawg:related($x, foaf:mbox_sha1sum, "ABCD1234")
    return dawg:describe($x)

11.3 There's no entirely obvious way to extend BRQL to ask yes-no questions. (Among other things, you come back to the existential-universal dilemma.)
In XQuery, the thing within the "where" clause is a well-defined thing in its own right, and it evaluates to a boolean. The following is a perfectly valid query to see if Rob works for Network Inference:

    dawg:related(rob, worksFor, NI)

which would return "true" as a boolean value. All the conjunction, disjunction, and negation you like make perfect sense here, as do "some...satisfies" variable introductions if we decide they're worth the effort.

13 Basing our query language on XQuery means that it's already got well-defined semantics and extension points. All we need to do is choose exactly what subset of the language we want. This is presumably a balance between expressiveness (full XQuery would offer us the most) and ease of implementation (any implementation which already supports BRQL/RDQL should be able to handle a small enough subset).

The grammar for the language I've used above is actually extremely simple:

    Query := Prolog QueryBody
    Prolog := ((NamespaceDecl | DefaultNamespaceDecl) ";")*
    NamespaceDecl := "declare" "namespace" NCName "=" StringLiteral
    DefaultNamespaceDecl := "declare" "default" "element" "namespace" StringLiteral
    QueryBody := GroundedQuery | IterativeQuery
    GroundedQuery := WherePred
    IterativeQuery := ForClause+ WhereClause? "return" ElementConstructor
    ForClause := "for" "$" VarName "in" VariableTypeDecl ("," "$" VarName "in" VariableTypeDecl)*
    VarName := [http://www.w3.org/TR/REC-xml-names/#NT-QName]
    VariableTypeDecl := [call to dawg:anything()]
    WhereClause := "where" WherePred
    WherePred := AndExpr ("or" AndExpr)*
    AndExpr := NegatedOrGroupedPred ("and" NegatedOrGroupedPred)*
    NegatedOrGroupedPred := AtomicPred | ( "(" WherePred ")" ) | NegatedPred
    NegatedPred := [start of call to fn:not] WherePred [end of call to fn:not]
    AtomicPred := DatavalRestriction | RelatedExpr
    DatavalRestriction := [datatype predicate]
    RelatedExpr := [call to dawg:related(node, prop, node) function]
    ElementConstructor := (ElementChar | "{{" | "}}" | EnclosedExpr)*
    EnclosedExpr := "{" "$" VarName "}" | "{" IterativeQuery "}"
    ElementChar := Char - [{}]

Note that I've elided the exact grammar for function calls (we could use a general grammar for "function call" and leave the rest to semantics, we could require the literal text I've provided, or something in between (it seems like at least basic namespace prefixing would be worthwhile)) as well as details of which datatype predicates we'd include.

I think the only things that don't inherit well-defined semantics from XQuery are our two new function calls. dawg:anything() returns a sequence of the members of set A from section 2.2 of the BRQL spec. dawg:related(s, p, o) returns 'true' if the triple {s, p, o} exists, and 'false' otherwise. (Note that there's no such thing as a "variable" passed as a parameter; in the logical model all the variables are bound all the time, just like all 'normal' languages.)

More to the point, I think this grammar is small enough that it shouldn't be hard to move an existing BRQL implementation over to this syntax. There are lots of extensions worth talking about (like partitioning 'anything' to be able to address each of the U, L, V, B sets individually if desired), but let's talk about the simple case first.
Received on Monday, 6 September 2004 04:18:24 UTC