XQuery-style extensibility and filtering

The question of external functions as an extensibility mechanism in XQuery
came up during this morning's telecon, along with the topic of boolean
filtering. As a personal action item, I started out with the intention of
providing examples of both mechanisms in XQuery. However, since I've been
devoting large amounts of personal play time to devising an RDF path notation
that's patterned very tightly on XQuery and is now at least three-quarters
baked :-), I thought I'd take a big leap here, with your forbearance, and
illustrate these mechanisms in my own provisional attempt at a dawg-ql.
Whether you like what I've come up with or not, I hope at a minimum that it
provides a useful basis for further discussion.

First, a few quick "dawg-path" examples. I'm using an "@" notation for
predicates in a striped (subject/@predicate/object) syntactic style. The "@"
helps disambiguate short paths and provides helpful visual cues for
readability (imho). I'm playing with a BNF at the moment in which the above
three-item subject/@predicate/object sequence is the longest possible path
through the graph. Here are more:

============================================

      *
(all nodes; subject and object both)

      @foaf:*
(a listing of all (possibly distinct) foaf properties in the graph; TBD --
in XQuery you'd need to explicitly call distinct-values() on this)

     *[ @foaf:* ]
(any subject in any vocabulary having a foaf: property)

     ex:subject107/@*
(all properties belonging to subject ex:subject107)

   ex:*/@*/*
(all objects owned by ex: subjects)

    ex:*/@*/literal()
(only the literals owned by ex: subjects)

    ex:*/@foaf:*[ literal() ]
(foaf: properties of ex: subjects having literal values -- as opposed to the
values themselves)

    ex:*/@foaf:*[ literal() = "1992" ]
(foaf: properties of ex: subjects having a literal string value of "1992" )

   ex:*/@foaf:*[ literal() = ^^xsd:string ]
(and if you really want to have fun with your indices, any strings
whatsoever)

Note: if we were restricted to using only this xpath-style notation, we'd
only be providing the equivalent of a single-variable-binding capability in
the result set, which would be a major restriction. See further however ...
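
As a rough illustration of how a few of the paths above could be evaluated, here is a minimal Python sketch over an in-memory list of triples. Everything in it (the Literal class, the sample triples, the helper names) is hypothetical and made up for illustration; it is not part of the proposed notation and covers only three of the patterns.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for a literal node."""
    value: str

# Made-up sample graph: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:subject107", "foaf:name", Literal("Fernando Cosmopolitan")),
    ("ex:subject107", "foaf:mbox", Literal("fernando@asemantics.com")),
    ("ex:subject107", "ex:knows", "ex:subject108"),
    ("ex:subject108", "dc:date", Literal("1992")),
]

def matches(name, pattern):
    """Match a QName against '*', 'prefix:*', or an exact QName."""
    if pattern == "*":
        return True
    if pattern.endswith(":*"):
        return name.startswith(pattern[:-1])  # keep the trailing colon
    return name == pattern

def subjects_having(pred_pattern, triples=TRIPLES):
    """Sketch of  *[ @pred ]  : subjects having a matching property."""
    return {s for s, p, o in triples if matches(p, pred_pattern)}

def properties_of(subj_pattern, triples=TRIPLES):
    """Sketch of  subj/@*  : properties belonging to matching subjects."""
    return {p for s, p, o in triples if matches(s, subj_pattern)}

def literals_of(subj_pattern, triples=TRIPLES):
    """Sketch of  subj/@*/literal()  : literal objects only."""
    return {o for s, p, o in triples
            if matches(s, subj_pattern) and isinstance(o, Literal)}
```

Each helper returns a single set, which mirrors the single-variable-binding limitation noted above.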

===========================================

Here's the main query I want to illustrate. Building on Andy's example: Find
all subjects having a foaf:name of "Fernando Cosmopolitan" at an asemantics
mailing address.

We could state this XQuery-like in several ways:

(1) Somewhat verbosely
---------------------------

declare function contains-string( dawg-ql:Literal+ $sourceStr,
xsd:string $containsStr ) as xsd:boolean external;

*[ @foaf:name[ literal() = "Fernando Cosmopolitan"^^xsd:string ]] intersect
*[ @foaf:mbox[ contains-string( literal(), "asemantics.com" ) ]]

returns all subject nodes meeting both conditions. The empty line is
whitespace for readability (allowed in XQuery). literal() is patterned after
XQuery/XPath's "kindTest" mechanism [e.g., .../node() and .../text()] and
returns matching literals. The Literal+ in the function declaration OTOH
is part of a type specification for the first argument to the function (see
next paragraph). intersect is an operator that takes two arguments, both of
either type dawg-ql:Node* or dawg-ql:Predicate* (0 or more of each), and
returns the intersection of the two sets: all nodes belonging to both. We
short-circuit on a null sequence result from either side. (What we do in the
case of dissimilar types is fun to contemplate.)
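
To make the short-circuit behaviour concrete, here is a small Python sketch, assuming each side of intersect is evaluated lazily. The names (Nodes, by_name, by_mbox) are hypothetical, and the sketch says nothing about the dissimilar-types case.

```python
from typing import Callable, FrozenSet

Nodes = FrozenSet[str]  # hypothetical stand-in for dawg-ql:Node*

def intersect(left: Callable[[], Nodes], right: Callable[[], Nodes]) -> Nodes:
    """Return the nodes present on both sides, evaluating lazily."""
    lhs = left()
    if not lhs:
        # Short-circuit: an empty left side makes the result empty,
        # so the right-hand path is never evaluated.
        return frozenset()
    return lhs & right()

calls = []  # records which sides were actually evaluated

def by_name() -> Nodes:
    calls.append("name")
    return frozenset()  # pretend no subject has the requested foaf:name

def by_mbox() -> Nodes:
    calls.append("mbox")
    return frozenset({"ex:subject107"})

result = intersect(by_name, by_mbox)
```

Here only by_name runs, since its empty result already decides the outcome.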

The externally supplied boolean function contains-string() shows how to
provide extended string-handling capability (for example) that we won't be
providing in our native language (because of complex i18n collation issues
or whatever). The single-line prolog declares the function to be external --
defined on the client side of the fence; we only specify the signature. The
arguments to the function and the intersect operator above provide
XQuery-style type-checking capability [1]: dawg-ql:Literal+ assumes a
sequence of one or more Literal nodes via the first parameter; xsd:string assumes
a single string for the other. The function returns a single boolean. [2]
The names of the arguments in the declaration ($sourceStr, $containsStr) are
optional and provided in this case for documentation purposes.
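
One way to picture the external-function mechanism is sketched below in Python: the client registers an implementation, and the engine checks each call against the declared signature before dispatching. The registry, the Literal class, and the "true if any literal contains the substring" semantics are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for dawg-ql:Literal."""
    value: str

# Registry of externally supplied functions, keyed by declared name.
EXTERNALS: Dict[str, Callable[[Sequence[Literal], str], bool]] = {}

def contains_string(source: Sequence[Literal], needle: str) -> bool:
    """Client-side implementation (assumed semantics): true if any
    literal's value contains the given substring."""
    return any(needle in lit.value for lit in source)

EXTERNALS["contains-string"] = contains_string

def call_external(name: str, source: Sequence[Literal], needle: str) -> bool:
    """Check the arguments against the declared signature, then dispatch."""
    if len(source) == 0 or not all(isinstance(x, Literal) for x in source):
        raise TypeError("argument 1 must be dawg-ql:Literal+ (one or more)")
    if not isinstance(needle, str):
        raise TypeError("argument 2 must be xsd:string")
    result = EXTERNALS[name](source, needle)
    if not isinstance(result, bool):
        raise TypeError("declared return type is xsd:boolean")
    return result
```

An empty first argument is rejected because Literal+ means one or more, whereas Literal* would allow it.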

(2) Somewhat more terse
-----------------------------

declare function contains-string( dawg-ql:Literal+, xsd:string ) as
xsd:boolean external;

*[ @foaf:name[ "Fernando Cosmopolitan" ]] intersect
*[ @foaf:mbox[ contains-string( "asemantics.com" ) ]]

The BNF automatically provides a string type for "Fernando Cosmopolitan"
(i.e., StringLiteral), and could easily do so for ints and floats as well. The
style of function invocation in the second statement (the query "body")
assumes that all (literal) node values for foaf:mbox are passed to the
function as an implicit argument, and that we're also not bothering to
specify a namespace for our own function (see below).

(3) Expanded for readability (both input and output)
------------------------------------------------

declare prefix "externalLib" as
"http://definedOutsideTheDAWGSpecification.com";
declare function externalLib:contains-string( dawg-ql:Literal+ $sourceStr,
xsd:string $containsStr ) as xsd:boolean external;

let $people := *[ @foaf:name[ "Fernando Cosmopolitan" ]]
let $mailBoxes := *[ @foaf:mbox[ externalLib:contains-string( literal(),
                                                              "asemantics.com" ) ]]
return
      (: not sure if parens required for precedence; never a bad idea if unsure :)
      for $match in ( $people intersect $mailBoxes )
      return
             (: we construct an output sequence of multiple 3-item subsequences;
                the last item is a string-function-provided linefeed :)
             ( "subject = ", $match, chr(10) )


This example demonstrates an XQuery-like variable-binding style of output
annotation and assumes that the dawg-ql data model, similar to XQuery,
allows heterogeneous sequences of items, including in this case items of
type xsd:string and dawg-ql:Node (in the let variables and the return
sequence) and dawg-ql:Literal (in the function call).  I'm also adding a
namespace declaration for the external function in the prolog to
disambiguate it from our own built-ins (all function and variable names in
XQuery are QNames, which is kind of cool).
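
For readers who think more easily in a general-purpose language, the for/return part of example (3) can be approximated in Python as below. The Node class and the sample data are hypothetical. The 3-item subsequences are flattened into one output sequence, as XQuery does with nested sequences.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for dawg-ql:Node."""
    uri: str

Item = Union[str, Node]  # the output sequence mixes strings and nodes

def run_query(people: List[Node], mail_boxes: List[Node]) -> List[Item]:
    """Approximates: for $match in ($people intersect $mailBoxes) return (...)."""
    output: List[Item] = []
    for match in [n for n in people if n in mail_boxes]:
        # Each match contributes a 3-item subsequence, flattened into output.
        output.extend(["subject = ", match, "\n"])
    return output

# Made-up bindings for the two let variables.
people = [Node("ex:subject107"), Node("ex:subject108")]
mail_boxes = [Node("ex:subject107"), Node("ex:subject109")]
sequence = run_query(people, mail_boxes)
```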

There's more, such as mechanisms for returning triples in the result
sequence and the like, but I think that's sufficient to get the pot bubbling
... :-)

Comments?
Howard

[1] Don't freak at the mention of XQuery type-checking capability. The bulk
of the complexity in XQuery (I contend) comes from all the complications
arising from the need to be able to type XML nodes using XML Schema; we
ain't got nowhere near that degree of difficulty (unless you want to be able
to specify XPath-like descents into XMLLiterals; I don't want to go there
myself, particularly given timeframes).

[2] On a technical note, I'm assuming that under the hood this boolean
function is called repeatedly and implicitly and presented with each
candidate Literal argument in turn, that a boolean result is returned for
each test, and that subject nodes on paths failing the test are then
dropped. I can also visualize a "bulk"-type argument-passing mechanism
(probably more efficient), in which all candidate literals are passed to the
function once en masse; what gets returned (this function has a different
signature from the one above) is the sequence of 0 or more literal nodes
that satisfy the query; only paths containing those nodes are retained.
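
The two calling conventions described in [2] can be contrasted with a short Python sketch. The function names, the Literal class, and the sample candidates are hypothetical; the only point is that both styles should keep the same subjects.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple

@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for dawg-ql:Literal."""
    value: str

Candidate = Tuple[str, Literal]  # (subject, literal reached via the path)

def filter_per_item(candidates: Sequence[Candidate],
                    test: Callable[[Literal], bool]) -> Set[str]:
    """Call the boolean function once per candidate literal."""
    return {subj for subj, lit in candidates if test(lit)}

def filter_bulk(candidates: Sequence[Candidate],
                select: Callable[[List[Literal]], List[Literal]]) -> Set[str]:
    """Pass all literals at once; the function returns those that satisfy."""
    kept = set(select([lit for _, lit in candidates]))
    return {subj for subj, lit in candidates if lit in kept}

# Made-up candidates and client-side functions for both styles.
candidates = [
    ("ex:subject107", Literal("fernando@asemantics.com")),
    ("ex:subject108", Literal("someone@example.org")),
]

def contains_one(lit: Literal) -> bool:
    return "asemantics.com" in lit.value

def contains_many(lits: List[Literal]) -> List[Literal]:
    return [lit for lit in lits if "asemantics.com" in lit.value]

per_item = filter_per_item(candidates, contains_one)
bulk = filter_bulk(candidates, contains_many)
```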

Received on Tuesday, 4 May 2004 19:15:11 UTC