Re: What is an RDF Query? from Sandro Hawke on 2001-09-10 (www-rdf-rules@w3.org from September 2001)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 10 Sep 2001 11:37:49 -0400
To: "Eric Prud'hommeaux" <eric@w3.org>
cc: www-rdf-rules@w3.org
Message-Id: <200109101537.f8AFbni31647@wadimousa.hawke.org>
Eric Prud'hommeaux wrote:
> On Fri, Sep 07, 2001 at 06:22:14PM -0400, Sandro Hawke wrote:
> > 
> > I agree that RDF queries and RDF rule premises are basically the same
> > things.  So what is an RDF query?
> > 
> > At a very abstract level, I think the RDF query API is something like:
> > 
> >     match(dataset, pattern) -> set of solutions
> > 
> > This vaguely matches every RDF query system I've heard of.  The
> > dataset is a set of RDF statements (triples), and the pattern is a set
> > of RDF statements (triples) which may have existential variable
> > elements.  A solution is either (1) a mapping from the variables to
> > constants or (2) a set of triples which match the pattern (that is,
> > with the variable substitution done), or (3) both.  I think this is
> > equivalent to a relational join.
> > 
> > There is a shift in complexity if we go with the interpretation of RDF
> > "anonymous nodes" as existential variables.  That simplifies things by
> > saying the pattern is just an RDF graph like any other, but it
> > complicates things by allowing the dataset to have variables too.
> > This seems to be equivalent to trying to perform unification [1]
> > between the two sets as conjunctions of their triples, with the
> > complication that the elements have no intrinsic ordering.  (Does that
> > turn this into a much harder problem, or is there a trick to making it
> > not matter?)
> > 
> > This makes the match seem more symmetric, but it's still being able to
> > match all the triples in the second argument which constitutes
> > "success".   
> 
> Ambiguities will arise if anonymous nodes do double duty as variables
> and unlabeled addresses in a graph.

I don't think so.  I think the only reasonable interpretation of
anonymous nodes is as existential variables, which do very natural
double duty in natural language and formal logic.  (If there's another
reasonable interpretation, I'd be interested to hear about it.)

You gave the example:

> <r:Description about="http://...bus_218">
>    <b:scheduledStop>
>       <r:Description>
>          <b:city>Boston</b:city>
>          <b:time>14:59EST</b:time>
>          <b:terminal>Z</b:terminal>
>       </r:Description>
>    </b:scheduledStop>
> </r:Description>
>
> would not say "bus 218 has a stop in Boston at 14:59" but instead
> "I am talking about all of 218's stops in Boston at 14:59."  The
> statement would not be useful to a trip planner that didn't have
> an external assertion of the existence of this scheduled stop.

I think you're reading that as a universal ("for all ...") variable,
in this case.  (The existential will turn into a universal in a
minute, though, when we negate/conditionalize it to use this as a
query.)  So the RDF, asserted, says:

(1) There is something, globally known as "http://...bus_218" which
    has a scheduledStop, and that scheduledStop has a city of "Boston", a
    time of "14:59EST", and a terminal of "Z".

That's fine for simple constructions.  We've only got one unnamed
existential variable ("that" and its relatives) in English and, I
think, in RDF/XML; if we needed to intermix data about several unnamed
objects, we would introduce temporary local-scope names (existential
variables):

(2) There is something, globally known as "http://...bus_218" which
    has a scheduledStop, which I'll refer to as Stop1 here.  Stop1 has
    a city of "Boston", a time of "14:59EST", and a terminal of "Z".

Without changing what is being asserted, we could go one step further,
as almost everyone does, and give Stop1 a global name, a Skolem
constant (a genid):

(3) There is something, globally known as "http://...bus_218" which
    has a scheduledStop, globally known as
    "urn:uuid:67df927a-a5fc-11d5-93e4-0050ba4812a6", which has a city of
    "Boston", a time of "14:59EST", and a terminal of "Z".

All three versions mean the same thing as an assertion, but what if I
make it into a question?

(1q) Is there something, globally known as "http://...bus_218" which
     has a scheduledStop, and that scheduledStop has a city of "Boston", a
     time of "14:59EST", and a terminal of "Z"?

That makes perfect sense, and the answer (given the data in the
previous assertions) would be yes.  However, if I turn (3) into a
question:

(3q) Is there something, globally known as "http://...bus_218" which
     has a scheduledStop, globally known as
     "urn:uuid:cc93d928-a5fd-11d5-8753-0050ba4812a6", which has a city of
     "Boston", a time of "14:59EST", and a terminal of "Z"?

the answer is No, because the Skolem constant is different (as it
probably has to be in any practical usage scenario).

So if you keep the anonymous nodes as existential variables (rather
than Skolemizing them with genid() as people have been doing) the
meaning of an asserted RDF document/graph does not change, and now you
can use the RDF graph in an additional meaningful way.
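
To make the difference between (1q) and (3q) concrete, here is a
minimal Python sketch, not any existing query engine.  It assumes
triples are plain string tuples, that any pattern term written
"_:name" is an existential variable, and that the dataset has already
been Skolemized with the genid from (3).  The prefixed names such as
"b:city" and the helpers match() and stop_pattern() are made up for
illustration.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

BUS = "http://...bus_218"  # elided in the original message, kept as-is
GENID_A = "urn:uuid:67df927a-a5fc-11d5-93e4-0050ba4812a6"  # genid in the data
GENID_B = "urn:uuid:cc93d928-a5fd-11d5-8753-0050ba4812a6"  # genid in query (3q)


def is_var(term: str) -> bool:
    # Assumption: "_:name" terms in a pattern are existential variables.
    return term.startswith("_:")


def match(dataset: Set[Triple], pattern: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield each variable binding under which every pattern triple is in the dataset."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for fact in dataset:
        trial = dict(binding)
        consistent = True
        for p, f in zip(first, fact):
            if is_var(p):
                # Bind the variable, or check it against an earlier binding.
                if trial.setdefault(p, f) != f:
                    consistent = False
                    break
            elif p != f:
                consistent = False
                break
        if consistent:
            yield from match(dataset, rest, trial)


def stop_pattern(stop: str) -> List[Triple]:
    """Build the bus-stop graph with the given term standing for the stop."""
    return [
        (BUS, "b:scheduledStop", stop),
        (stop, "b:city", "Boston"),
        (stop, "b:time", "14:59EST"),
        (stop, "b:terminal", "Z"),
    ]


# Assertion (3): the stop has been given a Skolem constant.
dataset = set(stop_pattern(GENID_A))

# (1q): the stop stays an existential variable, so the answer is yes.
answer_1q = any(True for _ in match(dataset, stop_pattern("_:stop")))

# (3q): the query carries its own, different genid, so the answer is no.
answer_3q = any(True for _ in match(dataset, stop_pattern(GENID_B)))

print(answer_1q, answer_3q)
```

The same graph shape serves as both assertion and question only in the
variable form; once the query is Skolemized on its own, the two genids
can never meet.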

> It also doesn't say anything about the node set you've selected with
> the set of assertions containing a variable. We'll need something
> outside of (or above) the model to delineate the selection from the
> assertions.

I think the only additional information we'll need is the question
being asked.    But maybe you can give me an example?

> Another problem is that there is no way in RDF/XML to assert multiple
> statements with a common anonymous node as the object. This limits the
> realm of expressible queries. For instance, this algae query that
> looks for members of groups that I trust would be inexpressible:
> 
> (ask '((http://...memberOf ?id          ?group)
>        (http://...trusts   http://...me ?group))
>  collect '(?id ?group))

Yes, RDF/XML is lame.  :-)  There's a trivial way to extend it here,
though, which is to adopt the N-Triples convention of using _:foo
where you would otherwise use a URI-Reference to identify something
with a document-scope identifier.  I'm sure there are some cleaner
solutions, but they would probably not fit the style of RDF/XML so
well.   
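
As a rough illustration of what the shared _:group label buys, here is
a small Python sketch of the algae query above written as two triple
patterns that share a document-scope name.  It solves each triple on
its own and then joins the results on the shared label, which is the
relational-join reading of a query.  The data ("ex:alice", "ex:team",
and so on) and the helper names are invented for the example; the
elided "http://..." names are kept exactly as quoted.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

MEMBER_OF = "http://...memberOf"  # elided in the original, kept as-is
TRUSTS = "http://...trusts"
ME = "http://...me"


def solutions_for(dataset: List[Triple], triple_pattern: Triple) -> List[Binding]:
    """Return one binding per dataset triple that fits a single pattern triple."""
    out = []
    for fact in dataset:
        binding: Binding = {}
        for p, f in zip(triple_pattern, fact):
            if p.startswith("_:"):
                # "_:name" is a document-scope variable (assumed convention).
                if binding.setdefault(p, f) != f:
                    break
            elif p != f:
                break
        else:
            out.append(binding)
    return out


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Natural join: keep pairs of bindings that agree on every shared variable."""
    joined = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined


# Hypothetical data: two memberships, one trusted group.
dataset = [
    ("ex:alice", MEMBER_OF, "ex:team"),
    ("ex:bob", MEMBER_OF, "ex:strangers"),
    (ME, TRUSTS, "ex:team"),
]

# The two statements share _:group, which RDF/XML nesting alone cannot express.
pattern = [
    ("_:id", MEMBER_OF, "_:group"),
    (ME, TRUSTS, "_:group"),
]

rows = join(solutions_for(dataset, pattern[0]), solutions_for(dataset, pattern[1]))
collected = [(row["_:id"], row["_:group"]) for row in rows]
print(collected)
```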


> - anonymous nodes as variables only in pattern:
> 
> This seems to mostly work - I can't think of a reason to assert the
> existence of an anonymous node in a query.

Queries can't ever assert anything, can they?

(Although Fikes et al.'s query system has a temporary-assertion part
packaged with the query pattern, but I think that's different.  I'm
still trying to figure out what that's useful for.  Maybe it's for
asking hypothetical questions with a fairly weak logic?)

> The down side is that you can't make assertions about variables used
> in a query. If the same terms show up in the dataset, they identify
> something different. This solution also has the cost that queries must
> be rigorously sequestered from the dataset or the query will assert
> the very statements you are querying.  This would be true of statements
> in the query that don't have any variables at all (I don't know that
> these would exist, though).

I imagine queries always have to be rigorously sequestered from the
dataset, in all systems.  When they have an entirely different
structure, this may be more obvious.  In English, there's a big
difference in meaning between "Ralph is in his office." and "Ralph is
in his office?".   It's bad if we miss the last character (or tone of
voice) because the form is so similar.

And that above example is a useful query with no variables (internal
or returned).  I believe all queries can be done with no variables
returned, although it gets a little tedious extracting a long string
of bytes from a database purely through yes-or-no questions.
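
Just to show how tedious that gets, here is a toy Python sketch that
recovers one literal using nothing but yes-or-no questions.  It
assumes the query interface is a single ask() that takes a
variable-free triple and answers true or false, and that the asker
already knows the value is some "HH:MMEST" time.  Both assumptions,
and the function names, are mine and purely illustrative.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

STOP = "urn:uuid:67df927a-a5fc-11d5-93e4-0050ba4812a6"  # Skolem name from (3)


def ask(dataset: Set[Triple], triple: Triple) -> bool:
    """A query with no variables at all: is this exact triple in the data?"""
    return triple in dataset


def recover_time(dataset: Set[Triple], stop: str) -> Tuple[Optional[str], int]:
    """Find the stop's time by asking about each candidate in turn.

    Returns the time found (or None) and the number of questions asked.
    """
    questions = 0
    for hour in range(24):
        for minute in range(60):
            candidate = f"{hour:02d}:{minute:02d}EST"
            questions += 1
            if ask(dataset, (stop, "b:time", candidate)):
                return candidate, questions
    return None, questions


dataset = {
    (STOP, "b:city", "Boston"),
    (STOP, "b:time", "14:59EST"),
    (STOP, "b:terminal", "Z"),
}

found, asked = recover_time(dataset, STOP)
print(found, asked)
```

Returning variable bindings is a convenience on top of this, not a
different kind of query.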


> - use something else for variables.
> 
> None of the query engines I have played with encode queries in
> RDF. This frees them up to use whatever they want to encode
> variables. The problem is, naturally, there is little
> interoperability.  This limits not only the ability to use the same
> query in different environments, but also the ability to make formal
> assertions about queries and rules.

Yep.   This works, but it ends up more complicated than we need, if we
just handle existential variables properly.

I have no problem with reifying queries so they can live in datasets
safely, but that does point out that RDF's defined reification system
does not handle existential variables, unless we use the _:foo trick
there, too.  (Who came up with "_:" ?   Whoever it was -- thank you!)

> One could reify the statements in a query and define a new node type
> for variables. Following is an example of coding the above algae query
> as a series of s:Constraints which are subtypes of r:Statement. It is
> only slightly more verbose...
> 
> <r:Description>
>    <q:hasTerm>
>       <q:Constraint ID="1">
>          <s:Predicate r:resource="http://...memberOf" />
>          <s:Subject>
>             <q:Variable r:ID="?id" />
>          </s:Subject>
>          <s:Object>
>             <q:Variable r:ID="?group" />
>          </s:Object>
>       </q:Constraint>
>    </q:hasTerm>
>    <q:hasTerm>
>       <q:Constraint ID="2">
>          <s:Predicate r:resource="http://...trusts" />
>          <s:Subject r:resource="http://...me" />
>          <s:Object>
>             <q:Variable r:ID="?group" />
>          </s:Object>
>       </q:Constraint>
>    </q:hasTerm>
> </r:Description>
> 
> The cool thing about this model is that it never asserts
>   ?id -----------http://...memberOf-> ?group
>   http://...me --http://...trusts---> ?group
> so it's safe to encounter in the dataset. This also means that one could
> make statements about the query which would probably be crucial in a lot
> of trust systems. Just an idea, have at.

Do you think rdf:subject, rdf:predicate, and rdf:object do the job
just as well, if we have an agreement what _:foo means?

Here's an example of saying that myQuery has a maxRunTime of 300
seconds (that part could be cleaner, of course), and the pattern to be
matched is the same as your pattern above.  I used daml:collection
because I think it's important to know you have all the pieces of the
pattern before you try to satisfy it (and I don't think your
formalization conveys that information).

<r:Description about="...myQuery">
   <q:maxRunTime>300 seconds</q:maxRunTime>
   <q:pattern r:parseType="daml:collection">
      <r:Statement>
         <r:subject   r:resource="_:id" />
         <r:predicate r:resource="http://...memberOf" />
         <r:object    r:resource="_:group" />
      </r:Statement>
      <r:Statement>
         <r:subject   r:resource="http://...me" />
         <r:predicate r:resource="http://...trusts" />
         <r:object    r:resource="_:group" />
      </r:Statement>
   </q:pattern>
</r:Description>
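
For what it's worth, here is a small Python sketch of how a consumer
might turn reified statements like these back into a pattern.  It
assumes the graph is a set of string triples, that the closed
collection has already been read into an ordered list of statement
nodes, and that "stmt1" and "stmt2" are hypothetical labels for those
nodes.  dereify() is an invented helper, not part of any RDF toolkit.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

RDF_SUBJECT = "rdf:subject"
RDF_PREDICATE = "rdf:predicate"
RDF_OBJECT = "rdf:object"
MEMBER_OF = "http://...memberOf"  # elided in the original, kept as-is
TRUSTS = "http://...trusts"
ME = "http://...me"


def dereify(graph: Set[Triple], statement_nodes: List[str]) -> List[Triple]:
    """Rebuild one pattern triple from each reified statement node."""
    pattern = []
    for node in statement_nodes:
        # Collect the rdf:subject / rdf:predicate / rdf:object of this node.
        parts = {p: o for (s, p, o) in graph if s == node}
        pattern.append((parts[RDF_SUBJECT], parts[RDF_PREDICATE], parts[RDF_OBJECT]))
    return pattern


# The reified query as it would sit in a dataset.
graph = {
    ("stmt1", RDF_SUBJECT, "_:id"),
    ("stmt1", RDF_PREDICATE, MEMBER_OF),
    ("stmt1", RDF_OBJECT, "_:group"),
    ("stmt2", RDF_SUBJECT, ME),
    ("stmt2", RDF_PREDICATE, TRUSTS),
    ("stmt2", RDF_OBJECT, "_:group"),
}

# The closed collection tells us these are all the pieces of the pattern.
pattern = dereify(graph, ["stmt1", "stmt2"])

# None of the pattern triples is itself asserted in the graph.
asserted = [t for t in pattern if t in graph]
print(pattern, asserted)
```

The graph only describes the pattern's statements, so it stays safe to
store next to ordinary data.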

Thoughts?

     -- sandro
Received on Monday, 10 September 2001 11:40:00 UTC