RE: Disjunction vs. Optional ... and UNION from Geoff Chappell on 2005-03-27 (public-rdf-dawg-comments@w3.org from March 2005)

From: Geoff Chappell <geoff@sover.net>
Date: Sun, 27 Mar 2005 10:47:49 -0500
To: "'Bob MacGregor'" <bmacgregor@siderean.com>
Cc: <public-rdf-dawg-comments@w3.org>, <andy.seaborne@hp.com>
Message-ID: <00ea01c532e4$54bc8360$6401a8c0@gsclaptop>
> -----Original Message-----
> From: Bob MacGregor [mailto:bmacgregor@siderean.com]
> Sent: Saturday, March 26, 2005 10:21 PM
> To: Geoff Chappell
> Cc: public-rdf-dawg-comments@w3.org
> Subject: Re: Disjunction vs. Optional ... and UNION
> 
> Geoff,
[...] 
> I think that 'x or not x' captures the spirit of what we want, and 'x or
> true' does also (modulo it produces
> too many bindings), but I'm no longer convinced that it actually works
> semantically.  If there really
> is a value meaning 'UNBOUND' (i.e., if its not just some kind of fiction
> in our semantics), then something
> in the 'not x' expression has to make that binding
> happen.  
> To be concrete, let  'x' correspond to '?x P ?y', and assume ?x
> is bound to 'a', then we are
> expecting that successful evaluation of
>      'not (a P ?y)'
> will somehow cause ?y to be bound to NULL, i.e., to the unbound value.
> I don't see any machinery that makes
> that happen.  Perhaps there is a discussion somewhere that I missed that
> says what the logical (e.g., model
> theoretic) formalism is that makes this binding occur?

I don't think you'll get away saying that a negation binds variables (though
a particular implementation may cause that to happen). Even more to the
point, I think any discussion of the action of variable binding necessarily
puts you in an implementation discussion vs. a (declarative) semantics
discussion.

I'll try again to explain how I understand this issue (and why I don't
believe there's a problem coming up with semantics for optional) with an
example:

Consider our favorite graph:

 _:a  foaf:name       "Alice" .
 _:b  foaf:name       "Bob" .
 _:b  foaf:mbox       <mailto:bob@work.example> .
 _:c  foaf:mbox       <mailto:noname@work.example.org> .

and the query:

 PREFIX foaf: <http://xmlns.com/foaf/0.1/>
 SELECT ?name ?mbox
 WHERE
   { ?x foaf:name  ?name .
     OPTIONAL { ?x foaf:mbox ?mbox } .
     ?y foaf:mbox ?mbox .
   }

Now consider the finite domain of things over which the variables range:

{_:a, _:b, _:c, foaf:name, foaf:mbox, "Alice", "Bob",
 <mailto:bob@work.example>,
 <mailto:noname@work.example.org> }

and consider the set of all vars in the query:

 {?x ?name ?mbox ?y}

and all of the possible values for those vars (i.e. letting each variable
range over the complete domain):

 {<_:a _:a _:a _:a>, <_:a _:a _:a _:b>, <_:a _:a _:a _:c>,...}

and consider that the constraints of the query serve to eliminate many of
those possibilities. You can see that you'll get the same results (variable
bindings for which all constraints hold) regardless of the order that you
apply the constraints. 

As an implementation strategy, this would be pretty naïve and a performance
disaster, but as a description of the semantics it seems perfectly
reasonable.

Now consider the similar query:

 PREFIX foaf: <http://xmlns.com/foaf/0.1/>
 SELECT ?name ?mbox
 WHERE
   { ?x foaf:name  ?name .
     OPTIONAL { ?x foaf:mbox ?mbox } .
   }

We can look at this query in a slightly modified non-standard form (and
since any logical expression can be rewritten in disjunctive normal form,
similarly we ought to be able to rewrite any sparql query as a union of
purely conjunctive queries):

 SELECT ?name ?mbox
 WHERE
   { ?x foaf:name  ?name .
     { ?x foaf:mbox ?mbox }.
   } 
   UNION
 SELECT ?name ?mbox
 WHERE
   { ?x foaf:name  ?name .
     NOT { ?x foaf:mbox ?mbox }.
   }

We can consider the meaning of each part independently, understanding that
the result of the whole query is the union of the results of the parts.

Again consider the set of vars in the query:

 {?x ?name ?mbox}

and all of the possible values for those vars:

 {<_:a _:a _:a >, < _:a _:a _:b>, < _:a _:a _:c>,...}

Now notice that the second part doesn't restrict mbox, so rather than
letting mbox range over the whole domain (which would yield a slew of
unhelpful results), we instead say that its range is a single special value
(NULL or ...).

Now if you accept this approach, I think you could fairly easily formalize
it to specify the semantics of sparql. I'll take a quick whack at it here
(but please understand my experience with such formalisms is limited so I
may make a hideous mess :-)

I believe the idea of tuples of possible variable bindings is captured with
the notion of a valuation - a function from the sets of variables to the set
of domain members (I think this is right - I looked through a database book
I had on the shelf and it's the closest formalism I could find that seemed
to say what I wanted to say).

Assuming each sparql query can be rewritten as the union of one or more
purely conjunctive queries, we can define the semantics of a single
conjunctive query and accept the whole query is defined as the union of its
parts.

Disregarding the projections, a single conjunctive query will look like this
(using a rule-like notation):

	Result(u) <- R1(u1),...,Rn(un)

where Rn represents one of several relations  - e.g.:TRIPLE, QUAD (for named
graph triples), or FILTER (for constraint expressions) - and each relation
is either positive or negated.

Then the meaning of the query over a RDF Dataset (DS) is:

 q(DS) = {v(u) | v is a valuation over the variables in q 
	and for each i in [1,n],
		v(ui) in DS(Ri), if Ri is positive,
		v(ui) not in DS(Ri), if Ri is negated}.

The valuation must incorporate the idea that range of a variable that is not
range-restricted is NULL (it is range restricted if it occurs in a positive
TRIPLE or QUAD relation, otherwise not).  
	
Does that make any sense?



> Recently, I came up with a scheme that isn't too distant from the 'not
> x' that
> WOULD cause the binding of ?y to NULL/UNBOUND.  I passed my idea on to
> Pat Hayes, because I trust him to
> tell me whether my proposal is viable or not, and I don't trust myself
> to know for sure.  He is really
> busy right now, but hopefully we will hear from him in a bit.
> 
> <snip>
> 
> >- sparql is a mixed model language - by that I mean that constraints are
> >functional, everything else logical (lp-style) (that's why there are two
> >ANDs, and two ORs). To the lp side of things, a FILTER clause looks like
> a
> >single logical relation - e.g.:
> >	FILTER ?a > 10 && ?b < 12
> >Can be thought of as:
> >	filter("?a > 10 && ?b < 12")
> >where filter will be true if the functional evaluation of the constraints
> >have an ebv of xs:true, false otherwise.
> >
> >
> KIF has functions, and does not make this distinction.  Given that
> simpler is better, I so far fail
> to understand why we need two of everything.  Somehow, its seems that
> SPARQL's functional constraints
> are inferior to the KIF equivalent.

I imagine the sparql constraint approach is driven by charter requirements
that the group reuses other specifications (e.g. xquery) where possible. I
didn't really like the design at first, but I've come to think that it's ok,
or at least not harmful (especially since you can achieve a purely logical
view by putting each constraint in its own FILTER clause).
 
> <snip>
> 
> > Nesting, disjunctions, and constraints aren't really an issue once the
> >
> >semantics of OPTIONAL are understood. An algorithm for evaluating a query
> >according to those semantic might be:
> >
> >Convert query to SOP form.
> >Order conjuncts in each product so that: positive triple pattern < filter
> <
> >negated triple pattern or filter.
> >Evaluate each product left to right and union the results.
> >
> >
> The above algorithm appears to assume that triple patterns and filters
> can be conjoined but
> not disjoined.  The messiness of the syntax makes it hard for me to be
> sure what is and
> isn't expressible but it sounds to me that, for example, a KIF statement
> (expressed in infix notation) such as
>      (?x P ?y) and (?z  P ?y) and ((?z Q ?y) or (?y > 5))
> is not expressible in SPARQL.

I think it can be as:

select ?x where {?x <P> ?y. ?z  <P> ?y. {{?z <Q> ?y} UNION {FILTER ?y > 5}}}

I also have a hard time figuring out exactly how to write queries (I usually
have to rely on my parser to tell me if I've used valid syntax). The syntax
is really ugly at the moment and I believe will impact adoption if it's not
corrected. Perhaps the biggest problem is that sparql is now curly bracket
happy (where it used to be parenthesis happy). I'd much rather see the above
query written as:

select ?x where {?x <P> ?y. ?z  <P> ?y.} 
	AND ({?z <Q> ?y} OR (FILTER ?y > 5))

Here's a summary of syntax changes that I think would dramatically improve
Sparql:
- use curlies to delimit turtle graph patterns only (no optionals or filters
allowed within curlies)
- recognize that there is an implicit AND between each of the triples of the
turtle graph pattern (so {?a ?b ?c} AND {?x ?y ?z} means the same as {?a ?b
?c. ?x ?y ?z}). 
- require an explicit AND to join logical factors
- UNION keyword becomes OR
- parens are used for grouping
- FILTER keyword is optional or goes away (I think...)
- NOT is added (I claim this is a syntax change only since the expressivity
is already there by virtue of OPTIONAL)


Rgds,

Geoff
Received on Sunday, 27 March 2005 15:48:10 UTC