Re: Updated definitions from Dan Connolly on 2005-05-18 (www-archive@w3.org from May 2005)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 18 May 2005 13:56:56 -0500
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: www-archive@w3.org, Pat Hayes <phayes@ihmc.us>, Eric Prud'hommeaux <eric@w3.org>
Message-Id: <1116442616.3429.109.camel@localhost>
On Wed, 2005-05-18 at 14:53 +0100, Seaborne, Andy wrote:
> Formal Definitions from SPARQL Query Language for RDF

> Definition: Query Variable 
> 
> A named query variable is a member of the set V where V is infinite
>  and disjoint from RDF-T. V is otherwise arbitrary.
> 
> An unnamed query variable is a blank node in the query pattern and is
> disjoint from any blank node in the RDF graphs being matched.

That 2nd sentence seems like it belongs elsewhere.
Ah... I think it's best to leave it out of the abstract
definitions, and note in/near the grammar (or just
outside the box) that a parser
needs to come up with a variable name that's distinct
from all the others when it parses the [] construct.


> Definition: Graph Pattern 
> 
> A Graph Pattern is constructed from triple patterns and value
> constraints, using the following operations to combine patterns:
> 
>       * Basic Graph Pattern, set set of triple patterns 
>       * Group Pattern, where each of a set of graph patterns must all
>         match 
>       * Optional Graph Pattern, where a graph patterns may extend the
>         solution 
>       * Union Graph Pattern, where one of a set of graph patterns must
>         match 
>       * Patterns on Named Graphs, where patterns are matched against
>         named graphs

I think it suffices to say:

a Graph Pattern is either a Basic Graph Pattern,
a Group Pattern, and Optional Graph Pattern, ... 
or a Pattern on Named Graphs.

add to this list: a Constraint Pattern; more below.


> Definition: SPARQL Query 
> 
> A SPARQL query is a tuple consisting of a graph pattern, an RDF
> dataset description, a number of solution sequence modifiers and a
> result form.
> 
Dataset description? Not just a dataset?
hmm. I wonder where that comes in...
It doesn't seem to show up anywhere else.

As I think I suggested earlier, the semantics of FROM and FROM NAMED
should be specified by reference to the protocol, not as part of
the query language formalism.

That reminds me... the protocol spec needs to say how
to get from the default-graph-uri and the
named-graph-uri's to the corresponding graphs.
I'll try to remember to send mail about that separately.

> Definition: Query Pattern 
> 
> A query has one main graph pattern called the Query Pattern.
> 
A SPARQL Query has...

Or:

If (GP, D, M, R) is a SPARQL Query, then GP is its Query Pattern.

This whole definition might be left out, as it's reasonably
clearly implicit in the defintion of SPARQL Query.

> Definition: Triple Pattern 
> 
> A triple pattern is member of the set:
> (RDF-T union V) x (RDF-U union V) x (RDF-T union V)
> 
> Definition: Solution 

Solution or Substitution?
> 
> A substitution is a function from a subset of the set of variables to
> the set of RDF terms, RDF-T.
> 
> The result of replacing every v in a graph pattern P by S(v) is
> written S(P).
> 
> We say that S is a solution of P if P matches G with substitution S.
> 
> Definition: Basic Graph Pattern 
> 
> A Basic Graph Pattern is a set of Triple Patterns.
> 
> A basic graph pattern matches on graph G with substitution S if S(GP)
> is entailed by G.
> 
> Definition: Query Solution 
> 
> A Query Solution is a Pattern Solution for the Query Pattern. A
> substitution in a query solution only contains variables mentioned in
> the query.
> 
suggest:

Given a Query Pattern, a Query Solution is one of its Pattern Solutions
whose substitutions only contains variables mentioned in the Query
Pattern.

or:

Given a Query Q = (QP, D, M, R), S is a Query Solution for Q iff
S is a Pattern Solution for QP and every variable in the domain
of S is mentioned in QP.


> Definition: Query Results 
> 
> The Query Results, for a given graph pattern GP on graph G, is written
> R(GP,G), and is the set of all query solutions S such that S is a
> solution for GP on G.

hmm... rather: of all pattern solutions S, no?

I'm looking for the definition of Pattern Solution and I don't see it.
Hmm.

> R(GP, G) may be the empty set. 
> 
> Definition: Value Constraint 
> 
> A graph pattern may involve a value constraint, which is a
> boolean-valued expression of variables and RDF Terms that restricts
> query solutions.

Umm... "may involve"... the definition of Graph Pattern given
didn't involve value constraints. Suggest rather adding
Constraint Pattern above and:

Definition: Value Constraint

A Value Constraint is a boolean-valued expression of variables
and RDF Terms.

Definition: Constraint Pattern

A Constraint Pattern is a pair of a Value Constraint
and a Graph Pattern. S is a solution for a Constraint
Pattern (C, GP) iff C(S) is true and S is a solution for GP.



> Definition: Group Graph Pattern 
> 
> A Group Graph Pattern GP is a set of graph patterns, GPi.
> 
> A solution of Group Graph Pattern GP on graph G is any solution S such
> that, for every element GPi of GP, S is a solution of GPi.
> 
> Definition: Optional Graph Pattern 
> 
> An Optional Graph Pattern is a graph pattern that can create new
> solutions, but will always match.
> 
> Given graph pattern GP, the Optional Graph Pattern Opt(GP) matches
> with solution S if S is a solution for a match of GP, or S is not a
> solution for GP.

That doesn't work with my suggested clarification of Graph Pattern.
I think you had it better earlier, when an Optional Graph Pattern
had two parts:

An Optional Graph Pattern is a pair of graph patterns (R, O).
S is a solution for the optional graph pattern (R, O) iff
  - S is a solution for R and S is not a solution for O, or
  - S is a solution for the merge of R and O


> Definition: Union Graph Pattern 
> 
> A union graph pattern is a set of graph patterns GP i .
> 
> A union graph pattern matches a graph G with substitution S if there
> is some GP i such that GP i matches G with substitution S.

OK.

> Definition: RDF Dataset 
> 
> An RDF dataset is a set = { G, (u1, G1), (u2, G2), . . . (un, Gn) }
> where G and each Gi are graphs, and each ui is a URI. Each ui is
> distinct.
> 
> G is called the default graph. Gi are named graphs.

OK

> Definition: DataSet Graph Pattern 
> 
> If D is a dataset {G, (<u1> G1), ...}, and P is a graph pattern then S
> is a pattern solution of GRAPH(g, P) if:
> 
> g is a URI where g = <ui> for some i, and S is pattern solution of P
> on Gi or g is a variable, S maps the variable g to <ui> and S is a
> pattern solution of P on Gi.

OK

Hmm... all the other definitions of graph pattern solution are
given w.r.t. just a graph, not a whole dataset. Ooops.
We should specify how they work with a whole DataSet.

> Definition: Solution Sequence 
> 
> A solution sequence is a list of solutions.
> 
> The solution sequence from matching the query pattern is unordered.

I don't know how to make sense of that 2nd sentence. I think
it should be struck; earlier, the matches
of a query pattern is defined to be a set, not a sequence.

I think it'll be handy to have a notation for the set of solutions
in the sequence. Perhaps:

  For a solution sequence SQ = <S1, S2, S3, ...> we write SQ'
  for the set {S1, S2, S3, ...}


> Definition: Result Forms 
> 
> Based on the solution sequnc of the query pattern, there are a number
> of Result Forms that a SPARQL query can produce: 
>       * SELECT, where a table is produced 
>       * CONSTRUCT, where an RDF graph is constructed 
>       * DESCRIBE, where information about the resources located by the
>         query pattern is returned 
>       * ASK, where a boolean is returned indicating whether the
>         pattern matched or not.

Suggest:

A Result Form is one of the tokens SELECT, CONSTRUCT, DESCRIBE, or ASK.

Oops... later I discover that the CONSTRUCT form needs to carry
a pattern... and I guess the SELECT form needs to carry its
list of variables.


> Definition: Projection 
> 
> For a substitution S and a finite set of variables VS,
> project(S, VS) = { (v, S[v]) | v in VS }
> 
> For a query solution Q project(Q, VS) is { project(S, VS) | S in Q }
> 
> For a set QS of query solutions, project(QS, VS) is { project(Q, V) |
> Q in QS }
> 
> Definition: Distinct Solution Sequence 
> 
> A Distinct Solution Sequence has no two solutions the same.
> 
> Two solutions are the same if they associate the same named variables
> with teh same RDF-Terms.

That 2nd sentence shouldn't be part of the definition (since the
definition of Solution already gives identity condiditions); it could
be an informative note.

Ah... actually, I think you mean:

 SQ is an distinct solution sequence for Q = (GP, D, M, R)
 iff every S in SQ' is a solution for GP and either M
 has no DISTINCT modifier or every SQ[i] is distinct.

> Definition: Ordered Solution Sequence 
> 
> A ordered solution sequence is a solution sequence where the sequence
> is partially ordered with respect to some ordering condition. 

Er... every sequence is partially ordered with respect to some ordering
condition.

This term and definition seem redundant.

Perhaps you/we mean:

SQ is an ordered solution sequence for Q = (GP, D, M, R)
iff SQ is a distinct solution sequence for Q and SQ is in
order with respect to the ordering expressions in M.

plus, outside the box:

  Note that if there are no ordering expressions in M,
  every solution sequence is in order.


> Definition: Limited Solution Sequence 
> 
> A Limited Solution Sequence has at most a given, fixed number of
> members.

Again, except for infinite solution sequences, every solution
sequence has at most some given number of members.

Perhaps you/we mean:

SQ is an ordered solution sequence for Q = (GP, D, M, R)
iff SQ is an ordered solution sequence for Q and either
M has no LIMIT or SQ is shorter than the LIMIT given in M.

> Definition: Offset Solution Sequence 
> 
> An Offset Solution Sequence with respect to another solution sequence
> SEQ, is one which starts at a given index of SEQ.
> 
> The index of a sequence is index 0 (zero).

Suggest rather:

 If SQ1 is an ordered, limited solution sequence for
 Q1 = (GP, D, M1, R), then SQ2 is an offset solution
 for Q2 = (GP, D, M2, R) iff M2 is M1 plus OFFSET N
 and SQ2 is all but the first N items in SQ1,
 i.e. <SQ1[N], SQ1[N+1], SQ1[N+2], ...>.

 If M has no OFFSET modifier, than every limited
 solution sequence SQ of Q=(GP, D, M, R) is also
 an offset solution sequence of Q.


The set of solution sequence modifiers seems to be left
implicit. I suppose it's clear enough as is, but not
as rigorous as it could be; it could be defined
as: a tuple of (S, L, O, D) where L, the limit, is an integer
or infinity, O, the offset, is an integer, D is
a boolean, and S is a set of ordering expressions. Ordering
expressions could be further specified. Bleah. This is all obvious
and just about as well left implicit as is.

So.. is that the definition that the protocol spec
should reference? the hunk of XML you get back has
to represent an offset solution sequence of the
(abstract form, after parsing of the) given query.

Hm... that only covers the SELECT case. Er... actually,
we haven't incorporated the projection step, have we?

The ASK case is straightforward to specify
(ala if there are any non-empty offset
solution sequences, it's true, else false).

Somewhere we specify how to take a set of
solutions and a construct pattern and turn
it into a graph, right? That should be a
boxed definition... it should define
construct(P, S) from pattern P and
substitution S.

The protocol for DESCRIBE is kinda vacuuous:
every graph is a correct answer.
 
It seems kinda awkward to include the result form
in the abstract definition of SPARQL Query. Maybe it
works out OK... I suggest the following definitions
that the protocol spec and/or results format specs
should reference:

Definition: Answer Binding Sequence

 Given a query Q=(GP, D, M, SELECT VS),
 and an RDF dataset DS,
 SQ is an answer binding sequence of Q w.r.t. DS
 iff there is some offset solution sequence SQ2
 where each SQ[i] is project(SQ2[i], VS).

 There are no answer binding sequences for
 queries with result forms other than SELECT.

(hmm... is SQ2 unique? Should we be speaking
of *the* answer binding sequence? Ah... no...
not all queries give a total order.)

Definition: Boolean Answer

 Given a query Q=(GP, D, M, ASK) and an
 RDF dataset DS, If there is any non-empty offset
 solution sequence SQ of Q, then the the
 boolean answer is true, else it is false.

 The boolean answer of queries with results
 forms other than ASK is unspecified.

Definition: Graph Answer

 Given a query Q=(GP, D, M, CONSTRUCT P) and an
 RDF dataset DS, if SQ is an offset solution
 sequence of Q, then the merge of construct(P, SQ[i])
 for each i is a graph answer for Q given DS.

 Given a query Q=(GP, D, M, DESCRIBE) and
 an RDF Dataset DS, every RDF graph is
 a graph answer for Q given DS.

 The graph answers(s) for queries with result
 forms other than CONSTRUCT and ASK is unspecified.

Hmm... is merge right? We talked about whether
bnodes share in this case. I forget which way
it works. Maybe it should be union.




-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
see you at XTech in Amsterdam 24-27 May?
Received on Wednesday, 18 May 2005 19:54:35 UTC