[Fwd: Re: RDF Data Access Working Group : first working draft of SPARQL] from Seaborne, Andy on 2004-10-17 (public-rdf-dawg@w3.org from October to December 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Sun, 17 Oct 2004 18:27:31 +0100
To: 'RDF Data Access Working Group' <public-rdf-dawg@w3.org>
Message-ID: <4172AB83.3020006@hp.com>
Comments from Graham Klyne: discussion for the WG of the issues he raises 
inline.

	Andy

Graham Klyne wrote:
> At 15:08 13/10/04 +0100, Seaborne, Andy wrote:
> 
> 
> 
>>The RDF Data Access Working Group is happy to announce the first working
>>draft of the query language part of its work:
>>
>>   SPARQL RDF query language
>>   http://www.w3.org/TR/rdf-sparql-query/
>>
>>The Working Group is soliciting feedback on this early draft ...
> 
> 
> On first glance, it's looking good to me.  Here are some random thoughts:
> 
> ...
> 
> 1. Is the SELECT clause really useful?  My implementations return all 
> variable bindings from the query, and I simply ignore those I don't want.
> 
> ...

Locally, that is true - I'm sure that when the query processor and the
application are in the same process, the QP may ignore the SELECT (I know 
I do for everything except presenting results - anything else would be 
pure overhead).

When results are encoded to be sent over the network, reducing the number
of variables in each query solution can reduce the number of bytes needed
to be sent.

Presentation of query results may also be informed by the SELECT clause, 
such as removing variables which will be bNodes (e.g. FOAF data) and other 
variables introduced solely for path creation in the query itself.


> 
> 2. In section 2.2: "Not every binding needs to exist in every row of the 
> table.".  I think this is an important feature whose presence should be 
> very clear.  Currently, it seems a bit buried.

Good point.  Should do that.


> 
> ...
> 
> 3.  I think the terminology around "Definition: Triple Pattern Matching" is 
> a bit muddled.  Is a "binding" a substitution for a *single* variable, or a 
> tuple of variables?  (I think you mean the former)  I think it's important 
> to be very clear about this, and have clear terms corresponding to:
>    (a) a single name->value binding (a "cell")
>    (b) a tuple of name->value bindings, with no name repeated (a "row")
>    (c) a set of tuples of name->value bindings, (with no tuple repeated 
> under permutation?) (a "table")
> 
> These are distinctions I've found to be important to keep clear in my 
> implementation work.

Agreed - it's muddled.  The terminology here is important and we will revisit.

It should be:
a) binding (a single name/value pair)
b) set of bindings - pattern solution when the set of bindings gives the
way a pattern matches
c) query results, where the bindings are saying how a pattern was matched
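
A rough sketch of these three levels in Python, purely as an illustration.
The type names `Binding`, `Solution` and `Results` and the helper
`bindings_of` are hypothetical and not taken from the draft.

```python
from typing import Dict, List, Tuple

# Hypothetical names for illustration only; none come from the SPARQL draft.
Binding = Tuple[str, str]   # (a) one variable name paired with one value
Solution = Dict[str, str]   # (b) a set of bindings, no variable name repeated
Results = List[Solution]    # (c) every solution found for a pattern


def bindings_of(solution: Solution) -> List[Binding]:
    """List the individual name/value bindings that make up one solution."""
    return sorted(solution.items())


# One way the pattern ( ?a ?b ?c ) might match a single triple.
solution: Solution = {"a": ":s1", "b": ":p1", "c": ":o1"}
results: Results = [solution]
```

Using a dict for (b) enforces "no name repeated" by construction; using a
list for (c) leaves room for repeated solutions.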


It would be useful if people could make suggestions for names of things here.

As noted below, "query results" may be confusing as it is only referring
to the pattern of the query - not what the application might see when the
query form is applied.

> 
> ...
> 
> 4. In section 2.2: "If the same variable name is used more than once in a 
> pattern then, within each solution to the query, the variable has the same 
> value."  This, too, I think is important to keep clearly stated.

Will do.

> 
> ...
> 
> 5. I note that variables are allowed in predicate position.  If this 
> doesn't present any problems, I'm all in favour of this, but I think the 
> design decision could be highlighted more clearly.

It hadn't occurred to me that it might not be possible.  I'm not aware of
any issues arising.  That feature is available in several existing query
languages.

> 
> ...
> 
> 6. Can the resulting variable bindings contain repeated 
> binding-tuples;  e.g. in response to a query like:
>     SELECT ?a ?c
>     WHERE  ( ?a ?b ?c )
> against the graph:
>     :s1 :p1 :o1 .
>     :s1 :p2 :o1 .

Yes - there can be repeated rows in the table.  It's a bag by the time
SELECT has projected out any variables.

There seems to be a problem with terminology that needs correcting.  The
term "query solution" is used but it is confusing where it applies.  At
least the Query Results definition either has to use "bag" or be clear 
what it applies to.  Alternative naming might also help.

There are two solutions (sets of bindings for variables "a" "b" and "c") 
to the pattern match.  Query solutions are not affected by the query 
result form.

The query form takes solutions and transforms them into the
application-level results.  (Implementations may, of course, use all
information available in the query request and dataset to perform
optimizations to query execution.)

SELECT projects just the "a" and "c" bindings in each.  SELECT does not
change the number of rows in the table; SELECT DISTINCT does in this case.
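
A minimal sketch of that behaviour in Python, using Graham's two-triple
graph and his query `SELECT ?a ?c`.  The function names (`match_all`,
`select`, `select_distinct`) are assumptions made for this illustration and
say nothing about how any real query processor is built.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# The two-triple graph from Graham's example.
GRAPH: List[Triple] = [
    (":s1", ":p1", ":o1"),
    (":s1", ":p2", ":o1"),
]


def match_all(graph: List[Triple]) -> List[Dict[str, str]]:
    """Match the pattern ( ?a ?b ?c ): one solution per triple."""
    return [{"a": s, "b": p, "c": o} for (s, p, o) in graph]


def select(solutions: List[Dict[str, str]], variables: List[str]) -> List[Tuple[str, ...]]:
    """Project each solution onto the named variables; the row count is unchanged."""
    return [tuple(sol[v] for v in variables) for sol in solutions]


def select_distinct(solutions: List[Dict[str, str]], variables: List[str]) -> List[Tuple[str, ...]]:
    """Project, then drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(select(solutions, variables)))


solutions = match_all(GRAPH)                            # two distinct solutions
rows = select(solutions, ["a", "c"])                    # a bag: two identical rows
distinct_rows = select_distinct(solutions, ["a", "c"])  # a single row
```

The two solutions differ only in "b", so they become identical rows once "b"
is projected away.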

The language in the document is clearly confusing and I will go back and
find better wording and terminology in the pattern matching sections and
query form sections.

[Steve - you may wish to comment here]

> Later, you mention that a query result is a set, so I guess that means no 
> duplicates, but I haven't yet seen this stated more explicitly.

I can see that there is confusion here: that text is in the sections on
pattern matching, which do not depend on details of the query form.  A
restriction of variables in SELECT does not change the solutions.

Need to rework terminology around "query solution" and "query results".

> 
> Later, you introduce SELECT DISTINCT, so I guess that means a simple query 
> result can have duplicate binding-tuples.  So it's not a set.
> 
> ...
> 
> 7. Section 4
> 
> I note you've chosen to allow optional elements of graph patterns, but not 
> alternatives.  In one of my implementations I provided alternative blocks, 
> where the last alternative could be empty, hence also providing optional 
> patterns.  Alternatives are permitted to bind the same variable, thus 
> providing ways to match different (graph-syntactical) expressions of the 
> same information.  I have sometimes found this to be useful, but it does 
> somewhat mess up the clean semantics of the approach you have adopted.

:-)

> 
> Despite the semantic messiness, I do feel that having some capability to 
> select one possible match over another, when dealing with possibly messy 
> real-world data, could be useful enough to justify the consequent 
> complication of query optimization when such a feature is used.

A concrete example would help me because I would have thought optionals
were exactly that complication.  There is a case for disjunction also in 
Graham's argument.

The "pattern OR nothing" form does not quite give the same as optional
if OR is union-like as it would give the "nothing" solution as well
as the pattern matching solution when the pattern matched.
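
A small Python sketch of that difference, under the assumption that OR
behaves like a union of its two branches.  The data (`AGES`) and every
function name here are hypothetical.

```python
from typing import Callable, Dict, List

Solution = Dict[str, str]
Extender = Callable[[Solution], List[Solution]]


def optional(base: List[Solution], extend: Extender) -> List[Solution]:
    """Keep the extended solutions if there are any, else the original one."""
    out: List[Solution] = []
    for sol in base:
        extended = extend(sol)
        out.extend(extended if extended else [sol])
    return out


def pattern_or_nothing(base: List[Solution], extend: Extender) -> List[Solution]:
    """Union-like OR: the empty branch always contributes the original solution."""
    out: List[Solution] = []
    for sol in base:
        out.extend(extend(sol))  # the "pattern" branch
        out.append(sol)          # the "nothing" branch
    return out


# Hypothetical data: only :Jenny has a known age.
AGES = {":Jenny": "10"}


def with_age(sol: Solution) -> List[Solution]:
    """Extend a solution with ?age when the person has one."""
    person = sol["whom"]
    return [{**sol, "age": AGES[person]}] if person in AGES else []


base = [{"whom": ":Jenny"}, {"whom": ":Bob"}]
opt = optional(base, with_age)
alt = pattern_or_nothing(base, with_age)
```

Here `opt` has two solutions while `alt` has three: the extra one is :Jenny
without an age, even though the age pattern did match for her.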

> 
> ...
> 
> 8. Section 8
> 
> The current position seems about right to me.  Complicating the basic query 
> mechanism to handle "accessing direct subclass relationship" seems 
> undesirable and unnecessary:  presenting a graph with (notional) explicit 
> types (etc.) where implied by subclass relationships seems to me to be 
> sufficient.
> 
> ...
> 
> Section 9.
> 
> Constraining the source of a pattern seems to be only a (small) part of the 
> provenance story.  Is it not also desirable to query the source?
> 
> Oops!  I now see that <source> can be a variable.  OK, that's neat, and 
> works cleanly at the natural unit of provenance, viz the statement.

Our unit of provenance could also be viewed as the subgraph because graphs 
are the unit of exchange.  Need to take this on board for the next round 
of drafting.

> 
> Is it fair to assume that support for SOURCE may be optional?  Ah yes, if 
> unsupported, bind source variables to NULL.

Yes - in some way.  May not be as currently shown.

>  If a statement occurs in more 
> than one source with a source variable pattern, does that result in 
> multiple variable-binding-tuples?  (I think it should.)

Yes.
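
A minimal sketch of why that follows, assuming each statement is held
together with the source it came from.  The quad layout and the function
name `match_source_pattern` are assumptions for illustration; the data
mirrors Graham's foaf:age example.

```python
from typing import Dict, List, Tuple

# (source, subject, predicate, object); an assumed storage layout.
Quad = Tuple[str, str, str, str]

QUADS: List[Quad] = [
    (":source1", ":Jenny", "foaf:age", "10"),
    (":source2", ":Jenny", "foaf:age", "10"),
    (":source3", ":Jenny", "foaf:age", "11"),
]


def match_source_pattern(quads: List[Quad], predicate: str) -> List[Dict[str, str]]:
    """Match SOURCE ?ppd ( ?whom <predicate> ?age ).

    Each (source, statement) pair gives its own solution, so a statement
    found in two sources appears twice with different ?ppd values.
    """
    return [
        {"ppd": src, "whom": s, "age": o}
        for (src, s, p, o) in quads
        if p == predicate
    ]


solutions = match_source_pattern(QUADS, "foaf:age")
```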

> 
> e.g. the pattern:
>    SOURCE ?ppd ( ?whom foaf:age ?age )
> might return
>    :source1 :Jenny foaf:age "10"
>    :source2 :Jenny foaf:age "10"
>    :source3 :Jenny foaf:age "11"
> etc.
> 
> ...
> 
> Section 11
> 
> I think this might better be titled "result forms".

Good idea.

> 
> Is it intended that every SPARQL implementation must support every result 
> form?  I think that could add unnecessary implementation complexity.  I 
> think there should be one form supported by all implementations, and SELECT 
> seems a reasonable choice.  I don't really see a compelling case for 
> requiring the others to be universally available.
> 
> I think the ASK result form is also reasonable.
> 
> Thought:  if a query pattern has no variables, is there a distinction for 
> SELECT * result when the query is matched or not matched.  I think there 
> should be:
> 
>      {}    query not matched.
>      {<>}  query matched, empty variable binding tuple.
> 
> ...

Those answers would be the right ones.

Should turn this into a test case.
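
One possible shape for such a test, sketched in Python.  The function
`match_ground_pattern` is hypothetical; it handles only a pattern with no
variables, which is the case Graham raises.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def match_ground_pattern(graph: Set[Triple], pattern: Triple) -> List[Dict[str, str]]:
    """Match a pattern that contains no variables.

    Returns [] when the pattern is not in the graph, and [{}] (a single
    solution with no bindings) when it is.
    """
    return [{}] if pattern in graph else []


graph = {(":s1", ":p1", ":o1")}
matched = match_ground_pattern(graph, (":s1", ":p1", ":o1"))      # like {<>}
not_matched = match_ground_pattern(graph, (":s1", ":p1", ":o2"))  # like {}
```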

> 
> Section 11.3
> 
> I'm uneasy about the DESCRIBE feature.  It seems to be going rather beyond 
> the basic idea of RDF graph query, and doesn't seem to have well or clearly 
> defined semantics.
> 
> I think the effort here might be better applied to query language 
> extensions that permit some kind of recursively-defined pattern, so that 
> various kinds of sub-graph neighbourhoods can be described according to an 
> application's requirements.  A simple use-case would be to describe the 
> entire content of an rdf:collection from just its head element.

The DESCRIBE form means that the client does not set the shape of the
query result graph - it may not know the shape in advance and will analyse
the graph returned.

An example in the doc should help as would test cases and a fuller text.

We have to address whether there needs to be any support for returning
collections and containers even in SELECT.

In RDF there are two paradigms: one is of statements, but for the application
writer there are also the concepts of collection and container.  Returning
a located list could be reasonable as would returning all its elements.

> 
> ...
> 
> Section 12.
> 
> Testing values.  Is there a way to combine tests with non-strict 
> evaluation, so that something like:
> 
>     AND isBound ?x AND ?x < 20
> 
> can be reliably processed?

This is covered in newer drafts in sec 12.  If ?x is unbound, isBound is
false so the result is false (as in evaluation of ?x < 20).  Evaluation 
involving unbound variables is false unless otherwise noted (e.g. unbound()).
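
A minimal sketch of that rule in Python, assuming integer values and
treating a missing dict key as an unbound variable.  The function names are
hypothetical and do not come from the draft.

```python
from typing import Dict

Solution = Dict[str, int]


def is_bound(sol: Solution, var: str) -> bool:
    """True when the variable has a value in this solution."""
    return var in sol


def less_than(sol: Solution, var: str, limit: int) -> bool:
    """A comparison on an unbound variable is false rather than an error."""
    if var not in sol:
        return False
    return sol[var] < limit


def passes_constraint(sol: Solution) -> bool:
    """Evaluate: AND isBound ?x AND ?x < 20."""
    return is_bound(sol, "x") and less_than(sol, "x", 20)
```

Because `less_than` is already false for an unbound ?x, the outcome does not
depend on whether the conjunction short-circuits.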

That's a point about whether AND and && are *exactly* the same.

> 
> ...
> 
> Section 12, "Are tests syntax for RDF predicates or separate concepts?"
> 
> This makes me uneasy.  I feel that there may be tests that are not easily 
> or naturally presented as RDF syntax.  Probably with enough contortion it 
> can be managed, but is it helpful?  How does a test like "isBound ?x" play 
> here?
> 
> Part of my viewpoint here is that there should be, as far as possible, a 
> clear separation between structure within RDF literal values and structure 
> that is expressed within the RDF graph.

That's one point of view - other people see it the other way round.  Not 
sure the degree to which this matters - may be able to be neutral.

> (For this reason, I'm not 
> enthusiastic about using XML schema structured datatypes as RDF literals, 
> when the structure over the component values could be quite naturally 
> expressed using RDF statements.  This leads me to think that the query 
> language tests here should really be trying to capture those things that 
> aren't comfortably captured as RDF properties.)

XML schema structured datatypes will probably be accessible only via 
extensions.  They are not in the basic set of functions and operators.

((I don't think that we have the remit to go one way or the other on XML
schema structured datatypes))

> 
> ...
> 
> That's it, for now.

Useful comments

> 
> #g
> 
> 
> 
> ------------
> Graham Klyne
> For email:
> http://www.ninebynine.org/#Contact
> 
>
Received on Sunday, 17 October 2004 17:28:03 UTC