Comments on SPARQL Query Language for RDF (draft)

Dear SPARQL-editors and -enthusiasts,

The following are comments on the editor's working draft, revision 1.256
(2005/03/17), for "SPARQL Query Language for RDF". The feedback is
inspired by our experience with the development and use of a number of
query languages in Sesame[1], most notably SeRQL[2]. Apologies for
coming up with such a long list of comments this late in the process,
but we honestly haven't been able to find the time for a thorough review
of the document until now. A number of editorial comments can be found
at the end of this e-mail.

Arjohn Kampman
Jeen Broekstra

[1] http://www.openrdf.org/
[2] http://www.openrdf.org/doc/SeRQLmanual.html


General comments (in no specific order)
---------------------------------------

- We are not very fond of the SELECT-WHERE-FILTER construction.
   Considering that the FROM keyword is no longer used for specifying
   datasets, how about adopting the SQL-style SELECT-FROM-WHERE
   construction instead? It could prevent confusion among people coming
   from the database world, who expect the WHERE-clause to contain
   boolean constraints.

- The document suggests that (parts of) queries can only be evaluated on
   a specific graph: either the background graph or a named graph. We
   would have expected that, when no specific graph label is specified,
   the query would be evaluated on the union of all graphs. The grammar
   mentions a "GRAPH * ..." construction, which might be related to this
   but which is not explained in the document.

- Named graphs are identified by URIs; bnodes or literals cannot be used
   for this purpose. This forces application developers to generate URIs
   when a simple string would be sufficient. Supporting literals as graph
   names would allow developers to use simple strings or datatyped dates
   to tag specific sets of statements. Would this be useful?

- The definition of DESCRIBE is very loose: maybe too loose to be useful
   in practice? An application developer would likely want a guarantee as
   to whether the mechanism yields the info that is needed. As it is now,
   the mechanism could very well result in the development of several
   "DESCRIBE-dialects", which offer this guarantee for specific use
   cases. We think a fixed definition like "it returns the bnode closure
   for the URIs concerned" would be more useful.

- SeRQL offers default bindings for the often used prefixes 'rdf',
   'rdfs' and 'xsd'. If not specified in the query itself, these prefixes
   map to the standard RDF, RDF Schema and XML Schema namespaces. This
   has proved to be very convenient. Is this a feature that should be
   added to SPARQL too? We noted that the comment for version 1.244 of
   the document mentions: "Removed text for default prefixes for rdf:
   rdfs: owl: xsd:", but we were unable to find a reason for this in the
   mailing list archives.

- The current specification allows only variables to be specified in the
   SELECT-clause. However, on some occasions it can be very convenient
   to be able to specify constants or functions in the projection. For
   example:
   * When an application fires two queries, one of them specifying a
     default value (a constant) for tuples where that specific column
     does not get a value from the graph. This becomes even more useful
     when the UNION operator operates on queries instead of on graph
     patterns (see later comments also).
   * When an application is interested in the sum, product, etc. of two
     or more fields, e.g. when converting from one currency to another.

- Concerning the remark in section 3.2:
     "Open: whether to allow "foo"@?v or ?v@fr or ?v^^xsd:integer or
     "foo"^^?v".
   If functions like STR(A) and LANG(A) were also allowed in the
   projection (see previous comment), they would provide a good
   alternative to the above constructions.

- The current specification describes a UNION operator that is applied
   to graph patterns rather than to queries, as is done in SQL. This
   affects the expressivity of the query language if constants and/or
   functions are allowed in the projection. The following example query,
   an alternative to the queries described in section 6.1, illustrates
   this by using a constant in the projection:

     PREFIX ...
     SELECT ?title "1.0"
     WHERE { ?book dc10:title ?title }
     UNION
     SELECT ?title "1.1"
     WHERE { ?book dc11:title ?title }

   The expected result of this query is:

     title                             | version
     ----------------------------------|--------
     "SPARQL Protocol Tutorial"        | "1.1"
     "SPARQL Query Language Tutorial"  | "1.0"

- There is a strong demand from the Sesame community to add ORDER BY and
   GROUP BY/COUNT functionality to SeRQL. It's good to see that the
   former has already been added to the editor's draft. However, we feel
   that the latter is just as important. Having to transmit complete
   query results only to be able to count specific rows adds a lot of
   unnecessary network traffic and can really hurt performance.

- Section 2.1 mentions:
     "Prefixes apply to the query after they are defined; redefining a
     prefix causes the new defintion to be used from that point in the
     syntax."
   The fact that prefixes apply to the query after they are defined is
   trivial, as prefixes must be defined at the start of a query
   (according to the grammar). Allowing prefixes to be redefined doesn't
   seem to make much sense in the context of SPARQL (in contrast to
   Turtle). Rather, it is more than likely that duplicate prefix
   declarations are caused by sloppiness on the part of the query writer
   (e.g. copy-paste errors). This type of error is often very hard to
   detect; it would therefore be wise to disallow redefinition of
   prefixes and flag any occurrence of it as an error.

- We have strong doubts about allowing blank nodes to be used as a kind
   of anonymous variable. People who are new to the query language will
   probably assume that specific bnodes can be specified in queries,
   causing confusion when they find out that it doesn't work like that.
   Also, the extra notation for variables doesn't appear to add any
   expressive power to SPARQL and seems to be a purely syntactic matter.
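
To make the comment on prefix redefinition (section 2.1) above more
concrete, here is a minimal sketch in Python of the kind of check we have
in mind. It is only an illustration: the regular expression is a
simplified, hypothetical pattern for prefix declarations and is not
taken from the SPARQL grammar, and the function name
find_redefined_prefixes is our own invention.

```python
import re

# Hypothetical, simplified pattern for a declaration such as
#   PREFIX dc: <http://purl.org/dc/elements/1.1/>
# This is an assumption for illustration, not the SPARQL grammar.
PREFIX_DECL = re.compile(r"PREFIX\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>", re.IGNORECASE)


def find_redefined_prefixes(query):
    """Return the prefixes that are declared more than once."""
    seen = set()
    redefined = []
    for match in PREFIX_DECL.finditer(query):
        prefix = match.group(1) or ""  # empty string stands for the default prefix
        if prefix in seen and prefix not in redefined:
            redefined.append(prefix)
        seen.add(prefix)
    return redefined


query = """
PREFIX dc: <http://purl.org/dc/elements/1.0/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title WHERE { ?book dc:title ?title }
"""

print(find_redefined_prefixes(query))  # ['dc']
```

A query parser could run an equivalent check while reading the prologue
and reject the query with an error that names the offending prefix,
instead of silently using the last definition.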

Editorial comments
------------------

Section 2.1:
* typo: "...causes the new defintion to be..."
* The query in "Data descriptions used in this document" is said to be
   equivalent to the previous query, which is not true: this query
   has a variable as subject, whereas the previous query has a URI.

Section 2.4:
* typo: "...where each of the tripe patterns matches..."

Section 3.1:
* All but the first query use ?v in the SELECT-clause without binding it
   in the WHERE-clause.

Section 3.2:
* The query is said to be using a blank node as a variable, which is not
   true.
* typo: "A patten may be...". Also, the sentence in question appears to
   be formulated incorrectly.
* "Note that a constraint can be considered to be a triple with a
   special predicate." -- Superfluous remark? Why is this mentioned when
   constraints cannot be written down as such?

Section 4:
* The definition of Graph Pattern includes Graph Pattern itself. Is this
   correct?
* typo: "A Basic Graph Patterns..."
* typo: "...is, as described above, is..."
* The second query uses the ';' character at the end of a triple pattern
   but continues with another full triple pattern.

Section 5:
* typo: "...to be added to solution where..."

Section 5.5:
* The query is missing the ?mbox variable in the SELECT-clause.

Section 6:
* typo: "...provides a means combining..."
* The queries in the subsections map the 'dc10' prefix to the DC 1.1
   namespace and the 'dc11' prefix to the DC 1.0 namespace. This is not
   logical and even makes the second query incorrect (when compared to
   the described result).

Section 7:
* typo: "...hold a multiple RDF graphs..."
* typo: "G is a called the..."
* typo: "...does not need to described..."

Section 8.1:
* The 'data' prefix is defined but not used in the query.

Section 8.3:
* typo: "...whether in about GRAPH clause..."
* typo: "...in one part of a querym..."
* typo: "...as foudn in..."
* typo: "...to a particualr..."

Section 8.4:
* typo: "...a aggregator has found read in a..."
* The 'data' prefix is defined but not used in the query.

Section 10.2:
* This section covers serialization issues, specifically elaborating on
   the fact that results can be serialized into XML or an RDF graph. We
   feel that this part is a bit off-topic and that it would be better to
   replace it with a simple reference to the SPARQL protocol WD. After
   all, the work on the protocol isn't finished yet and it _might_ come
   up with another solution.
* "If both DISTINCT and LIMIT are specified, then duplicates are
   eliminated before the limit is applied." -- OFFSET should also
   be mentioned in this context.
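
To illustrate why the position of OFFSET matters, here is a small Python
sketch of one possible evaluation order: duplicates are eliminated first,
then the offset is applied, then the limit. Placing OFFSET after DISTINCT
is our own reading and not something we found in the draft; the function
name apply_modifiers and the sample rows are made up for this example.

```python
def apply_modifiers(rows, distinct=False, offset=0, limit=None):
    """Apply DISTINCT, then OFFSET, then LIMIT to a list of result rows.

    The DISTINCT-before-LIMIT order follows the sentence quoted above;
    applying OFFSET in between is an assumption made for illustration.
    """
    if distinct:
        # dict.fromkeys drops duplicates and keeps first-occurrence order
        rows = list(dict.fromkeys(rows))
    rows = rows[offset:]
    if limit is not None:
        rows = rows[:limit]
    return rows


rows = [("a",), ("a",), ("b",), ("c",), ("c",), ("d",)]

print(apply_modifiers(rows, distinct=True, offset=1, limit=2))
# [('b',), ('c',)]
```

If the offset and limit were applied before duplicate elimination, the
same call would slice out ("a",) and ("b",) instead, so the document
should state the order explicitly.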

Section 10.3:
* The first paragraph still mentions the "CONSTRUCT * ..." option.

Section 11.1.1:
* typo: "...considers the the following..."
* typo: "...any r:Literal may be is cast to..."

Section 11.2.0.1:
* typo: "...takes a boolean arguement..."
* Table 11.1 documents the result type of the LANG(A) operator to be
   rdf:uri. This should probably be xs:string?

Section A:
* We have a number of remarks concerning the grammar, which is ambiguous
   or at least needs unnecessarily large look-aheads in a number of
   rules. However, we're not sure if the grammar is considered to be
   final enough for this kind of comment. Please let us know if you're
   interested.

Section B:
* Given the large number of similarities between SPARQL and SeRQL, it's
   hard to imagine that SeRQL was not used as a reference language. If it
   was used, we would really appreciate it if a reference to SeRQL were
   added to this section.

Received on Friday, 18 March 2005 16:13:38 UTC