Re: Comments on SPARQL Query Language for RDF (draft)

Arjohn, Jeen,

Thanks for the comments.  I have incorporated the editorial ones; this reply only 
contains discussion points.

Arjohn Kampman wrote:
> Dear SPARQL-editors and -enthusiasts,
> The following are comments on the editors working draft,

Comments against the editors' working draft are fine.

 >  revision 1.256

Thank you for including the version number.

> (2005/03/17), for "SPARQL Query Language for RDF". The feedback is
> inspired by our experience with the development and use of a number of
> query languages in Sesame[1], most notably SeRQL[2]. Apologies for
> coming up with such a long list of comments this late in the process,
> but we honestly haven't been able to find the time for a thorough review
> of the document until now. A number of editorial comments can be found
> at the end of this e-mail.
> Arjohn Kampman
> Jeen Broekstra
> [1]
> [2]
> General comments (in no specific order)
> ---------------------------------------
> - We are not very fond of SELECT-WHERE-FILTER construction. Considering
>    that the FROM keyword is no longer used for specifying datasets; how
>    about adopting the SQL-style SELECT-FROM-WHERE construction instead?
>    It could prevent confusion with people coming from a database world
>    that expect the WHERE-clause to contain boolean constraints.

The protocol will provide some means for specifying the target of a query so the 
matter has not changed.  I agree that people thinking that SPARQL is some 
strange SQL will cause problems but at the same time, the analogy is also 

Unlike SQL, FILTER can appear inside the pattern, and not in a separate clause, 
so app writers can place it next to the thing it affects if they wish to.

By the way, WHERE is actually an optional word.  You can write queries without 
it if you prefer.

> - The document suggests that (parts of) queries can only be evaluated on
>    a specific graph: either the background graph or a named graph. We
>    would have expected that, when no specific graph label is specified,
>    the query would be evaluated on the union of all graphs.

That setup is possible - make the background graph include the RDF merge of the 
named graphs - but it is not the only configuration of an RDF dataset.  The 
background graph is the knowledge base and includes the things the application 
is saying are its knowledge - it may not automatically believe what's in some or 
all of the named graphs.
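
To make the two configurations concrete, here is a rough Python sketch (not 
SPARQL, and not anything from the draft).  Graphs are modelled as plain sets of 
triples; the graph names, the data and the helper names are all made up for 
illustration.  A real RDF merge also has to keep blank nodes from different 
graphs apart, which this sketch sidesteps by using none.

```python
# Hypothetical named graphs: the names and triples are illustrative only.
named_graphs = {
    "http://example.org/alice": {("ex:alice", "foaf:name", "Alice")},
    "http://example.org/bob": {("ex:bob", "foaf:name", "Bob")},
}

def merged_background(named):
    """Union of all named graphs (a stand-in for the RDF merge; no bNodes here)."""
    background = set()
    for triples in named.values():
        background |= triples
    return background

# Configuration 1: the background graph includes the merge of the named graphs.
union_dataset = (merged_background(named_graphs), named_graphs)

# Configuration 2: the background graph holds only what the application itself
# asserts; the named graphs are available but not automatically believed.
trusted_dataset = ({("ex:alice", "foaf:name", "Alice")}, named_graphs)

def match_background(dataset, predicate):
    """Match a one-triple pattern (?s predicate ?o) against the background graph only."""
    background, _named = dataset
    return sorted(t for t in background if t[1] == predicate)
```

With the first dataset a pattern with no graph label sees both names; with the 
second it sees only what the application has chosen to assert.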

 > The grammar
>    mentions a "GRAPH * ..." construction, which might be related to this
>    but which is not explained in the document.

Will make sure it is explained or removed from the grammar.

> - Named graphs are identified by URIs; bnodes or literals cannot be used
>    for this purpose. This forces application developers to generate URIs
>    when a simple string would be sufficient. Supporting literals as graph
>    names would allow developers to use simple string or datatyped dates
>    to tag specific sets of statements. Would this be useful?

Web resources are named by URI - the global uniqueness means that one system can 
communicate that name to another without confusion.

> - The definition of DESCRIBE is very loose: maybe too loose to be useful
>    in practice? An application developer would likely have a guarantee as
>    to whether the mechanism yields the info that is needed. As it is now,
>    the mechanism could very well result in the development of several
>    "DESCRIBE-dialects", which offer this guarantee for specific use
>    cases. We think a fixed definition like "it returns the bnode closure
>    for the concerning URIs" would be more useful.

There have been many definitions of a description and each seems to have some 
application domain assumptions.  The SPARQL protocol service description would 
be a place to state what a given service offers - the point about DESCRIBE is 
that it is not defined exactly by the client (c.f. CONSTRUCT).

Even "bnode closure" is tricky - FOAF is all bNodes.

We may see common descriptions emerging in various domains, such as LSID 

> - SeRQL offers default bindings for the often used prefixes 'rdf',
>    'rdfs' and 'xsd'. If not specified in the query itself, these prefixes
>    map to the standard RDF, RDF Schema and XML Schema namespaces. This
>    has proved to be very convenient. Is this a feature that should be
>    added to SPARQL too? We noted that the comment for version 1.244 of
>    the document mentions: "Removed text for default prefixes for rdf:
>    rdfs: owl: xsd:", but we we're unable to find a reason for this in the
>    mailing list archives.

It didn't seem to have sufficient support from within the WG.

> - The current specification allows only variables to be specified in the
>    SELECT-clause. However, on some occasions it can be very convenient
>    to be able to specify constants or functions in the projection. For
>    example:
>    * When an application fires two queries, one of them specifying a
>      default value (a constant) for tuples where that specific column
>      does not get a value from the graph. This becomes even more useful
>      when the UNION operator operates on queries instead of on graph
>      patterns (see later comments also).
>    * When an application is interested in the sum, product, etc. of two
>      or more fields, e.g. when converting from one currency to another.
> - Concerning the remark in section 3.2:
>      "Open: whether to allow "foo"@?v or ?v@fr or ?v^^xsd:integer or
>      "foo"^^?v".
>    When functions like STR(A) and LANG(A) would also be allowed in the
>    projection (see previous comment), this would give a good alternative
>    to the above constructions.

I'll pass this one on to the working group.  It does provide a way to remove the 
need for "foo"@?v.

> - The current specification describes a UNION operator that can be
>    applied to graph patterns, instead of to queries like is done in SQL.
>    This affects the expressivity of the query language when constants
>    and/or functions would be allowed in the projection. The following
>    example query, an alternative to the queries described in section 6.1,
>    illustrates this by using a constant in the projection:
>      PREFIX ...
>      SELECT ?title "1.0"
>      WHERE { ?book dc10:title ?title }
>      UNION
>      SELECT ?title "1.1"
>      WHERE { ?book dc11:title ?title }
>    The expected result of this query being:
>      title                             | version
>      ----------------------------------|--------
>      "SPARQL Protocol Tutorial"        | "1.1"
>      "SPARQL Query Language Tutorial"  | "1.0"

Firstly - SQL allows subqueries, including in the (SQL) FROM clause, which makes 
the question of where the UNION goes much less clear-cut.

Second - if the application cares about where the pattern matches, there are 
various ways of approaching this, for example using a different variable in each 
branch:

   { ?book dc10:title ?title_10 } UNION { ?book dc11:title ?title_11 }

so that one, and only one, of the variables will be available in each row.
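
As a rough illustration of how an application could recover the "version" 
column from such rows, here is a Python sketch.  The book identifiers and the 
helper names are hypothetical, the titles are the ones from your expected result, 
and rows are modelled as dictionaries in which an unbound variable is simply 
absent.

```python
# Hypothetical data: one book described with DC 1.0, one with DC 1.1.
dc10_titles = {"ex:book1": "SPARQL Query Language Tutorial"}
dc11_titles = {"ex:book2": "SPARQL Protocol Tutorial"}

def union_rows(dc10, dc11):
    """Rows as the UNION pattern would produce them: each row binds the
    variable of exactly one branch."""
    rows = []
    for book, title in dc10.items():
        rows.append({"book": book, "title_10": title})
    for book, title in dc11.items():
        rows.append({"book": book, "title_11": title})
    return rows

def with_version(row):
    """Derive (title, version) from whichever title variable is bound."""
    if "title_10" in row:
        return (row["title_10"], "1.0")
    return (row["title_11"], "1.1")

rows = union_rows(dc10_titles, dc11_titles)
table = sorted(with_version(row) for row in rows)
```

The application, rather than the query, supplies the constant, based on which 
variable is bound in the row.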

> - There is a strong demand from the Sesame community to add ORDER BY and
>    GROUP BY/COUNT functionality to SeRQL. It's good to see that the
>    former has already been added to the editor's draft. However, we feel
>    that the latter is just as important. Having to transmit complete
>    query results only to be able to count specific rows adds a lot of
>    unnecessary network traffic and can really hurt performance.

Could you write this up as a use case?  What is being counted?  Individuals or 
names (URI labels, bNodes, etc.)?

As a use case, even if the issue is not addressed in this round, it can be logged 
as a postponed issue.  In particular, there are strong closed world assumptions 
about applying aggregate functions so it would be good to understand as much 
about this requirement as possible.

> - Section 2.1 mentions:
>      "Prefixes apply to the query after they are defined; redefining a
>      prefix causes the new defintion to be used from that point in the
>      syntax."
>    The fact that prefixes apply to the query after they are defined is
>    trivial as prefixes must be defined at the start of a query (according
>    to the grammar). Allowing prefixes to be redefined doesn't seem to
>    make much sense in the context of SPARQL (this in contrary to Turtle).
>    Rather, it is more than likely that duplicate prefix declarations are
>    caused by slopiness on the account of the query writer (e.g.
>    copy-paste errors). This type of error is often very hard to detect,
>    therefore it would be wise disallow redefinition of prefixes and flag
>    the occurence of these as errors.
> - We have strong doubts about allowing blank nodes to be used as a kind
>    of anonymous variables. People that are new to the query language will
>    probably assume that specific bnodes can be specified in queries,
>    causing confusion when they find out that it doesn't work like that.

We can't stop that assumption.  The way to avoid the confusion is more down to 
how systems explain bNodes to people and making sure that bNode labels never 
escape to the application.  Many systems assume global (internal) labels to 
bNodes but this is an implementation technique they choose to use and so are 
responsible for.

>    Also, the extra notation for variables doesn't appear to add any
>    expressive power to SPARQL and seems to be a purely syntactic thing.

The bNodes arise from the RDF collection and N3 property lists syntax elements.

 From below:
 > Section A:
 > * We have a number of remarks concerning the grammar, which is ambiguous
 >    or at least needs unnecessary large look-aheads in a number of rules.
 >    However, we're not sure if the grammar is considered to be final
 >    enough for this kind of comments. Please let us know if you're
 >    interested.

The grammar is getting close.  There is a tradeoff to be had between expressing the 
grammar clearly and introducing extra, artificial states (they don't represent 
an abstraction the app writer thinks about) for some particular grammar tool. 
The objective is not to be the grammar a particular system can just copy across.

Globally, the lookahead is 1 - locally, a parser may either wish to use extra 
states or locally increase lookahead.  What parsing mechanism are you using?

> Editorial comments
> ------------------

Noted and fixed where still relevant.


> Section 2.1:
> * typo: "...causes the new defintion to be..."
> * The query in "Data descriptions used in this document" is said to be
>    equivalent to the previous query, which is not true: this query
>    has a variable as subject, whereas the previous query has a URI.
> Section 2.4:
> * typo: "...where each of the tripe patterns matches..."
> Section 3.1:
> * All but the first query use ?v in the SELECT-clause without binding it
>    in the WHERE-clause.
> Section 3.2:
> * The query is said to be using a blank node as a variable, which is not
>    true.
> * typo: "A patten may be...". Also, the concerning sentence appears to
>    be formulated incorrectly.
> * "Note that a constraint can be considered to be a triple with a
>    special predicate." -- Superfluous remark? Why is this mentioned when
>    constraints cannot be written down as such?
> Section 4:
> * The definition of Graph Pattern includes Graph Pattern itself. Is this
>    correct?
> * typo: "A Basic Graph Patterns..."
> * typo: ", as described above, is..."
> * The second query uses the ';' character at the end of a triple pattern
>    but continues with another full triple pattern.
> Section 5:
> * typo: " be added to solution where..."
> Section 5.5:
> * The query is missing the ?mbox variable in the SELECT-clause.
> Section 6:
> * typo: "...provides a means combining..."
> * The queries in the subsections map the 'dc10' prefix to the DC 1.1
>    namespace and the 'dc11' prefix to the DC 1.0 namespace. This is not
>    logical and even makes the second query incorrect (when compared to
>    the described result).
> Section 7:
> * typo: "...hold a multiple RDF graphs..."
> * typo: "G is a called the..."
> * typo: "...does not need to described..."
> Section 8.1:
> * The 'data' prefix is defined but not used in the query.
> Section 8.3:
> * typo: "...whether in about GRAPH clause..."
> * typo: " one part of a querym..."
> * typo: " foudn in..."
> * typo: " a particualr..."
> Section 8.4:
> * typo: "...a aggregator has found read in a..."
> * The 'data' prefix is defined but not used in the query.
> Section 10.2:
> * This section covers serialization issues, specifically elaborating on
>    the fact that results can be serialized into XML or an RDF graph. We
>    feel that this part is a bit off-topic and that it would be better to
>    replace it with a simple reference to the SPARQL protocol WD. After
>    all, the work on the protocol isn't finished yet and it _might_ come
>    up with another solution.
> * "If both DISTINCT and LIMIT are specified, then duplicates are
>    eliminated before the limit is applied." -- OFFSET should also
>    be mentioned in this context.
> Section 10.3:
> * The first paragraph still mentions the "CONSTRUCT * ..." option.
> Section 11.1.1:
> * typo: "...considers the the following..."
> * typo: "...any r:Literal may be is cast to..."
> Section
> * typo: "...takes a boolean arguement..."
> * Table 11.1 documents the result type of the LANG(A) operator to be
>    rdf:uri. This should probably be xs:string?
> Section A:
> * We have a number of remarks concerning the grammar, which is ambiguous
>    or at least needs unnecessary large look-aheads in a number of rules.
>    However, we're not sure if the grammar is considered to be final
>    enough for this kind of comments. Please let us know if you're
>    interested.
> Section B:
> * Given the large number of similarities between SPARQL and SeRQL, it's
>    hard to imagine that SeRQL was not used as a reference language. If it
>    was used, we would really appreciate if a reference to SeRQL was added
>    to this section.

Received on Monday, 21 March 2005 13:26:11 UTC