Review of "rq24" reorg. of SPARQL Query Language for RDF (part 2) from Lee Feigenbaum on 2006-08-15 (public-rdf-dawg@w3.org from July to September 2006)

From: Lee Feigenbaum <feigenbl@us.ibm.com>
Date: Tue, 15 Aug 2006 00:39:05 -0400
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <OFAC9A878F.16E32287-ON852571CB.00194352-852571CB.00198940@us.ibm.com>
This is an early review of the reorganization of the SPARQL Query
Language for RDF specification known as rq24. I've divided the review
into comments on the overall structure and presentation of the document,
specific editorial comments on content in the document, and
layout/rendering nits. (Admittedly, some of the distinctions are a bit
arbitrary.) I have not attempted to review rq24 with respect
to substantive issues currently facing the working group, or as to the
correctness of the formal definitions. I have also not yet reviewed
section 11 Testing Values or the appendices.

In this note I present the editorial comments on content in the document.

Editorial:

+ Abstract. The abstract is not an abstract. The text provides a bit of
background material and perhaps a one-sentence summary of what the
SPARQL query language is. I'd suggest something like:

""" This document describes the query language part of the SPARQL
Protocol And RDF Query Language for easy access to RDF stores. It is
designed to meet the requirements and design objectives described in RDF
Data Access Use Cases and Requirements [UCNR]. The SPARQL query language
consists of the syntax and semantics for asking and answering queries
against RDF graphs. SPARQL contains capabilities for querying triple
patterns, conjunctions, disjunctions, and optional patterns. It also
supports constraining queries by source RDF graph and extensible value
testing. Results of SPARQL queries can be ordered, limited and offset in
number, and presented in several different forms. """


+ 1.1.1 Namespaces. I think the prefixes in the table should include
colons. (Ex. "rdf:" rather than "rdf"). This facilitates searching for
the prefix declarations.


+ 1.1.2 Data Descriptions. Our reference for Turtle is to a document
under /2001/sw/DataAccess which is basically a pointer to
http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/ . Should we update
the reference to point to this document, or is the indirect reference
good enough?


+ 2 Making Simple Queries. This whole section talks about matching graph
patterns. I tend to think for a mini-primer this is OK, but it is at
odds with the formal definition which is now based on entailment. 


+ 2.2 Multiple Matches. "The results enumerate the RDF terms to which
the selected variables can be bound in the query pattern." As written,
this sentence seems to indicate that there are no other RDF terms to
which the variables could possibly be bound. Perhaps it should be
qualified along the lines of:

"The results enumerate the RDF terms to which the selected variables can
be bound in the query pattern in order to match triples in the data."


+ 2.3.3 Matching Language Tags. The first query (with no solutions)
should have an empty solution set depicted underneath it for
completeness.


+ 2.4 Value Constraints. "It is possible to further restrict solutions
by constraining the allowable bindings of variables to RDF Terms." I'd
suggest removing "to RDF Terms" or rewriting as "It is possible to
further restrict solutions by constraining the allowable RDF Terms to
which variables can be bound."


+ 2.4.1 Restricting the values of strings. I find the text here
confusing. Some suggestions:

""" 
One way to restrict the possible RDF literals is to use a regular
expression with the regex operator.
"""
-->
"""
Variable bindings to RDF literals can be restricted to strings matching
a regular expression by using the regex operator.
"""

"""
Only plain literals with no language tag and XSD strings are matched by
regex but it is possible to get the lexical form of a literal using str.
"""
-->
"""
The regex operator only matches <code>xsd:string</code> typed
literals or plain literals with no language tag. regex can match against
the lexical forms of other literals by using the str operator.
"""

"""
which may be made case-insensitive with the "i" flag.
"""
-->
"""
Regular expression matches may be made case-insensitive with the "i"
flag.
"""


+ 2.4.2 Restricting the values of numbers. I don't think we ever refer
to the "presentation" of a literal anywhere else. I suggest:

"""
Filters apply to the value of the literal, not its lexical form.
"""

In general the text refers to variables directly by name, without
quotation marks, so <code>"price"</code> should be simply
<code>price</code>.

I'd suggest:

""" 
By constraining the <code>price</code> variable, only <code>book2</code>
matches the query because only <code>book2</code> has a price less than
<code>30.5</code>, as the filter condition requires.
"""

+ 2.6 Querying Reification Vocabulary. I think it might be worth a note
that says that SPARQL does not treat the reification vocabulary terms
specially. Something like:

"""
Note that SPARQL does not treat querying reified data any differently
from any other data. As with other data, SPARQL can be used to query
graph-pattern matches using the reification vocabulary.
"""

+ 3.1.1 Syntax for IRIs. 

"""
Prefixed names

The PREFIX keyword associates a prefix label with an IRI. A prefixed
name is a prefix label and a local part, separated by a colon ":". It is
mapped to an IRI by concatenating the local part to the IRI
corresponding to the prefix.
"""

I think it's worth adding "(possibly empty)" before "prefix label". I
think that "prefix" at the end of this paragraph should be "prefix
label."

Later in this section, three examples of different ways to write the
same IRI are given. In the BASE and PREFIX cases, I'd suggest adding a
"..." line in between the PREFIX/BASE clause and the IRI reference, to
emphasize that this is simply an excerpt of SPARQL using these
abbreviation mechanisms.
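
For what it's worth, the mapping described in the quoted paragraph is
just string concatenation, which the following Python sketch spells
out, including the empty prefix label case. The expand_prefixed_name
helper and the example IRIs are assumptions made for illustration.

```python
from typing import Dict


def expand_prefixed_name(name: str, prefixes: Dict[str, str]) -> str:
    """Map a prefixed name to an IRI by appending the local part to the
    IRI associated with the (possibly empty) prefix label."""
    label, colon, local = name.partition(":")
    if not colon:
        raise ValueError(f"not a prefixed name: {name!r}")
    return prefixes[label] + local


# Hypothetical declarations, as PREFIX clauses would provide them.
prefixes = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "": "http://example.org/book/",  # empty prefix label, as in "PREFIX : <...>"
}

title_iri = expand_prefixed_name("dc:title", prefixes)
book_iri = expand_prefixed_name(":book1", prefixes)
```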

+ 3.1.2 Syntax for Literals. The introductory text and bulleted examples
should discuss the triple quotation mark version of literals. Perhaps
text like:

"""
To facilitate writing literal values which themselves contain quotation
marks, SPARQL provides an additional quoting construct in which literals
are enclosed in three single- or double-quotation marks.
"""

And then an example such as:

"""The librarian said, "Perhaps you would enjoy 'War and Peace.'""""

+ 3.1.4 Syntax for Blank Nodes. I feel that this section is a bit
confused between whether it wants to define the syntax in terms of blank
node labels only (leaving the mapping between labels and blank nodes to
elsewhere in the spec), or in terms of blank nodes themselves. (I think
someone (FredZ?) suggested a reworking of some of the BGP-matching
definitions that assumed that we only worked with blank nodes there,
which would require that this section fully explains how to map from
syntactic constructs to blank nodes. (But that would be difficult since
at this point the concept of a BGP has not yet been introduced.))

I'd be glad to take a stab at rewriting some of the text here to
explicitly map only from syntactic constructs to blank node labels at
this point in the spec if that would be helpful. If we went this route,
I think that 5.4 Basic Graph Patterns in the SPARQL Syntax might be an
appropriate place to include text explaining how blank node labels map
to blank nodes.

+ 3.2.2 Object Lists. This section includes the sentence:

"""
Note that both the triple patterns involving foaf:nick will need to
match, not that one or the other should match.
"""

I'd suggest removing this sentence. This section of the document is
purely syntactic in nature, and this sentence bleeds into the territory
of matching triple patterns, which has not been introduced yet.

+ 3.2.3 RDF Collections.

"""
RDF collections can be written in triple patterns using the syntax "(
)". The form () is an alternative for the IRI rdf:nil which is
http://www.w3.org/1999/02/22-rdf-syntax-ns#nil. When used with
collection elements, such as (1 ?x 3 4), triple patterns and blank nodes
are allocated for the collection and the blank node at the head of the
collection can be used as a subject or object in other triple patterns.
"""

First, we've already defined the rdf: prefix for the extent of this
document, so I think including the full IRI is unnecessary here. 

Second, as with 3.1.4, I think this section should be worded in terms of
allocating blank node *labels* that do not otherwise appear in the
query. This maintains a clean separation between the syntactic concerns
of section 3 and the semantic concerns of most of the rest of the
document.
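
As a rough illustration of what "allocating blank node labels that do
not otherwise appear in the query" could look like, here is a Python
sketch of the purely syntactic expansion. The expand_collection helper,
the "_:c0"-style label scheme, and the use of prefixed names as plain
strings are all assumptions of the sketch, not anything rq24 specifies.

```python
from itertools import count
from typing import List, Set, Tuple

RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"
Triple = Tuple[str, str, str]


def expand_collection(items: List[str], used_labels: Set[str]) -> Tuple[str, List[Triple]]:
    """Return the head term and the triple patterns for a collection
    such as (1 ?x 3 4), using only labels absent from used_labels."""
    if not items:
        return RDF_NIL, []  # the form () stands for rdf:nil
    counter = count()
    labels: List[str] = []
    while len(labels) < len(items):
        candidate = f"_:c{next(counter)}"
        if candidate not in used_labels:  # skip labels the query already uses
            labels.append(candidate)
    triples: List[Triple] = []
    for i, item in enumerate(items):
        rest = labels[i + 1] if i + 1 < len(labels) else RDF_NIL
        triples.append((labels[i], RDF_FIRST, item))
        triples.append((labels[i], RDF_REST, rest))
    return labels[0], triples


# The query is assumed to already use the label _:c0 elsewhere.
head, triples = expand_collection(["1", "?x", "3", "4"], {"_:c0"})
```

The head label returned here is what would then be usable as a subject
or object in other triple patterns; nothing in the sketch needs the
notion of a blank node itself, only labels.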

+ 4 Initial Definitions. "RDF Concepts and Abstract Syntax "anticipates
an RFC on Internationalized Resource Identifiers. Implementations may
issue warnings concerning the use of RDF URI References that do not
conform with [IRI draft] or its successors."" That sentence seems out of
the blue to me. It could use some motivation.

+ 4.1 RDF Terms. 
 
Why does the word "updated" link to the section in RDF Concepts about
URI refs?

"IRIs include URIs [RFC3986] and URLs." Don't IRIs include URLs simply by
virtue of URLs being a subset of URIs? (There's actually at least one
other place in the document where I noticed this, but didn't comment on
it.)

+ 4.2 Triple Patterns. "Any SPARQL triple pattern with a literal as
subject will fail to match on any RDF graph." While this is true, it's
really a consequence of how matches are defined, which we haven't seen
yet. I'd either remove this sentence, or at least change it to say
"Because RDF graphs may not contain literal subjects, any
SPARQL triple pattern with a literal as a subject will fail to match any
RDF graph."

+ 4.4 Value Constraints. BOUND is a special case here, which doesn't fit
into what's described here. (Because it acts on the variable, not on a
value or an RDF term.) Perhaps it should be explicitly mentioned?

+ 5.3 Examples of Basic Graph Pattern Matching. This contains the text:

"""
There is a blank node [CONCEPTS] in this dataset, identified by _:a. The
label is only used within the file for encoding purposes. The label
information is not in the RDF graph.
"""

This is superfluous given the explanations in section 3. I think these
sentences should be removed.

+ 6 Group Graph Patterns. I agree with the @@ in the document that the
summary of graph-pattern types at the beginning of this section can be
removed now that it is basically repeated in 4 Initial Definitions.

+ 6.1 Group Graph Patterns. It would be nice if the definition of Group
Graph Pattern used the abbreviation GGP instead of GP which is usually
used for graph patterns (that are not necessarily group graph patterns).

"""
For any solution, the same variable is given the same value everywhere
in the set of graph patterns making up the group graph pattern. For
example, this query has a group graph pattern of one basic graph pattern
as the query pattern.

In a SPARQL query string, a group graph pattern is delimited with
braces: {}. 
"""

I think that the middle sentence belongs at the end and can be
clarified. Perhaps:

"""
For any solution, the same variable is given the same value everywhere
in the set of graph patterns making up the group graph pattern. 

In a SPARQL query string, a group graph pattern is delimited with
braces: {}. For example, the query pattern for this query is a single
group graph pattern. This group graph pattern contains a single basic
graph pattern, which in turn contains two triple patterns.
"""

+ 9 RDF Dataset. s/comprises of/comprises/ (in American English, at
least :-). 

+ 10.1 Solution Sequences and Result Forms. It'd be nice if something
here linked back to the definition of a pattern solution from section 4,
perhaps around the phrase "each solution being a function from variables
to RDF terms."

+ 10.1.3 DISTINCT. I'd suggest adding a sentence to the effect that
the DISTINCT keyword/modifier can only be used with the SELECT result
form.

+ 10.2 Selecting Variables. I'd prefer something like "Selecting Variable
Bindings," and a similar change to the first sentence: "The SELECT form
of results returns the variables directly."

+ 10.2 Selecting Variables.

""" 
Result sets can be accessed by the local API but also can be
serialized into either XML or an RDF graph.
"""

Results *can* be serialized in any number of other ways, also. I think
that "the local API" is confusing since there's no other reference to
such a creature. Maybe "a local API" is better, or no mention at all. I
think just saying that SPARQL Query Results XML Format provides one
serialization of SELECT results in an XML vocabulary would suffice and
be less confusing.

"""
The syntax SELECT * is an abbreviation that selects all of the
variables.
"""
-->
"""
The syntax SELECT * is an abbreviation that selects all of the
variables that appear in the query.
"""

+ 10.3 Constructing an Output Graph. Since the section before talks
about a serialization of the results, I wonder if this section should
have something to say about the SPARQL query language specification not
constraining the serialization of the graph resulting from a CONSTRUCT
query. Perhaps a pointer to the appropriate part of the protocol
document? Similarly for DESCRIBE in 10.4.

+ 10.4 Description of resources. This section should say something about
DESCRIBE *, along the lines of what is said for SELECT *.

+ 10.4.2 Identifying Resources. This says: "If, however, the query
pattern has multiple solutions, the RDF data for each is the union of
all RDF graph descriptions." I know that DESCRIBE is underspecified,
but wonder if it would be safer to say "merge" here rather than "union"?
Or perhaps "union" is purposeful here to allow descriptions of different
terms to share blank nodes?
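
To show the difference I have in mind, here is a small Python sketch
contrasting the two operations on descriptions that share a blank node
label. The triples, the union and merge helpers, and the renaming
scheme are illustrative assumptions, and the sketch assumes both
descriptions were drawn from the same graph so that equal labels denote
the same blank node.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Graph = Set[Triple]


def is_blank(term: str) -> bool:
    """Blank node labels are written with a leading "_:" in this sketch."""
    return term.startswith("_:")


def blank_labels(graph: Graph) -> Set[str]:
    """Collect every blank node label used in the graph."""
    return {term for triple in graph for term in triple if is_blank(term)}


def union(g1: Graph, g2: Graph) -> Graph:
    """Plain set union: equal blank node labels stay the same node."""
    return g1 | g2


def merge(g1: Graph, g2: Graph) -> Graph:
    """Merge: rename blank nodes apart per input graph, then take the union."""
    def rename(graph: Graph, suffix: str) -> Graph:
        return {
            (s + suffix if is_blank(s) else s, p, o + suffix if is_blank(o) else o)
            for (s, p, o) in graph
        }
    return rename(g1, "_g1") | rename(g2, "_g2")


# Hypothetical descriptions of two resources that both mention _:b.
about_alice = {("ex:alice", "foaf:knows", "_:b"), ("_:b", "foaf:name", '"Bob"')}
about_carol = {("ex:carol", "foaf:knows", "_:b"), ("_:b", "foaf:name", '"Bob"')}

united = union(about_alice, about_carol)
merged = merge(about_alice, about_carol)
```

With "union" the two descriptions end up talking about one shared blank
node; with "merge" they talk about two distinct ones.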


Lee
Received on Tuesday, 15 August 2006 04:39:19 UTC