- From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
- Date: Wed, 22 Feb 2006 18:56:54 -0500 (EST)
- To: public-rdf-dawg-comments@w3.org
Comments on Section 1 and Section 2 of
SPARQL Query Language for RDF
W3C Working Draft 20 February 2006
http://www.w3.org/TR/2006/WD-rdf-sparql-query-20060220/

These are personal comments, from me, an interested expert. They may not reflect the views of any institution with which I am associated.

In general I found the first two sections of the document *very* hard to understand. The mixing of definitions, explanation, information, etc. confused me over and over again. I strongly suggest an organization something like:

  Introduction (informative)
  Formal development (normative)
  Underlying notions (normative)
  Patterns and matching (normative)
  SPARQL syntax (normative)
  Informal narrative (informative)
  Examples (informative)

I also found that things that didn't need to be explained were explained, and things that did need to be explained were not explained. A major example of the latter is the role of the scoping graph. Examples showing why E-matching is defined the way it is would be particularly useful.

Because of the problems I see in Section 2, I do not feel that I can adequately understand the remainder of the document.

Because of these problems I do not feel that this document should be advanced to the next stage in the W3C recommendation process without going through another last-call stage. (This could however be performed by terminating the current last call, quickly fixing the document, and starting another last call.)

Specific comments follow:

Section 1.

An RDF graph is a set of triples; each triple consists of a <em>subject</em>, a <em>predicate</em> and an <em>object</em>. This is defined in RDF Concepts and Abstract Syntax.

C1.1: An unqualified "this" cannot be used at the beginning of the second sentence.

The RDF graph may be virtual, in that it is not fully materialized,

C1.2: Defining virtual in terms of another term that is not itself defined is not very useful.

only doing the work needed for each query to execute.

C1.3: Who is doing what work here?

SPARQL is a query language for getting information from such RDF graphs.

C1.4: Surely a more formal tone is called for here.

It provides facilities to:
- extract information in the form of URIs, blank nodes, plain and typed literals.
- extract RDF subgraphs.
- construct new RDF graphs based on information in the queried graphs.

C1.5: I don't recognize the intent of SPARQL in any of these options.

As a data access language, it is suitable for both local and remote use.

C1.6: The "it" is rather too far from its referent.

The companion SPARQL Protocol for RDF document describes the remote access protocol.

C1.7: What about the "local" access protocol? Is there one? If so, where is it? If not, why is there not one?

<!-- Commented Document Outline -->

C1.8: There appear to be significant commented-out portions of the document. Do such parts of the document have any import? If so, then they probably should not be commented-out. If not, then the commented-out portions should be removed.

Section 2.

C2.15: In general, Section 2 switches modes much too much. Which parts of Section 2 are tutorial? Which are definitional? Which are explanatory?

The SPARQL query language is based on matching graph patterns.

C2.1: What is a "matching graph pattern"? I do not believe that it is defined in the remainder of the document. (Yes, yes, I know that the problem is actually that the sentence itself is poorly constructed.)

The simplest graph pattern is the triple pattern, which is like an RDF triple, but with the possibility of a variable instead of an RDF term in the subject, predicate or object positions.

C2.4: This should probably be stated more precisely, using, at least, "and/or".

Combining triple gives a basic graph pattern, where an exact match to a graph is needed to fulfill a pattern.

C2.2: Probably "triple" should be "triples".

C2.3: I do not believe that this matches the intent of SPARQL queries.

The example below shows a SPARQL query to find the title of a book from the information in the given RDF graph.

C2.5: The use of "the given" here is not helpful. I feel that it would be better to use an indefinite article instead.

The terms delimited by "<>" are IRI references [...]. They stand for IRIs, either directly, or relative to a base IRI.

C2.6: What is a term? Which terms? What does "stand for" mean here? What role does the base IRI play in this "stand for" relationship?

C2.7: The rules for IRIs are not adequately specified in Section 2.1.1. Are the two abbreviated mechanisms enclosed in "<>"? Can a prefix expand to a relative IRI?

optional datatype IRI or prefixed name (introduced by ^^)

C2.8: Can this be a relative IRI? Is it expanded using the rules of Section 2.1.1?

Variables in SPARQL queries have global scope; it is the same variable everywhere in the query that the same name is used

C2.9: Wrong number agreement.

Blank nodes are indicated by either the form _:a or use of [ ].

C2.10: Is _:a the *only* blank node allowed? If not, which parts of these bits of syntax can vary, and how?

Triple Patterns are written as a list of subject, predicate, object;

C2.11: The examples of triple patterns don't seem to be written this way.

The following examples express the same query:
[several examples]
Prefixes are syntactic: the prefix name does not affect the query, nor do prefix names in queries need to be the same prefixes as used in a serialization of the data. The following query is equivalent to the previous examples and will give the same results when applied to the same data:
[one example]

C2.12: The first group of examples appears to exhibit more internal variability than the single example adds. Why, then, is the single example broken out? Is there something that I am missing here?

The data format used in this document is

C2.13: What is the "data"?

C2.16: Section 2.1 claims to be about "Writing a Simple Query", but doesn't seem to provide any guidance on this topic.

2.2 Initial Definitions

C2.14: There appear to have been quite a number of definitions already. How, then, can this be an "initial" set of definitions?

A query variable is a member of the set V where V is infinite and disjoint.

C2.20: What is V? Perhaps you mean V to be some arbitrary, but fixed set.

Definition: Graph Pattern
A Graph Pattern is one of:
- Basic Graph Pattern
- Group Graph Pattern
- Value Constraints
- Optional Graph Pattern
- Union Graph Pattern
- RDF Dataset Graph Pattern

C2.15: Are these all part of simple queries? If not, what is this doing in Section 2? Ditto for the definition for SPARQL Query.

Definition: SPARQL Query
A SPARQL query is a tuple (GP, DS, SM, R) where:

C2.16: What, then, are the things in Section 2.1 that contain the SELECT keyword?

The following triple pattern has a subject variable (the variable book), a predicate dc:title and an object variable (the variable title).
  ?book dc:title ?title .

C2.17: dc:title does not appear to be valid as any second element of a triple pattern.

Definition: Triple Pattern
A triple pattern is member of the set:
  (RDF-T union V) x (I union V) x (RDF-T union V)

C2.18: How is the syntax above (?book dc:title ?title .) mapped into this set?

This definition of Triple Pattern includes literal subjects. [...] This definition also allows blank nodes in the predicate position.

C2.19: The referent is too far away for this construction.

Definition: Pattern Solution
A variable solution is a substitution function from a subset of V, the set of variables, to the set of RDF terms, RDF-T.
A pattern solution, S, is a variable substitution whose domain includes all the variables in V and whose range is a subset of the set of RDF terms.
The result of replacing every member v of V in a graph pattern P by S(v) is written S(P). If v is not in the domain of S then S(v) is defined to be v.

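For readers following the quoted Pattern Solution definition, the substitution S(P) can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the draft: the string encoding of terms, the "?" prefix test for variables, the names `is_variable` and `substitute`, and the term `ex:book1` are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Assumed encoding (not from the draft): a term is a plain string, and a
# string starting with "?" is a query variable, i.e. a member of V.
Term = str
TriplePattern = Tuple[Term, Term, Term]


def is_variable(term: Term) -> bool:
    """Return True if the term is a query variable under the assumed encoding."""
    return term.startswith("?")


def substitute(solution: Dict[str, Term], pattern: TriplePattern) -> TriplePattern:
    """Compute S(P) for a single triple pattern P.

    Each variable v in the domain of S is replaced by S(v); a variable
    outside the domain is left as v, as the last sentence of the quoted
    definition states. Non-variable terms are never changed.
    """
    return tuple(solution.get(t, t) if is_variable(t) else t for t in pattern)


pattern = ("?book", "dc:title", "?title")
partial_solution = {"?book": "ex:book1"}  # ?title is not in the domain
result = substitute(partial_solution, pattern)
```

With a substitution whose domain is only part of V, `?title` is returned unchanged. That case can arise only if the domain of S is allowed to be smaller than V.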
C2.21: I thought that V was the set of variables. Why then write "all the variables in V"?

C2.22: Given that the domain of S is all the variables in V, i.e., all the variables, then what use is the last sentence of the above definition?

has a single triple pattern as the query pattern

C2.23: What is the "query pattern" of a query? Perhaps you mean the graph pattern of the query?

An E-entailment regime is a binary relation between subsets of RDF graphs.

C2.24: Perhaps you mean "between sets of RDF graphs"?

Definition: Scoping Graph
The Scoping Graph G' for RDF graph G, is an RDF Graph that is graph-equivalent to G

C2.25: FATAL: There can be many RDF graphs that are graph-equivalent to a particular RDF graph. Therefore the Scoping Graph is not adequately defined.

The scoping graph makes the graph to be matched independent of the chosen blank node names.

C2.25a: Which chosen blank node names? Why should this matter at all? Aren't the blank node names simply a notational convenience?

C2.25b: This needs to be proven.

Definition: Basic Graph Pattern E-matching
Given an entailment regime E, a basic graph pattern BGP, and RDF graph G, with scoping graph G', then BGP E-matches with pattern solution S on graph G with respect to scoping set B if:
- BGP' is a basic graph pattern that is graph-equivalent to BGP
- G' and BGP' do not share any blank node labels.
- (G' union S(BGP')) is a well-formed RDF graph for E-entailment
- G E-entails (G' union S(BGP'))
- The RDF terms introduced by S all occur in B.

C2.26: Some of the elements of the point list are missing punctuation.

C2.27: FATAL: The status of B is not adequately provided. Is B a parameter of E-matching or is it somehow determined by the other parameters?

These definitions allow for future extensions to SPARQL.

C2.28: Which definitions?

This document defines SPARQL for simple entailment and the scoping set B is the set of all RDF terms in G'.

C2.29: SPARQL for simple entailment?
Probably you mean something like "This document only defines the simple entailment version of SPARQL".

C2.30: The second half of this sentence does not make any sense. Perhaps you mean something like "The simple entailment version of SPARQL (hereafter SPARQL) is based on BGP E-matching where the entailment regime (E) is always simple entailment and the scoping set (B) is always the set of RDF terms in G'."

C2.31: FATAL: This still leaves SPARQL matching with the following parameters:
  1/ the graph pattern BGP
  2/ the RDF graph G
  3/ the scoping graph G' (which is not adequately defined)
The problem with G' needs to be addressed.

A pattern solution can then be defined as follows: to match a basic graph pattern under simple entailment, it is possible to proceed by finding a mapping from blank nodes and variables in the basic graph pattern to terms in the graph being matched; a pattern solution is then a mapping restricted to just the variables, possibly with blank nodes renamed. Moreover, a uniqueness property guarantees the interoperability between SPARQL systems: given a graph and a basic graph pattern, the set of all the pattern solutions is unique up to blank node renaming.

C2.32: Where is G' in this operation?

C2.33: It seems to me that SPARQL simple matching is entirely deterministic. Given BGP, G, and G', the set of pattern solutions that make BGP match G with scope G' is fixed. I then don't understand the "unique up to blank node renaming" above.

C2.34: If I am missing something here, and there indeed is something to be shown, then it has to be proven.

There is a blank node [..] in this dataset, identified by _:a.

C2.34: What is "dataset"?

C2.35: Are there not two blank nodes in this dataset?

In the SPARQL syntax, Basic Graph Patterns are sequences of triple patterns mixed with value constraints.

C2.36: Why not say something like "value constraints can be mixed in sequences of triple patterns. The triple patterns form a BGP."?

The results of a query is

C2.37: Why not "The result"?

C2.39: I believe that it would be very useful to show the four matches generated by the basic query pattern in Section 2.6 (as well as the two matches for the BGP in Section 2.5.3).

Blank nodes in the results of a query are identical to those occurring in the dataset graphs

C2.38: This is very misleading. SPARQL matching does indeed restrict the bnodes in query results to be bnodes from the RDF graph, but not in a useful way. For example,
  ?x ex:a ex:b .
matches against
  _:a ex:a _:b .
with two results for ?x, at least as far as I can determine.

C2.39: I believe that there are four matches for the BGP in Section 2.7. Why are only two results given?

Received on Wednesday, 22 February 2006 23:57:06 UTC