SPARQL 1.1 Update - Comments from David Booth on 2011-07-29 (public-rdf-dawg-comments@w3.org from July 2011)

From: David Booth <david@dbooth.org>
Date: Fri, 29 Jul 2011 14:54:07 -0400
To: public-rdf-dawg-comments <public-rdf-dawg-comments@w3.org>
Message-ID: <1311965647.2100.4823.camel@dbooth-laptop>
Regarding
http://www.w3.org/TR/2011/WD-sparql11-update-20110512/
It's great to see these documents in Last Call!

Comments:

1. Please either add capability for virtual graphs or keep the COPY, ADD
and MOVE shortcuts, to enable standard SPARQL to be used more
efficiently as a rules language and in data production pipelines.  COPY,
ADD and MOVE operations cost almost nothing to implement, and they help
with efficiency.  By "virtual graph" I mean a graph that consists of the
merge of a particular set of named graphs -- a very important capability
for efficient data production pipelines.
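
To make the request concrete, here is a minimal Python sketch of what I
mean. It is only an illustration under stated assumptions: the Graph
Store is modelled as a dict of triple sets, plain set union stands in
for an RDF merge (blank node handling is ignored), and virtual_graph is
a hypothetical helper, not anything defined in the spec.

```python
# Graph Store modelled as {graph_name: set of (s, p, o) triples}.
# Assumption: plain set union stands in for an RDF merge.

def copy_graph(store, src, dst):
    """COPY: dst ends up with exactly the triples of src."""
    if src != dst:
        store[dst] = set(store.get(src, set()))

def add_graph(store, src, dst):
    """ADD: dst gains the triples of src; src is unchanged."""
    store.setdefault(dst, set()).update(store.get(src, set()))

def move_graph(store, src, dst):
    """MOVE: like COPY, after which src is removed."""
    if src != dst:
        copy_graph(store, src, dst)
        store.pop(src, None)

def virtual_graph(store, names):
    """Hypothetical 'virtual graph': merge of the listed named graphs,
    computed on demand without storing a new graph."""
    merged = set()
    for name in names:
        merged |= store.get(name, set())
    return merged

store = {
    "g1": {("ex:a", "ex:p", "ex:b")},
    "g2": {("ex:c", "ex:p", "ex:d")},
}
add_graph(store, "g1", "g3")
view = virtual_graph(store, ["g1", "g2"])
```

The point of the sketch is that a pipeline stage can read "view" as if
it were a single graph, while each underlying named graph stays
separately updatable.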


2. This paragraph in sec 3.1.3 is a bit confusing:
[[
That is, the GroupGraphPattern in the WHERE clause will be matched
against the dataset described by explicit USING or USING NAMED clauses,
if specified, and against the graph store otherwise. Any graph name
specified in a WITH clause will - for evaluating the WHERE clause -
refer to the default graph to be used in the absence of USING or USING
NAMED clauses. In the presence of one or more graphs referred to in
USING clauses, the default graph will be the merge of these graphs,
meaning that the graph in a WITH clause will be ignored while evaluating
the WHERE clause. If there is no USING clause, but there is one or more
USING NAMED clauses, then the dataset will include an empty graph for
the default graph.
]]
In particular, the sentence "Any graph name specified in a WITH clause
will - for evaluating the WHERE clause - refer to the default graph to
be used in the absence of USING or USING NAMED clauses." seems odd.  The
graph specified in the WITH clause will refer to the *default* graph?  I
would think it would be used *instead* of the default graph.  Isn't that
the point of WITH?  Perhaps the term "default graph" is being used in an
unusual way in this paragraph, to mean "the graph that will be used in the
absence of USING or USING NAMED"?  I think it would be misleading to
call that a "default graph".  Normally the term "default graph" refers
to the unnamed slot in a Graph Store, per the first paragraph in section
2.  I think it would be best to use the term only in that way.

Part of the confusion may be related to the ambiguous use of the term
"dataset".  For example, consider the sentence: "That is, the
GroupGraphPattern in the WHERE clause will be matched against the
dataset described by . . . ".  When I read this, I took the term
"dataset" to mean:
http://en.wikipedia.org/wiki/Data_set
However, I am wondering if you actually meant "RDF Dataset" as defined
here:
http://www.w3.org/TR/sparql11-query/#rdfDataset
If you meant the former, I suggest using the term "set of data", to
avoid ambiguity.  If you meant the latter, I suggest using the term "RDF
Dataset", and perhaps linking it to its definition.

Also, I notice that:

- There are many occurrences of the unqualified word "dataset".  I
suggest checking them all, to see if they should be "RDF Dataset".

- Capitalization of the terms "RDF Dataset" and "Graph Store" is
inconsistent -- sometimes written "RDF dataset" or "graph store".  It
would help if it were consistently capitalized, as it helps the reader
know that you are intending a specially defined term.

If I have understood the intent, it sounds like there are two sets of
data involved in a DELETE/INSERT operation: one set is used in
evaluating the WHERE clause, and the other is the target graph of the
DELETE/INSERT, i.e., the graph that will be modified by the operation.
If so, I think it would be helpful to state this up front, and make up a
term for each of these sets, such as: "the set of data for the WHERE
clause" and "the target graph".  Hmm, maybe the SPARQL 1.1 Query spec
uses the term "active graph" for the former?
http://www.w3.org/TR/sparql11-query/#rdfDataset 
In any case, it would be helpful to define specific terms for these, and
use them consistently.

Also, it may be clearer to reword this paragraph as a decision tree,
since the logic that is being described is a bit complex for
unstructured English prose: 

  If ___ then ___ . Otherwise, if ___ then ___ . Otherwise ___ .
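
As a rough illustration of such a decision tree, here is a Python
sketch of my reading of the quoted paragraph. The store layout and the
function name where_dataset are assumptions made up for the example,
and the handling of named graphs in the WITH case is my interpretation
of "against the graph store otherwise", not something the paragraph
states outright.

```python
def where_dataset(store, using=(), using_named=(), with_graph=None):
    """Return (default_graph, named_graphs) for evaluating the WHERE clause.

    Assumed layout: store = {"default": set, "named": {name: set}}.
    """
    named = store["named"]
    if using or using_named:
        # USING / USING NAMED present: any WITH graph is ignored here.
        # The default graph is the merge of the USING graphs, or empty
        # if only USING NAMED clauses were given.
        default = set()
        for name in using:
            default |= named.get(name, set())
        return default, {n: named.get(n, set()) for n in using_named}
    if with_graph is not None:
        # No USING clauses: the WITH graph takes the default graph's place.
        return named.get(with_graph, set()), named
    # Otherwise: match against the Graph Store as it stands.
    return store["default"], named

example_store = {
    "default": {("ex:d", "ex:p", "ex:o")},
    "named": {
        "g1": {("ex:a", "ex:p", "ex:b")},
        "g2": {("ex:c", "ex:p", "ex:d")},
    },
}
default_graph, named_graphs = where_dataset(
    example_store, using=["g1", "g2"], with_graph="g1")
```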



3. In searching for the definition of the backslash "\" symbol in
section 4.2, it looks like it is supposed to be set difference, but I do
not see it listed in either of these tables of standard mathematical or
logic symbols:
http://en.wikipedia.org/wiki/List_of_mathematical_symbols 
http://en.wikipedia.org/wiki/Table_of_logic_symbols
However, I now see that this is because the table uses a different Unicode
character, so a browser search did not find it:
http://en.wikipedia.org/wiki/List_of_mathematical_symbols
I suggest adding a brief note of clarification to section 4.2 stating
that the backslash symbol ("\") indicates set difference.  Personally, I
prefer the minus sign ("-") for set difference, though my tastes may be
biased toward certain programming languages.
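
For what it is worth, this is the kind of notation I have in mind. The
Python fragment below is only an illustration of set difference followed
by union on sets of triples; it is not the formal definition from
section 4.2, and the triples are made up.

```python
graph = {("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:p", "ex:c")}
to_delete = {("ex:a", "ex:p", "ex:c"), ("ex:x", "ex:p", "ex:y")}
to_insert = {("ex:a", "ex:q", "ex:d")}

# Set difference is written "\" in mathematical notation and "-" in Python.
# Triples in to_delete that are absent from the graph have no effect.
updated = (graph - to_delete) | to_insert
```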


4. The difference between "USING" and "USING NAMED" is not explained,
except in passing: "This describes a dataset in a manner similar to FROM
and FROM NAMED clauses in the SPARQL1.1 Query Language."


5. As written, this in sec 3.1:
http://www.w3.org/TR/sparql11-update/#graphUpdate
[[
Graph update operations change existing graphs in the Graph Store but do
not explicitly delete nor create them. Non-empty inserts into
non-existing graphs will, however, implicitly create those graphs, i.e.,
an implementation *should* create graphs that do not exist before
triples were inserted into them (there may be implementations providing
an update service over a fixed set of graphs which in such case *must*
return with failure for update requests that would create an unallowed
graph), and *may* remove graphs that are left empty after triples are
removed from them.
]]
seems to say that an implementation that operates over a *variable*
(non-fixed) set of graphs still has the option of not automatically
creating graphs that do not exist.  

I suggest rewording the above portion as:
[[
Graph update operations change existing graphs in the Graph Store but do
not explicitly delete nor create them. Non-empty inserts into
non-existing graphs will normally implicitly create those graphs, i.e.,
an implementation fulfilling an update request *should* silently and
automatically create graphs that do not exist before triples are
inserted into them, and *must* return with failure if it fails to do so
for any reason.  (For example, the implementation may have insufficient
resources, or an implementation may only provide an update service over
a fixed set of graphs.)  An implementation *may* remove graphs that are
left empty after triples are removed from them.
]]
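
The behaviour I am proposing could be sketched as follows. This is a
minimal Python model under assumptions of my own: the Graph Store is a
dict of triple sets, UpdateFailure and the allowed_graphs parameter are
hypothetical names, and dropping empty graphs is shown as an option
because the text only says an implementation *may* do so.

```python
class UpdateFailure(Exception):
    """Hypothetical error raised when an update request cannot be fulfilled."""

def insert_triples(store, graph_name, triples, allowed_graphs=None):
    """Insert triples, silently creating the graph if it does not exist."""
    triples = set(triples)
    if not triples:
        return  # an empty insert never creates a graph
    if graph_name not in store:
        # A service over a fixed set of graphs must fail, not ignore.
        if allowed_graphs is not None and graph_name not in allowed_graphs:
            raise UpdateFailure("cannot create graph " + graph_name)
        store[graph_name] = set()
    store[graph_name] |= triples

def delete_triples(store, graph_name, triples, drop_empty=True):
    """Delete triples; optionally remove a graph that is left empty."""
    if graph_name in store:
        store[graph_name] -= set(triples)
        if drop_empty and not store[graph_name]:
            del store[graph_name]
```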


6. Similarly, I suggest rewording the following in section 3.1.1:
http://www.w3.org/TR/sparql11-update/#insertData 
[[
If no graph is described in the QuadData, then the default graph is
presumed. If data is inserted into a graph that does not exist in the
graph store, it *should* be created (there may be implementations
providing an update service over a fixed set of graphs which in such
case *must* return with failure for update requests that insert data
into an unallowed graph).
]]
to:
[[
If no graph is described in the QuadData, then the default graph is
presumed.  If data is inserted into a graph that does not exist in the
graph store, the update service SHOULD create that graph.  The service
MUST return with failure if it fails to do so for any reason.
]]


7. And similarly in section 3.1.3 I suggest changing:
http://www.w3.org/TR/sparql11-update/#deleteInsert 
[[
If an operation tries to insert into a graph that does not exist, then
that graph should be created; again, there may be implementations
providing an update service over a fixed set of graphs which in such
case must return with failure for update requests that would create an
unallowed graph. If no data is to be inserted, then no graph will be
created, even if applying the operation to a different dataset would
result in data being inserted.
]]
to: 
[[
If an operation tries to insert into a graph that does not exist, then
the update service *should* create that graph.  The service MUST return
with failure if it fails to do so for any reason.  If no data is to be
inserted, then no graph will be created, even if applying the operation
to a different dataset would result in data being inserted.
]]


8. How is the URI of a Graph Store indicated?  The concept of a Graph
Store is central to the SPARQL 1.1 Update spec, and hence one should be
able to use a URI to refer to a particular Graph Store, but the spec
doesn't say how this is done.

The SPARQL 1.1 Service Description spec contains no sd:GraphStore
class.  

The SPARQL 1.1 Graph Store HTTP Protocol spec sometimes mentions a Graph
Store, but does not make clear how the intended Graph Store is
identified.  It does say: "A compliant implementation of this
specification SHOULD accept HTTP requests directed at its Graph Store".
But what if a service hosts multiple Graph Stores?  

According to
http://www.w3.org/TR/sparql11-update/#graphStore 
a Graph Store "is a mutable container of RDF graphs managed by a single
service" which "contains one (unnamed) slot holding a default graph and
zero or more named slots holding named graphs".

Language in section 2.1
http://www.w3.org/TR/sparql11-update/#graphStoreQueryServices
"There is no presumption that the graph store managed by an update
service . . . " suggests that an update service can only have *one*
Graph Store, but: (a) I do not see this stated explicitly anywhere; (b)
it would be useful for an update service to be able to have more than
one Graph Store; and (c) what is the point of defining the notion of an
"update service" if it is one-to-one with a Graph Store?  AFAICT, doing
so just adds an unnecessary layer and confusion.

The SPARQL 1.1 Service Description spec does define the notion of an
sd:Dataset, which is close to the notion of a Graph Store, but (if I
understand the definition of Graph Store in
http://www.w3.org/TR/sparql11-update/#graphStore )
a Graph Store is mutable, whereas an sd:Dataset is not.

The reason one would want to have an update service that contains more
than one Graph Store is that it would allow operations on collections of
graphs to be performed efficiently.  For example, an RDF data pipeline
may need to generate one collection of graphs from another, all within
the same update service.  In other words, the content of one Graph Store
is generated from the content of another Graph Store.  This is important
because for efficiency, it is helpful to be able to subdivide large
graphs into collections of smaller graphs.  An example might be a
collection of 200,000 patient graphs.  There may be *multiple*
collections of these patient graphs, A, B and C, where collection C is
derived from collection B which is derived from collection A in a
pipeline.  Since each patient graph within each of these collections is
relatively independent, when one graph in A is updated it is far more
efficient to update only the corresponding graphs in B and C, rather
than regenerating the entire B and C collections.  It would be very
convenient if each of these collections could be stored in a
sd:GraphStore (presuming such a class is defined) within the same update
service so that appropriate update operations could be selectively
performed on them, with the assurance (for efficiency) that they are
within the same update service.  

Oddly, there is a distinction between a Graph Store (which is mutable)
and an RDF Dataset (which is not), but there is no corresponding
distinction made with graphs.  They are treated as mutable in the SPARQL
1.1 Update spec: they can be the subject of an INSERT or DELETE
operation.

Actually, in reading the definition of RDF Dataset
http://www.w3.org/TR/sparql11-query/#rdfDataset
I do not see anything that would prevent it from changing over time.
Certainly an RDF Dataset contains a particular set of graphs at the
moment when it is queried, but I see no prohibition against that same
RDF Dataset containing a different set of graphs at a different time.
Hence, it looks to me like the notion of Graph Store could be dropped in
favor of using the term "RDF Dataset" universally throughout both the
Query and Update documents.  I think this would make more sense than
using two different terms: both queries and updates would operate on RDF
Datasets.  


9. Typo: s/needs not be authoritative/need not be authoritative/



Thanks!


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.
Received on Friday, 29 July 2011 18:54:42 UTC