Re: SPARQL 1.1 Update - Comments from Paul Gearon on 2012-07-31 (public-rdf-dawg-comments@w3.org from July 2012)

From: Paul Gearon <gearon@ieee.org>
Date: Tue, 31 Jul 2012 10:25:25 -0400
To: David Booth <david@dbooth.org>
Cc: public-rdf-dawg-comments <public-rdf-dawg-comments@w3.org>
Message-ID: <CAGZNPFks_x3LOUb2xvakhidrRZUCrtfW80937DfZZioFAsOXsw@mail.gmail.com>

Hello David,

Thank you for your comments. I apologise that this response has been
so long delayed. Please be assured that your comments were addressed
in the SPARQL Update document some time ago, though this formal
response was stuck in the queue until now.

We have addressed your concerns below.

On Fri, Jul 29, 2011 at 2:54 PM, David Booth <david@dbooth.org> wrote:
> Regarding
> http://www.w3.org/TR/2011/WD-sparql11-update-20110512/
> It's great to see these documents in Last Call!
>
> Comments:
>
> 1. Please either add capability for virtual graphs or keep the COPY, ADD
> and MOVE shortcuts, to enable standard SPARQL to be used more
> efficiently as a rules language and in data production pipelines.  COPY,
> ADD and MOVE operations cost almost nothing to implement, and they help
> with efficiency.  By "virtual graph" I mean a graph that consists of the
> merge of a particular set of named graphs -- a very important capability
> for efficient data production pipelines.

The features of COPY, ADD and MOVE were considered "At Risk" until the
working group was confident that they could be implemented without
undue difficulty. Now that we have some reports of successful
implementation, the "At Risk" designation has been removed.
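
As an illustration of why these shortcuts are inexpensive to support,
the sketch below writes each one out long-hand using the more general
operations, following our understanding of the equivalences given in
the Update document. The graph IRIs ex:a and ex:b are hypothetical,
and the three groups are independent illustrations shown in one
request only for compactness.

```sparql
PREFIX ex: <http://example.org/>

# Shortcut: ADD ex:a TO ex:b
# Inserts all triples of ex:a into ex:b, leaving both in place.
INSERT { GRAPH ex:b { ?s ?p ?o } }
WHERE  { GRAPH ex:a { ?s ?p ?o } } ;

# Shortcut: COPY ex:a TO ex:b
# Same as ADD, but the previous contents of ex:b are dropped first.
DROP SILENT GRAPH ex:b ;
INSERT { GRAPH ex:b { ?s ?p ?o } }
WHERE  { GRAPH ex:a { ?s ?p ?o } } ;

# Shortcut: MOVE ex:a TO ex:b
# Same as COPY, but the source graph is removed afterwards.
DROP SILENT GRAPH ex:b ;
INSERT { GRAPH ex:b { ?s ?p ?o } }
WHERE  { GRAPH ex:a { ?s ?p ?o } } ;
DROP GRAPH ex:a
```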

The group feels that adding a feature like "virtual graphs" at this
late stage of publication is not possible.


> 2. This paragraph in sec 3.1.3 is a bit confusing:
> [[
> That is, the GroupGraphPattern in the WHERE clause will be matched
> against the dataset described by explicit USING or USING NAMED clauses,
> if specified, and against the graph store otherwise. Any graph name
> specified in a WITH clause will - for evaluating the WHERE clause -
> refer to the default graph to be used in the absence of USING or USING
> NAMED clauses. In the presence of one or more graphs referred to in
> USING clauses, the default graph will be the merge of these graphs,
> meaning that the graph in a WITH clause will be ignored while evaluating
> the WHERE clause. If there is no USING clause, but there is one or more
> USING NAMED clauses, then the dataset will include an empty graph for
> the default graph.
> ]]
> In particular, the sentence "Any graph name specified in a WITH clause
> will - for evaluating the WHERE clause - refer to the default graph to
> be used in the absence of USING or USING NAMED clauses." seems odd.  The
> graph specified in the WITH clause will refer to the *default* graph?  I
> would think it would be used *instead* of the default graph.  Isn't that
> the point of WITH?  Perhaps the term "default graph" is being used in an
> unusual way in this paragraph, to mean "the graph that will used in the
> absence of USING or USING NAMED"?  I think it would be misleading to
> call that a "default graph".  Normally the term "default graph" refers
> to the unnamed slot in a Graph Store, per the first paragraph in section
> 2.  I think it would be best to use the term only in that way.

Unfortunately, the term "default graph" has two accepted meanings. The
first is the graph that may be referred to without a name in a graph
store (not necessarily an unnamed graph), while the second refers to
the graph that is referenced in a SPARQL WHERE clause when no
GRAPH block has been specified. By default, these two are equivalent,
but the latter is modified to be the merge of all graphs listed in
FROM clauses in a query (USING in updates) or by specifying a
default-graph-uri parameter in the SPARQL protocol.

We have changed the text to the following to clarify the use of WITH:

"That is, the GroupGraphPattern in the WHERE clause will be matched
against the dataset described by explicit USING or USING NAMED
clauses, if specified, and against the default graph provided by the
Graph Store otherwise.

The WITH clause provides a convenience for when an operation primarily
refers to a single graph. If a graph name is specified in a WITH
clause, then - for the purposes of evaluating the WHERE clause - this
will define a dataset containing a default graph with the specified
name, but only in the absence of USING or USING NAMED clauses. In the
presence of one or more graphs referred to in USING clauses and/or
USING NAMED clauses, the WITH clause will be ignored while evaluating
the WHERE clause."
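
A minimal sketch of the two cases described in this text follows. The
prefix ex: and the graph and property names are hypothetical. In the
second operation, as we read the revised text, only the evaluation of
the WHERE clause ignores WITH; the INSERT template has no GRAPH block
and so still applies to the WITH graph.

```sparql
PREFIX ex: <http://example.org/>

# Case 1: no USING clause. WITH names the graph that the templates
# modify and that serves as the default graph for the WHERE clause.
WITH ex:g1
DELETE { ?person ex:status "pending" }
INSERT { ?person ex:status "active" }
WHERE  { ?person ex:status "pending" } ;

# Case 2: a USING clause is present. The WHERE clause is matched
# against ex:g2, and WITH is ignored for that purpose.
WITH ex:g1
INSERT { ?person ex:status "active" }
USING ex:g2
WHERE  { ?person ex:status "pending" }
```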


> Part of the confusion may be related to the ambiguous use of the term
> "dataset".  For example, consider the sentence: "That is, the
> GroupGraphPattern in the WHERE clause will be matched against the
> dataset described by . . . ".  When I read this, I took the term
> "dataset" to mean:
> http://en.wikipedia.org/wiki/Data_set
> However, I am wondering if you actually meant "RDF Dataset" as defined
> here:
> http://www.w3.org/TR/sparql11-query/#rdfDataset
> If you meant the former, I suggest using the term "set of data", to
> avoid ambiguity.  If you meant the latter, I suggest using the term "RDF
> Dataset", and perhaps linking it to its definition.
>
> Also, I notice that:
>
> - There are many occurrences of the unqualified word "dataset".  I
> suggest checking them all, to see if they should be "RDF Dataset".

Existing documentation from SPARQL 1.0 already uses the term
"dataset" as an abbreviation for "RDF dataset", so we do not feel that
it is necessary to use the complete term on every occasion. However,
we have expanded the term each time that a paragraph first uses it.
Although a link to "Querying the Dataset" was already present in the
preceding paragraph, we have added the requested link.


> - Capitalization of the terms "RDF Dataset" and "Graph Store" is
> inconsistent -- sometimes written "RDF dataset" or "graph store".  It
> would help if it were consistently capitalized, as it helps the reader
> know that you are intending a specially defined term.

"RDF dataset" was consistently capitalized in the prose; however, it
has been updated to include a capitalized "D" to help the reader
realize that it is a formal term. The abbreviated term "dataset" has
remained unchanged. "Graph Store" has been updated.


> If I have understood the intent, it sounds like there are two sets of
> data involved in a DELETE/INSERT operation: one set is used in
> evaluating the WHERE clause, and the other is the target graph of the
> DELETE/INSERT, i.e., the graph that will be modified by the operation.
> If so, I think it would be helpful to state this up front, and make up a
> term for each of these sets, such as: "the set of data for the WHERE
> clause" and "the target graph".  Hmm, maybe the SPARQL 1.1 Query spec
> uses the term "active graph" for the former?
> http://www.w3.org/TR/sparql11-query/#rdfDataset
> In any case, it would be helpful to define specific terms for these, and
> use them consistently.

The terms "RDF dataset" and "dataset" are now used in this text
entirely in the context of the data that the WHERE clause will be
matched against. DELETE and INSERT may each refer to multiple graphs,
making a term like "target graph" difficult to manage. The changes
made to this section may now address some of the confusion raised
here.


> Also, it may be clearer to reword this paragraph as a decision tree,
> since the logic that is being described is a bit complex for
> unstructured English prose:
>
>   If ___ then ___ . Otherwise, if ___ then ___ . Otherwise ___ .

The purpose of this section of text is to provide a description in
prose. We hope that the changes have made the text clearer.


> 3. In searching for the definition of the backslash "\" symbol in
> section 4.2, it looks like it is supposed to be set difference, but I do
> not see it listed in either of these tables of standard mathematical or
> logic symbols:
> http://en.wikipedia.org/wiki/List_of_mathematical_symbols
> http://en.wikipedia.org/wiki/Table_of_logic_symbols
> However, I now see that that is because it is using a different unicode
> character, so a browser search did not find it:
> http://en.wikipedia.org/wiki/List_of_mathematical_symbols
> I suggest adding a brief note of clarification to section 4.2 stating
> that the backslash symbol ("\") indicates set difference.  Personally, I
> prefer the minus sign ("-") for set difference, though my tastes may be
> biased toward certain programming languages.

The character "\" has been replaced with the word "minus", and text
has been provided to explain that this refers to "set difference".


> 4. The difference between "USING" and "USING NAMED" is not explained,
> except in passing: "This describes a dataset in a manner similar to FROM
> and FROM NAMED clauses in the SPARQL1.1 Query Language."

We have replaced the phrase: "in a manner similar to FROM and FROM
NAMED" with: "in the same way as FROM and FROM NAMED" and have
provided a direct link to
http://www.w3.org/TR/sparql11-query/#specifyingDataset
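
As a brief sketch of the distinction, assuming hypothetical graph and
property names under the prefix ex:, the graphs listed with USING are
merged to form the default graph for the WHERE clause, while a graph
listed with USING NAMED is only reachable through a GRAPH block:

```sparql
PREFIX ex: <http://example.org/>

INSERT { GRAPH ex:report { ?doc ex:reviewedBy ?reviewer } }
USING ex:g1
USING ex:g2
USING NAMED ex:reviews
WHERE {
  # Matched against the default graph: the merge of ex:g1 and ex:g2.
  ?doc a ex:Document .
  # Matched against the named graph ex:reviews.
  GRAPH ex:reviews { ?doc ex:reviewer ?reviewer }
}
```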


> 5. As written, this in sec 3.1:
> http://www.w3.org/TR/sparql11-update/#graphUpdate
> [[
> Graph update operations change existing graphs in the Graph Store but do
> not explicitly delete nor create them. Non-empty inserts into
> non-existing graphs will, however, implicitly create those graphs, i.e.,
> an implementation *should* create graphs that do not exist before
> triples were inserted into them (there may be implementations providing
> an update service over a fixed set of graphs which in such case *must*
> return with failure for update requests that would create an unallowed
> graph), and *may* remove graphs that are left empty after triples are
> removed from them.
> ]]
> seems to say that an implementation that operates over a *variable*
> (non-fixed) set of graphs still has the option of not automatically
> creating graphs that do not exist.
>
> I suggest rewording the above portion as:
> [[
> Graph update operations change existing graphs in the Graph Store but do
> not explicitly delete nor create them. Non-empty inserts into
> non-existing graphs will normally implicitly create those graphs, i.e.,
> an implementation fulfilling an update request *should* silently and
> automatically create graphs that do not exist before triples are
> inserted into them, and *must* return with failure if it fails to do so
> for any reason.  (For example, the implementation may have insufficient
> resources, or an implementation may only provide an update service over
> a fixed set of graphs.)  An implementation *may* remove graphs that are
> left empty after triples are removed from them.
> ]]

Done, with minor changes:

"Graph update operations change existing graphs in the Graph Store but
do not explicitly delete nor create them. Non-empty inserts into
non-existing graphs will, however, implicitly create those graphs,
i.e., an implementation fulfilling an update request should silently
and automatically create graphs that do not exist before triples are
inserted into them, and must return with failure if it fails to do so
for any reason. (For example, the implementation may have insufficient
resources, or an implementation may only provide an update service
over a fixed set of graphs and the implicitly created graph is not
within this fixed set). An implementation may remove graphs that are
left empty after triples are removed from them."
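
For illustration only, with a hypothetical graph ex:newGraph that is
assumed not to exist beforehand, the behaviour described by this text
might look as follows:

```sparql
PREFIX ex: <http://example.org/>

# Non-empty insert into a graph that does not exist yet: the service
# should create ex:newGraph implicitly, or fail the request if it
# cannot (for example, when it serves only a fixed set of graphs).
INSERT DATA { GRAPH ex:newGraph { ex:s ex:p ex:o } } ;

# Removing the only triple leaves ex:newGraph empty; the
# implementation may then remove the graph, but need not.
DELETE DATA { GRAPH ex:newGraph { ex:s ex:p ex:o } }
```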


> 6. Similarly, I suggest rewording the following in section 3.1.1:
> http://www.w3.org/TR/sparql11-update/#insertData
> [[
> If no graph is described in the QuadData, then the default graph is
> presumed. If data is inserted into a graph that does not exist in the
> graph store, it *should* be created (there may be implementations
> providing an update service over a fixed set of graphs which in such
> case *must* return with failure for update requests that insert data
> into an unallowed graph).
> ]]
> to:
> [[
> If no graph is described in the QuadData, then the default graph is
> presumed.  If data is inserted into a graph that does not exist in the
> graph store, the update service SHOULD create that graph.  The service
> MUST return with failure if it fails to do so for any reason.
> ]]

Done, with minor modification. The text now reads as:

"The information how a graph store is accessed is defined in the
protocol and graph store protocol specs. A graph store is accessible
by either an update service (cf. protocol) or via the graph store
protocol (cf. graph store protocol). In either case the graph store is
hidden behind the service, making it accessible via the URI of a
SPARQL update service or via a URI that responds to the graph store
protocol."


> 7. And similarly in section 3.1.3 I suggest changing:
> http://www.w3.org/TR/sparql11-update/#deleteInsert
> [[
> If an operation tries to insert into a graph that does not exist, then
> the update service *should* create that graph.  The service MUST return
> with failure if it fails to do so for any reason.  If no data is to be
> inserted, then no graph will be created, even if applying the operation
> to a different dataset would result in data being inserted.
> ]]
> to:
> [[
> If an operation tries to insert into a graph that does not exist, then
> that graph should be created; again, there may be implementations
> providing an update service over a fixed set of graphs which in such
> case must return with failure for update requests that would create an
> unallowed graph. If no data is to be inserted, then no graph will be
> created, even if applying the operation to a different dataset would
> result in data being inserted.
> ]]

Done.
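
As a small sketch of the point about empty inserts, with hypothetical
graph and property names under the prefix ex:, consider the following
operation. If the WHERE clause finds no matches, nothing is inserted
and ex:archive is not created, even though the same operation applied
to different data could have created it.

```sparql
PREFIX ex: <http://example.org/>

INSERT { GRAPH ex:archive { ?item ex:archived true } }
WHERE  { GRAPH ex:inbox { ?item ex:status "obsolete" } }
```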


> 8. How is the URI of a Graph Store indicated?  The concept of a Graph
> Store is central to the SPARQL 1.1 Update spec, and hence one should be
> able to use a URI to refer to a particular Graph Store, but the spec
> doesn't say how this is done.
>
> The SPARQL 1.1 Service Description spec contains no sd:GraphStore
> class.
>
> The SPARQL 1.1 Graph Store HTTP Protocol spec sometimes mentions a Graph
> Store, but does not make clear how the intended Graph Store is
> identified.  It does say: "A compliant implementation of this
> specification SHOULD accept HTTP requests directed at its Graph Store".
> But what if a service hosts multiple Graph Stores?
>
> According to
> http://www.w3.org/TR/sparql11-update/#graphStore
> a Graph Store "is a mutable container of RDF graphs managed by a single
> service" which "contains one (unnamed) slot holding a default graph and
> zero or more named slots holding named graphs".
>
> Language in section 2.1
> http://www.w3.org/TR/sparql11-update/#graphStoreQueryServices
> "There is no presumption that the graph store managed by an update
> service . . . " suggests that an update service can only have *one*
> Graph Store, but: (a) I do not see this stated explicitly anywhere; (b)
> it would be useful for an update service to be able to have more than
> one Graph Store; and (b) what is the point of defining the notion of an
> "update service" if it is one-to-one with a Graph Store?  AFAICT, doing
> so just adds an unnecessarily layer and confusion.
>
> The SPARQL 1.1 Service Description spec does define the notion of an
> sd:DataSet, which is close to the notion of a Graph Store, but (if I
> understand the definition of Graph Store in
> http://www.w3.org/TR/sparql11-update/#graphStore )
> a Graph Store is mutable, whereas an sd:DataSet is not.

Graph stores are referred to by URI, but beyond this the
implementation is free to choose. This has been left unspecified
intentionally to allow each implementation to specify the details
individually.


> The reason one would want to have an update service that contains more
> than one Graph Store is that it would allow operations on collections of
> graphs to be performed efficiently.  For example, an RDF data pipeline
> may need to generate one collection of graphs from another, all within
> the same update service.  In other words, the content of one Graph Store
> is generated from the content of another Graph Store.  This is important
> because for efficiency, it is helpful to be able to subdivide large
> graphs into collections of smaller graphs.  An example might be a
> collection of 200,000 patient graphs.  There may be *multiple*
> collections of these patient graphs, A, B and C, where collection C is
> derived from collection B which is derived from collection A in a
> pipeline.  Since each patient graph within each of these collections is
> relatively independent, it is far more efficient when one in A is
> updated to only update the corresponding graphs in B and C, rather than
> regenerating the entire B and C collections.  It would be very
> convenient if each of these collections could be stored in a
> sd:GraphStore (presuming such a class is defined) within the same update
> service so that appropriate update operations could be selectively
> performed on them, with the assurance (for efficiency) that they are
> within the same update service.
>
> Oddly, there is a distinction between a Graph Store (which is mutable)
> and an RDF Dataset (which is not), but there is no corresponding
> distinction made with graphs.  They are treated as mutable in the SPARQL
> 1.1 Update spec: they can be the subject of an INSERT or DELETE
> operation.
>
> Actually, in reading the definition of RDF Dataset
> http://www.w3.org/TR/sparql11-query/#rdfDataset
> I do not see anything that would prevent it from changing over time.
> Certainly an RDF Dataset contains a particular set of graphs at the
> moment when it is queried, but I see no prohibition against that same
> RDF Dataset containing a different set of graphs at a different time.
> Hence, it looks to me like the notion of Graph Store could be dropped in
> favor of using the term "RDF Datastore" universally throughout both the
> Query and Update documents.  I think this would make more sense than
> using two different terms: both queries and updates would operate on RDF
> Datasets.

While queries operate on a dataset that is defined as a merge of
multiple graphs, any updates must necessarily modify a single graph at
a time. So it is not possible to state that updates operate on RDF
Datasets.

While a single INSERT or DELETE template may refer to multiple graphs,
the triples being specified are always for individual graphs. So
there is no way to remove the same triples from graphs <foo> and <bar>
with a single pattern in a template; rather, both graphs must be
mentioned explicitly in that template, i.e.:

DELETE { GRAPH <foo> { ... } GRAPH <bar> { ... }} ...
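
A fuller sketch of the same shape, with an assumed base IRI and a
hypothetical property <obsolete> standing in for the elided patterns,
might read:

```sparql
BASE <http://example.org/>

# The same triple pattern is repeated once per graph in the template.
DELETE {
  GRAPH <foo> { ?s <obsolete> ?o }
  GRAPH <bar> { ?s <obsolete> ?o }
}
WHERE {
  # Bindings are taken from <foo>; matching triples are then removed
  # from both <foo> and <bar>.
  GRAPH <foo> { ?s <obsolete> ?o }
}
```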


> 9. Typo: s/needs not be authoritative/need not be authoritative/

Done.


We would be grateful if you would acknowledge that your comment has
been answered by sending a reply to this mailing list.

Paul Gearon,
on behalf of the SPARQL WG
Received on Tuesday, 31 July 2012 14:25:58 UTC