Possible long response for WW-1 from Andy Seaborne on 2012-04-26 (public-rdf-dawg@w3.org from April to June 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 26 Apr 2012 12:57:48 +0100
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4F99383C.4090002@epimorphics.com>
Here is some possible draft content for a response to William Waites. 
His message provides the opportunity to highlight the judgement calls 
the design makes to balance property paths considered in isolation and 
property paths considered along side other SPARQL 1.1 features.  He is 
using the analogy to strings of (non)equivalence of "aa*" and "a+" which 
is a consideration, but not the only one as I see it.

Whether this is best sent as a WG formal response or as a discussion 
item on the comments list (or elsewhere) I don't know - I leave that to 
suggestions; I'm willing to have a discussion with people if they are 
prepared to discuss the use cases.

 Andy

http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Apr/0011.html

On 19/04/12 14:59, William Waites wrote:
 > I was happy to see the revisions earlier in the week to the SPARQL 1.1
 > working draft that saw the default semantics of the *, + and ?
 > operators changed from counting to existential.
 >
 > Is it to be expected that a similar revision will be forthcoming for
 > the simple walk vs. regular path problem signalled by Wim Martens et
 > al?
 >
 > As I understand it, the non-standard W3C simple-walk semantics mean
 > that evaluating path expressions containing those operators is
 > intractable even with counting semantics. See Wim's earlier mail at
 > [1].
 >
 > I understand that the WG is already over time and there is pressure to
 > carve the spec into stone, but it seems to me better to be late than
 > to release something containing a known serious error especially when
 > the fix is clear.
 >
 > I would also like to point out -- and this is the reason I call the
 > simple-walk semantics non-standard -- that basic things like the
 > equivalence of "aa*" and "a+" are not true under the current
 > draft. Apart from questions of tractability, this could be quite
 > confusing to users.
 >
 > Cheers,
 > -w
 >
 > [1] 
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Feb/0029.html

William,

Using the analogy with regular expressions over strings needs some care. 
  There are significant ways in which a graph does not behave like a 
string.  For example, in RDF, when a literal value, 5 say, is used as 
the object of multiple triples, there is only one occurrence of the 
literal as a graph node (literals are "tidy" in the language of the RDF 
2004 Working Group). When there is a divergence of between string-like 
and graph-like characteristics there is a design decision to be made. 
Duplicate nodes is one such point.

Consider this example of a simple purchase order, with two items.

---- Data 1 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 .
:item2 :price 2 .
---- Data 1 ----


SPARQL 1.1 introduces grouping and aggregation - it's one of the major 
features missing and a major use feature request [F&R].  These queries 
give the total price of the order.

SELECT (SUM(?price) AS ?total)
WHERE { :order :item/:price ?price }

SELECT (SUM(?price) AS ?total)
WHERE { :order :item [ :price ?price ] }

SELECT (SUM(?price) AS ?total)
WHERE
{ :order :item ?x .
   ?x :price ?price }

The first uses property paths.

---------
| total |
=========
| 7     |
---------


---- Data 2 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 . # Same price.
:item2 :price 5 .
---- Data 2 ----

Query 1:
---------
| total |
=========
| 5     |
---------

Queries 2 and 3:
---------
| total |
=========
| 10    |
---------

In data 2, the price of the items is the same.  The first query now does 
not return the total price of the order if only connectivity is 
considered.  That is considered to be confusing.

Query 1 is the same as:

SELECT (SUM(DISTINCT ?price) AS ?total)
WHERE
{ :order :item ?x .
   ?x :price ?price
}

using DISTINCT in the sum() aggregation.  This shows it's possible to 
write a SPARQL query with non-counting results given the current 
proposed design (option 6).  A subquery with DISTINCT is the more 
general case but it does suggest that the short syntax ought to favour 
the application writer task.  The converse is not possible - starting 
with unique results and modifying the results for duplicates.


Property paths can be divided into syntactic short forms and 
connectivity operators.  There are distinct features; they could have 
been described in two different sections of the query spec.

The connection is that short forms can be used as a sub-part of a 
connectivity property path expression.  The working group could have 
decided not to cover that and only have connectivity via a single 
property.


When combined, the semantics of the connectivity wins and { ?x (:a/:b)* 
?y } returns a *set* of matches for (?x, ?y).  Computational complexity 
is not increased because the "/" operator can be evaluated in context to 
only yield the necessary unique results which is a simple optimization.

So there is a choice for the sequence property path operator - syntactic 
short cut, or connectivity operator (putting everything as a 
connectivity path).  Which is the best choice is not a technical issue - 
it's a judgement.  The analogous equivalence is aa* and a+ (assuming"/" 
for the sequence on connectivity use cases - it could be a different 
operator) needs to be treated with care - it's not automatic that it is 
desirable when the whole collection of desirable features for SPARQL 1.1 
is considered.

If considered in isolation, and considering only whether a pattern 
matches or not, then connectivity semantics (non-counting), can be 
justified on design symmetry grounds.

Looked at in wider context other factors come into play: there has been 
a lack of recognition of this wider context.  SPARQL 1.1 is a not a 
fundamental redesign of SPARQL around property paths.

The working group has decided that the syntactic short cuts and 
arbitrary length connectivity operators is the best choice in the 
context of SPARQL 1.1 overall.

As an aside, XPath [XP] treats sequences of atomic values differently to 
sequences of nodes.  Sequences of atomic values are not distinct, 
precisely so operations like "sum" and "count" do the expected thing. 
There is an operation in XPath/XQuery functions and operators to make a 
sequence of atomic distinct (fn:distinct-values).  Obviously, the 
reverse case of distinct being expanded to duplicates does not work.

In RDF we can not make the distinction between literals and resources - 
again we have a design decision point.

One final example:

An example in the F&R document is list access.  In property paths, this 
might address that use case:

{ :list rdf:rest*/rdf:first ?member }

and lists can have duplicates.  The judgement is again whether the short 
form should make the use case easily addressable or choose symmetry and 
expect the application writer to know to use the longer form:

{ :list rdf:rest* ?x .
   ?x rdf:first ?member }

when duplicates are possible.


 Andy


[F&R] http://www.w3.org/TR/sparql-features/
[XP] http://www.w3.org/TR/xpath20/#id-path-expressions , bullet 2.
Received on Thursday, 26 April 2012 11:58:23 UTC