Re: regular path expressions from Andy Seaborne on 2012-05-10 (public-rdf-dawg-comments@w3.org from May 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 10 May 2012 10:37:07 +0100
To: public-rdf-dawg-comments@w3.org
Message-ID: <4FAB8C43.5050803@epimorphics.com>
On 19/04/12 14:59, William Waites wrote:
> I was happy to see the revisions earlier in the week to the SPARQL 1.1
> working draft that saw the default semantics of the *, + and ?
> operators changed from counting to existential.
>
> Is it to be expected that a similar revision will be forthcoming for
> the simple walk vs. regular path problem signalled by Wim Martens et
> al?
>
> As I understand it, the non-standard W3C simple-walk semantics mean
> that evaluating path expressions containing those operators is
> intractable even with counting semantics. See Wim's earlier mail at
> [1].
>
> I understand that the WG is already over time and there is pressure to
> carve the spec into stone, but it seems to me better to be late than
> to release something containing a known serious error especially when
> the fix is clear.
>
> I would also like to point out -- and this is the reason I call the
> simple-walk semantics non-standard -- that basic things like the
> equivalence of "aa*" and "a+" are not true under the current
> draft. Apart from questions of tractability, this could be quite
> confusing to users.
>
> Cheers,
> -w
>
> [1] http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Feb/0029.html

William,

Using the analogy with regular expressions over strings needs some care. 
  There are significant ways in which a graph does not behave like a 
string.  For example, in RDF, when a literal value, 5 say, is used as 
the object of multiple triples, there is only one occurrence of the 
literal as a graph node (literals are "tidy" in the language of the RDF 
2004 Working Group). When there is a divergence of between string-like 
and graph-like characteristics there is a design decision to be made. 
Duplicate nodes is one such point.

Consider this example of a simple purchase order, with two items.

---- Data 1 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 .
:item2 :price 2 .
---- Data 1 ----


SPARQL 1.1 introduces grouping and aggregation - it's one of the major 
features missing and a major new feature request [F&R].  These queries 
give the total price of the order.

SELECT (SUM(?price) AS ?total)
WHERE { :order :hasItem/:price ?price }

SELECT (SUM(?price) AS ?total)
WHERE { :order :hasItem [ :price ?price ] }

SELECT (SUM(?price) AS ?total)
WHERE
{ :order :hasItem ?x .
   ?x :price ?price }

The first uses property paths.

---------
| total |
=========
| 7     |
---------

---- Data 2 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 .    # Same price.
:item2 :price 5 .
---- Data 2 ----

Query 1:
---------
| total |
=========
| 5     |
---------

Queries 2 and 3:
---------
| total |
=========
| 10    |
---------

In data 2, the price of the two items is the same.  The first query now 
does not return the total price of the order if only connectivity is 
considered.  That is considered to be confusing.

Query 1 is the same as:

SELECT (SUM(DISTINCT ?price) AS ?total)
WHERE
{ :order :hasItem ?x .
   ?x :price ?price
}

using DISTINCT in the sum() aggregation.  This shows it's possible to 
write a SPARQL query with non-counting results given the current 
proposed design.  A subquery with DISTINCT is another alternative. Note 
that the converse is not possible - starting with unique results and 
modifying the results for duplicates.

Property paths can be divided into syntactic short forms and 
connectivity operators.  There are distinct features; they could have 
been described in two different sections of the query spec.

The connection is that short forms can be used as a sub-part of a 
connectivity property path expression.  The working group could have 
decided not to cover that and only have connectivity via a single property.
William,

Using the analogy with regular expressions over strings needs some care. 
  There are significant ways in which a graph does not behave like a 
string.  For example, in RDF, when a literal value, 5 say, is used as 
the object of multiple triples, there is only one occurrence of the 
literal as a graph node (literals are "tidy" in the language of the RDF 
2004 Working Group). When there is a divergence of between string-like 
and graph-like characteristics there is a design decision to be made. 
Duplicate nodes is one such point.

Consider this example of a simple purchase order, with two items.

---- Data 1 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 .
:item2 :price 2 .
---- Data 1 ----


SPARQL 1.1 introduces grouping and aggregation - it's one of the major 
features missing and a major new feature request [F&R].  These queries 
give the total price of the order.

SELECT (SUM(?price) AS ?total)
WHERE { :order :hasItem/:price ?price }

SELECT (SUM(?price) AS ?total)
WHERE { :order :hasItem [ :price ?price ] }

SELECT (SUM(?price) AS ?total)
WHERE
{ :order :hasItem ?x .
   ?x :price ?price }

The first uses property paths.

---------
| total |
=========
| 7     |
---------


---- Data 2 ----
@prefix : <http://example/> .

:order :hasItem :item1 .
:order :hasItem :item2 .

:item1 :price 5 .    # Same price.
:item2 :price 5 .
---- Data 2 ----

Query 1:
---------
| total |
=========
| 5     |
---------

Queries 2 and 3:
---------
| total |
=========
| 10    |
---------

In data 2, the price of the two items is the same.  The first query now 
does not return the total price of the order if only connectivity is 
considered.  That is considered to be confusing.

Query 1 is the same as:

SELECT (SUM(DISTINCT ?price) AS ?total)
WHERE
{ :order :hasItem ?x .
   ?x :price ?price
}

using DISTINCT in the sum() aggregation.  This shows it's possible to 
write a SPARQL query with non-counting results given the current 
proposed design.  A subquery with DISTINCT is another alternative. Note 
that the converse is not possible - starting with unique results and 
modifying the results for duplicates.

Property paths can be divided into syntactic short forms and 
connectivity operators.  There are distinct features; they could have 
been described in two different sections of the query spec.

The connection is that short forms can be used as a sub-part of a 
connectivity property path expression.  The working group could have 
decided not to cover that and only have connectivity via a single property.

When combined, the semantics of the connectivity wins and { ?x (:a/:b)* 
?y } returns a *set* of matches for (?x, ?y).  Computational complexity 
is not increased because the "/" operator can be evaluated in context to 
only yield the necessary unique results which is a simple optimization.

So there is a choice for the sequence property path operator - syntactic 
short cut, or connectivity operator (putting everything as a 
connectivity path).  Which is the best choice is not a technical issue - 
it's a judgement.

If considered in isolation, and considering only whether a pattern 
matches or not, then connectivity semantics (non-counting), can be 
justified on design symmetry grounds.

Looked at in wider context other factors come into play.  SPARQL 1.1 is 
a not a fundamental redesign of SPARQL around property paths.

The working group has decided that the syntactic short cuts and 
arbitrary length connectivity operators, and when combined, connectivity 
semantcis applies, is the best choice in the context of SPARQL 1.1 overall.

As an aside, XPath [XP] treats sequences of atomic values differently to 
sequences of nodes.  Sequences of atomic values are not distinct, 
precisely so operations like "sum" and "count" do the expected thing. 
There is an operation in XPath/XQuery functions and operators to make a 
sequence of atomic distinct (fn:distinct-values).  Obviously, the 
reverse case of distinct being expanded to duplicates does not work.

In RDF we can not make the distinction between literals and resources - 
again we have a design decision point.

One final example:

An example in the F&R document is list access.  In property paths, this 
might address that use case:

{ :list rdf:rest*/rdf:first ?member }

and lists can have duplicates.  The judgement is again whether the short 
form should make the use case easily addressable or choose symmetry and 
expect the application writer to know to use the longer form:

{ :list rdf:rest* ?x .
   ?x rdf:first ?member }

when duplicates are possible.

We would be grateful if you would acknowledge that your comment has been
addressed by sending a message to this list.

	Andy
	On behalf of SPARQL-WG


[F&R] http://www.w3.org/TR/sparql-features/
[XP] http://www.w3.org/TR/xpath20/#id-path-expressions , bullet 2.

When combined, the semantics of the connectivity wins and { ?x (:a/:b)* 
?y } returns a *set* of matches for (?x, ?y).  Computational complexity 
is not increased because the "/" operator can be evaluated in context to 
only yield the necessary unique results which is a simple optimization.

So there is a choice for the sequence property path operator - syntactic 
short cut, or connectivity operator (putting everything as a 
connectivity path).  Which is the best choice is not a technical issue - 
it's a judgement.

If considered in isolation, and considering only whether a pattern 
matches or not, then connectivity semantics (non-counting), can be 
justified on design symmetry grounds.

Looked at in wider context other factors come into play.  SPARQL 1.1 is 
a not a fundamental redesign of SPARQL around property paths.

The working group has decided that the syntactic short cuts and 
arbitrary length connectivity operators, and when combined, connectivity 
semantcis applies, is the best choice in the context of SPARQL 1.1 overall.

As an aside, XPath [XP] treats sequences of atomic values differently to 
sequences of nodes.  Sequences of atomic values are not distinct, 
precisely so operations like "sum" and "count" do the expected thing. 
There is an operation in XPath/XQuery functions and operators to make a 
sequence of atomic distinct (fn:distinct-values).  Obviously, the 
reverse case of distinct being expanded to duplicates does not work.

In RDF we can not make the distinction between literals and resources - 
again we have a design decision point.

One final example:

An example in the F&R document is list access.  In property paths, this 
might address that use case:

{ :list rdf:rest*/rdf:first ?member }

and lists can have duplicates.  The judgement is again whether the short 
form should make the use case easily addressable or choose symmetry and 
expect the application writer to know to use the longer form:

{ :list rdf:rest* ?x .
   ?x rdf:first ?member }

when duplicates are possible.

We would be grateful if you would acknowledge that your comment has been 
addressed by sending a message to this list.

	Andy
	On behalf of SPARQL-WG


[F&R] http://www.w3.org/TR/sparql-features/
[XP] http://www.w3.org/TR/xpath20/#id-path-expressions , bullet 2.
Received on Thursday, 10 May 2012 09:37:45 UTC