Re: summary on options for JP-4 Comment about the semantics of property paths from Andy Seaborne on 2012-02-11 (public-rdf-dawg@w3.org from January to March 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Sat, 11 Feb 2012 16:55:19 +0000
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4F369D77.4010207@epimorphics.com>
== Short version

I can live with either option.

I prefer option 2 when done properly but as things stand, I have to say 
that I can't see option 2 + quick 3LC as viable.

I've implemented, experimentally, a form of option 2.  It's available to 
all.

== Consequences of option 2 (DISTINCT)

Before we decide, I want to point out what I think are the implications 
and consequences of this option.  I've done a quick implementation (for 
functionality, not performance) and looked at the impact on the spec.

This is not to put barriers in the way; it is because we do need to have 
a single proposal here before deciding on this route. There are 
different choices within the description Axel gave and details to work out.

Design 1:
DISTINCT() is a new path operator and makes everything inside the () 
yield unique solutions.  This is probably what we have been meaning but 
it's not the only choice.

Design 2:
As design 1 but DISTINCT is only allowed at the top level of a path.
Probably it similar to SELECT DISTINCT bracketing (needs checking though).

Matt's
{ :a :p1{2,3}/DISTINCT(:p2*)/:p3{1,2} ?d }
fails this.

Design 3:
DISTINCT() modifies the * and + operators to make them deal in unique 
solutions but leaves other operators used inside () alone.

This design preserves the property path transformation rewrite rules.

DISTINCT is a modifier on * and + alone; I have to check whether this is 
just a change to ALP or not. (ALP is defined in terms of node->var so 
what happens about ?x :p* ?y, two vars?)

It is very confusing syntax.  It would be better to have unique-* 
operators, say !* and !+.  Not sure what happens to {n,m}

I'm not sure of all the consequences of this is P in P!* can itself be a 
compound path.

Design 4:
Only DISTINCT(:p*) and DISTINCT(:p+) are allowed.  :p is an IRI.

== Spec

The changes to the descriptive section look OK.

The formal section is impacted more widely.

The definition of path evaluation is defined by transformation for most 
of the operators, leaving operators for *, +, {0} and 
NegatedPropertySet.  There are no operators for "/", "|", "^" or much of 
{n,m} in the evaulation.

This is how we maximised the interactions of property paths and entailment.

If we add DISTINCT(path) for any path, we can either leave the 
definitions as-is and wrap in a PathDistinct that evaluates like
(distinct (project  ... )).  [This is an outline and must to be checked.]

We may need to redo ALP or the operators for * to and + to make it clear 
what happens in the uniqueness case although this is technically 
unnecessary.

However, that is just doing the minimum and I don't think it makes for a 
clear spec because it is not explicit about combined expressions and 
cardinality.

A proper incorporation is to define operators for all the path 
expressions, with cardinality rules and execution.  We can note 
transformations for non-distinct cases.

This then accounts for the cardinality of mixed expressions like:

{ :a :p1{2,3}/DISTINCT(:p2*)/:p3{1,2} ?d }

Also, the effect on implementation might be to push them to have to 
implement all the path operators at runtime, not rewrite them to other 
forms to leave a few core path operators.  I don't know how other 
implementations work.

== Implementation

I've done a version of ARQ that implements design 1.  The grammar change 
is trivial: change PathPrimary to have the DISTINCT(path) case.

The evaluation change is trivial but that's because of the how ARQ 
works.  It already had a complete evaluator that works on un-transformed 
property paths so I was reusing that.

The implementation is to add a new path expression node type for 
distinct.  It's a modifier like {n,m} or *, just the syntax is prefix 
function form.

My current implementation is not performant; it's simple and functional 
for experimentation  - it simply makes unique the results of evaluating 
the subpart so it's not exploiting the DISTINCT condition yet and does 
not change the evaluation costs.  Because ARQ uses a recursive evaluator 
without caching of partial results, it looks simple to add a different 
evaluator.

The "caching" (memo-izing) is important for a time-space tradeoff of 
evaluation and will change the complexity growth function even in the 
counting case but it's not the UC that ARQ is targeted at.

	Andy

On 09/02/12 21:10, Axel Polleres wrote:
> (in completion of ACTION-587)
>
> Dear all, as discussed in the last Telco, we have several options on
> how to proceed with addressing  comment JP-4 [1]. If possible, I
> would like to get consensus on how to proceed here in the next
> Telco.
>
> In the previous Telco [2], we seemed to have consensus that we do not
> aim to switch the default behaviour from counting semantics to
> distinct paths.
>
> Now two possibilities to proceed were discussed:
>
> Option 1... keep everything as it is in the grammar, and explain
> which DISTINCT path subqueries can be optimized: As outlined in my
> email below, it might not be entirely trivial to argue in response to
> the comment that this would be equivalent to the JP-4 proposed
> semantics, I am not 100% sure whether/how to define a rewriting to
> wrap all path expressions into DISTINCT subqueries, such that it
> would be equivalent to their semantics (e.g. regarding bnode []
> shortcuts).
>
> Option 2 ... add DISTINCT around paths: It seems that sticking to our
> intended semantics and allowing - orthogonally to their ALLPATHS
> keyword proposal the keyword DISTINCT( ) around path expressions
> switching to existential paths semantics would be equivalent to the
> JP-4 existential paths semantics as outlined in Section 7.1 of their
> paper, and thus optimizable.
>
> Unlike someone sees a 3rd alternative, I would like to propose to
> decide between those two options next time and proceed, discussion
> prior to the call on email would be appreciated.
>
> Option 2 might be easier to implement, but also requires us to go for
> another LC round, as it would change the grammar. I think, in case we
> skip PR and manage to republish very soon, we would still manage to
> stay within time limits, but I would also like to know the team
> contacts' opinion on that.
Received on Saturday, 11 February 2012 16:55:46 UTC