Re: [TF-PP] Moving forward from Matt Perry on 2010-06-15 (public-rdf-dawg@w3.org from April to June 2010)

From: Matt Perry <matthew.perry@oracle.com>
Date: Tue, 15 Jun 2010 08:46:48 -0400
To: Andy Seaborne <andy.seaborne@talis.com>
CC: public-rdf-dawg@w3.org
Message-ID: <4C177638.9020200@oracle.com>
Hi Andy,

Please find my comments inline.

-Matt

Andy Seaborne wrote:
> Matt,
>
> When we had:
> [[
> PROPOSED: The cardinality of solutions to fixed-length paths
> is the same as the cardinality of solutions to the path expanded into
> triple patterns (with all variables projected); the cardinality of
> solutions to variable-length paths is the cardinality of solutions
> via paths that do not repeat nodes; the cardinality of solutions to
> paths combining fixed and variable length (elt{n,} ) is a combination
> of the fixed definition plus the variable definition for paths longer
> than the fixed length.
> ]]
>
> it looked reasonable at the time but I'm finding that it isn't do 
> clear cut a distinction of variable and fixed length paths.
>
> :p{n}
>   which is similar or the same as p{n,n}
>   which is similar or the same as :p/:p/..../:p
> :p{n,m}
>   using the text above it is :p{n} / :p{0,(m-n)}
>   which is similar or the same as :p/:p/:p?/:p?
>      (n times :p and (m-n :p?)
> :p+
>   which is also :p{1,}
>
> Birte suggests that
>
> :p{n,m} is
>    :p{n} UNION :p{n+1} ... UNION :p{m}
>
> although this is problematic because it goes round loops multiple 
> times if fixed length paths are the same as their triple pattern 
> expansion.

You make a good point about allowing arbitrary repetition of loops when 
evaluating the triple pattern expansion of a fixed length property path. 
Because all of our proposals prevent repetition of loops when evaluating 
unbounded property paths, we will always have some awkwardness in corner 
cases. Take the following example:

:a :p :c
:c :p :c
:c :p :z

{ :a :p{4} ?y } finds the path :a -:p-> :c -:p-> :c -:p-> :c -:p-> :z, 
but { :a :p{3,} ?y } misses this path.


>
> :p{1,} needs to have a relationship to :p{1,X} as X tends to infinity. 
> When X is larger than the longest path in the graph, it's going to odd 
> (but possible) to have :p+ and :p{1,X} returning different answers.
>
> I can see that :p{n,m} defined as a union for small n and large m is 
> going to an implementation burden for some systems using a pure 
> expansion.
>
> The closest consistent view point with the text above on cardinality 
> seems to me, currently, to be the form that counts all paths, 
> including once round loops, which is the triple matching and 
> backtracking. Intuitively, it's like trying to create triple 
> expansions but limiting loops to once round. It feels as if it is 
> consistent possible future returning path lengths.

I would argue that node marking with backtracking also makes sense and 
is much easier to implement than edge marking with backtracking. Node 
marking is easier to implement because when we encounter two nodes with 
the same URI, we know that they are the same node, but when we encounter 
two edges with the same URI, we can't say anything about their equality. 
This fact makes cycle detection based on node repetition easier in my 
opinion, especially when using disk-based storage and joining triples 
tables.

As an example of node marking with backtracking, consider the following 
dataset and query.

:a :p :b
:b :p :z
:b :p :c
:a :p :c
:c :p :c
:c :p :z

{ :a :p+ ?y }

With node marking with backtracking, we get:
:a -:p-> :b
:a -:p-> :b -:p-> :z
:a -:p-> :b -:p-> :c
:a -:p-> :b -:p-> :c -:p-> :z
:a -:p-> :c
:a -:p-> :c -:p-> :z

This is exactly all simple paths that match the path expression and also 
what we would get by doing a recursive SQL query against an s,p,o 
triples table.

With edge marking with backtracking, we get the following additional paths:
:a -:p-> :b -:p-> :c -:p-> :c
:a -:p-> :b -:p-> :c -:p-> :c -:p-> :z
:a -:p-> :c -:p-> :c
:a -:p-> :c -:p-> :c -:p-> :z


>
> The other consistent viewpoint, which is not in line with the text 
> above, is to make path matching distinct only. That drops the proposal 
> for fixed length paths being equivalent to pure triple patterns.
>
> Simply picking distinct everywhere seems restrictive.
>
> We also have the fact that while we have used :p as the single step 
> path component, it can be a complex expression:
>
>   ?x (:a/:b)+ ?y
>   ?x (:a*/:b)+ ?y
>   ?x (:a|:b)+ ?y
>   ?x (:a/(:b|^:c))+ ?y
>
> Currently, I'm working on the idea of some fixed equivalences of 
> transformations of paths down to a few operators. We at least get 
> defined cardinalities this way, and an explanation although the 
> treatment isn't going to be uniform as
>
> card(:p+) = card(:p{1,}) != card(:p{1,larger than graph})
>
> There would be an operator path+, and the expansions always put the 
> unbounded part last (that does not mean the unbounded part is last in 
> t the path -- the path may be :p+/:q{5}).
>
> pathArbLengthMatch(path)
>
> but we could start with
>
> pathArbLengthMatch(path, min, max)
>
> making the break with triple pattern equivalences happen when any {} 
> is introduced.  "+" is {1,inf} and "*" is :p{0} union :p+
>
> Operators like "/", "|", "^" are simply syntax for triple expansions.
>
> The unfortunate consequence is that
> {n} is not a simple triple pattern if it is {n,n}.
>
> How does that fit with recursive SQL?

The formalization approach looks good to me. My main problem is the 
edge-marking vs node-marking issue. Since we can never make the 
cardinalities for fixed length and unbounded path expressions agree 
completely, I think it makes sense to use the node marking approach for 
efficiency reasons and also for agreement with SQL (assuming an s,p,o triples
table where the recursion is prior.o = current.s).


>
>     Andy
>
>
> On 09/06/2010 1:40 PM, Matthew Perry wrote:
>> Thanks for the summary Andy.
>>
>> I do not like the triple marking with backtracking/unwinding approach 
>> to determine the cardinality for arbitrary length paths. I think this 
>> will be too hard to implement efficiently on systems that store 
>> graphs on disk. Using "distinct" or "no node repetition" semantics, 
>> most property path queries can be evaluated efficiently in a DBMS 
>> using recursive SQL, but the triple marking approach is a different 
>> story and would most likely require a really convoluted 
>> representation on disk. Since we are only talking about differences 
>> in cardinality of results, I worry that the elegance of the 
>> formalization may not justify the additional burden on developers.
>>
>> Cheers,
>> Matt
>>
>> ----- Original Message -----
>> From: andy.seaborne@talis.com
>> To: public-rdf-dawg@w3.org
>> Sent: Wednesday, June 9, 2010 8:15:06 AM GMT -05:00 US/Canada Eastern
>> Subject: [TF-PP] Moving forward
>>
>> To summarise where I think we are:
>>
>> 1/ Use the expansions approach that Birte proposed in:
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2010AprJun/0293.html
>>
>> 2/ Define operators for zero length paths and for arbitrary length paths
>> (i.e. +).
>>
>> 2a/ zero length paths include all subjects/objects/constant terms as
>> possible bindings (and check this makes sense).
>>
>> 2b/ arbitrary length paths based on triple marking with
>> backtracking/unwinding.  Define the operator by algorithm (of course,
>> you can implement anyway you want but this defines the correct answers
>> and cardinality).
>>
>> 3/ Need to compare the cardinality for arbitrary length paths and {n,m}
>> forms since they should agree, and if not, we need to understand why
>> (this is why I think the triple marking with unwinding is the way to go)
>> It may make sense to define {n,m} in terms of the arbitrary length paths
>> algorithm.
>>
>> 4/ Examples, examples, examples.
>>
>>     Andy
>>
>
Received on Tuesday, 15 June 2010 12:47:44 UTC