Re: Fwd: Re: Comments about the semantics of property paths from Matthew Perry on 2011-01-26 (public-rdf-dawg@w3.org from January to March 2011)

From: Matthew Perry <matthew.perry@oracle.com>
Date: Wed, 26 Jan 2011 12:56:18 -0500
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4D406042.6070003@oracle.com>
Forget that last part ... temporary brain freeze.

- Matt

On 1/26/2011 8:53 AM, Matthew Perry wrote:
> I think this should work for the sum age query.
>
> SELECT (SUM(?A) AS ?sum)
> WHERE
> { ?F :age ?A
>  { SELECT DISTINCT ?F
>    WHERE
>     { :me (:friend)+ ?F } }}
>
> It seems that, in general, DISTINCT will work as a "there exists" query when URIs and blank nodes are the endpoint of a path, because we have a one-to-one mapping between URI/BN values and graph vertices, but we get a one-to-many mapping from literal values to vertices.
>
> - Matt
>
>
> On 1/26/2011 8:24 AM, Andy Seaborne wrote:
>> Here's an example, based on items in a (simple) shopping basket. There are two things that are :item1 in the order (via compound and directly - think "buying a spare part").
>>
>> The order cost is 6, uniqueness would make it 2.
>>
>> (I've sent a version of this to Jorge offlist as discussions aren't possible on the comments list).
>>
>>     Andy
>>
>> ==== D1.ttl
>>
>> @prefix : <http://example/> .
>>
>> :order :contains :thing1 .
>> :order :contains :compound1 .
>>
>> :thing1 :unitOf :item1 .
>> :thing2 :unitOf :item2 .
>> :thing3 :unitOf :item1 .
>>
>> :item2 :price 2 .
>> :item1 :price 2 .
>>
>> :compound1 :contains :thing2 .
>> :compound1 :contains :thing3 .
>>
>>
>> ==== Q1.rq
>>
>> PREFIX : <http://example/>
>>
>> SELECT (SUM(?itemPrice) AS ?price)
>> {
>>   :order :contains+/:unitOf/:price ?itemPrice .
>> }
>>
>>
>>
>> -------- Original Message --------
>> Subject: Re: Comments about the semantics of property paths
>> Date: Tue, 25 Jan 2011 15:55:08 -0300
>> From: jorge perez <jorge.perez.rojas@gmail.com>
>> To: Andy Seaborne <andy.seaborne@epimorphics.com>
>> CC: public-rdf-dawg-comments@w3.org
>>
>> Hello Andy,
>>
>> Thanks for your email. Yes, my comments have been answered. An
>> additional comment is below.
>>
>> On Tue, Jan 25, 2011 at 1:08 PM, Andy Seaborne
>> <andy.seaborne@epimorphics.com> wrote:
>>> Hi Jorge,
>>>
>>> XPath is designed for XML processing where XML nodes and values are treated
>>> in different ways. XPath evaluation returns distinct XML nodes, but
>>> duplicate values. One evaluation of an XPath expression can't mix XML nodes
>>> and values - see the numbered list in [1]. "XQuery 1.0 and XPath 2.0
>>> Functions and Operators" has an operation fn:distinct-values to make values
>>> unique in a sequence [2].
>>>
>>> An RDF graph does not have this distinction of nodes and values. Graph nodes
>>> (vertexes) are IRIs, blank nodes or literals with no separation. Repetition
>>> of literals is significant, consider SUM applied to a purchase order where
>>> two items have the same price, so multiple paths to the same endpoint do
>>> matter.
>>
>> Aggregation is actually another reason why multiple paths to the
>> same endpoint *do not* have to be considered.
>>
>> Consider a network of friends, and assume that you want to obtain the
>> SUM of the age of all your network (friends of your friends). Then a
>> very natural way to do this is with the query (simplified syntax)
>>
>> SUM (?A)
>> :me (:friend)+/:age ?A
>>
>> The query is navigating to all the friends of my friends, then to the
>> age value of every one, and then taking the SUM. Isn't this natural?
>> But, consider the following data
>>
>> :me :friend :f1
>> :me :friend :f2
>> :f1 :friend :f2
>> :f1 :age 20
>> :f2 :age 20
>>
>> I would expect 40 as the result of the above query, but the expression
>>
>> :me (:friend)+/:age ?A
>>
>> returns
>>
>> ?A
>> 20 (for the path :me->:f1)
>> 20 (for the path :me->:f2)
>> 20 (for the path :me->:f1->:f2)
>>
>> and thus, the answer of the SUM would be 60. How do you explain the
>> result of this query to a user? Notice that using DISTINCT does not
>> solve the problem, since with DISTINCT you would obtain 20 as the SUM
>> which is also wrong.
>>
>> Is there a way to correctly answer the above query with the current
>> design of property paths?
>>
>> Thanks,
>> - jorge
>>
>>>
>>> SPARQL property paths do not apply uniqueness to property paths and the
>>> property path expression is, where appropriate, the same as the expansion in
>>> terms of triple patterns. It is not a matter of efficiency because the
>>> answers concerning duplicate literal values would be rather unexpected if
>>> only distinct values were returned.
>>>
>>> This leaves the ArbitraryLengthPath operation for the use of "+" in paths.
>>> This traverses cycles once by terminating the search on encountering an edge
>>> already traversed for that evaluation of ArbitraryLengthPath. In an earlier
>>> design, cycle termination was by detecting visited nodes but the WG
>>> considers the edge traversal a better choice. The new design is one more
>>> step of evaluation on a cycle than the first design and leaves better
>>> prospects for future standardization.
>>>
>>> SPARQL has the keyword DISTINCT so an application can choose between
>>> duplicates and no duplicates. A query engine can exploit this if it chooses
>>> to; use with sub-queries means that solution modifiers can be applied to
>>> specific parts of the query such as a path.
>>>
>>> An implementation is free to implement evaluation in any way it chooses
>>> provided it results in the same answers. The WG felt that using an algorithm
>>> was the most helpful way to specify the feature, especially to implementers.
>>>
>>> Property paths have been implemented in a number of systems (see [3] for a
>>> partial list) and found to be useful.
>>>
>>> We would be grateful if you would acknowledge that your comment has been
>>> answered by sending a reply to this mailing list.
>>>
>>> Andy
>>> On behalf of the SPARQL working group.
>>>
>>> [1] http://www.w3.org/TR/xpath20/#id-path-expressions
>>> [2] http://www.w3.org/TR/xpath-functions/#func-distinct-values
>>> [3] http://esw.w3.org/SPARQL/Extensions/Paths
>>>
>>> On 15/12/10 18:34, jorge perez wrote:
>>>>
>>>> Hello Andy,
>>>>
>>>> Thank you very much for your response and for considering my comments,
>>>> and sorry for the late reply.
>>>>
>>>> There are a couple of comments that you have not answered.
>>>>
>>>> ""
>>>> As a separate but very important issue, notice that the XPath language
>>>> does not consider duplicate paths when evaluating expressions (XPath
>>>> is evaluated in the "there exists" way that I mentioned before). Thus,
>>>> counting paths in SPARQL would be somewhat in contradiction with
>>>> previously proposed path languages considered by the W3C.
>>>> ""
>>>>
>>>> I think that if this W3C Recommendation is in discordance with a
>>>> previous Recommendation about a similar topic, then DAWG should have
>>>> strong reasons for that, and make them clear in the specification. The
>>>> specification should also advise the reader about this issue.
>>>>
>>>> Besides that comment, you have said nothing about efficiency of
>>>> evaluation. Notice that this is not related to a particular way of
>>>> implementing the language. It is about the huge efficiency impact that
>>>> any implementation will suffer in practice. You have not acknowledged
>>>> that in your response. Have you considered this as an issue?
>>>>
>>>> Another comment that is not covered by your response is whether there
>>>> exists a use case that demands counting different paths. In your
>>>> response, it seems that the reason for counting paths is to make
>>>> the job of the implementors easier (by reusing algebra operators).
>>>> Contrary to what the group thinks, I think that not counting paths
>>>> gives the implementor more freedom since paths could be implemented in
>>>> several different ways, just one of them being by reusing algebra
>>>> operators. Can you please clarify whether there are use cases about
>>>> this? This would help a lot.
>>>>
>>>> If you respond to the comments above I can consider my comments answered.
>>>>
>>>> I have a couple of additional words. Please do not consider them as a
>>>> formal objection to the process, but just as my opinion.
>>>>
>>>> I still strongly disagree with your design decisions about property
>>>> paths. In particular, I insist that it is a mistake to define the
>>>> semantics in the presence of cycles in a non-standard way and by
>>>> forcing a particular algorithm to evaluate them. In your response you
>>>> say that there can be corner cases, but it is not only a problem of
>>>> corner cases. From my point of view it will become a problem of
>>>> adoption of the standard. On this point I think that the group should
>>>> not neglect that there is a lot of related (theoretical and practical)
>>>> work in this area that has handled cycles in a completely different
>>>> way.
>>>>
>>>> To conclude, I do think that the property-paths material in the
>>>> current specification is far from being mature. Considering that the
>>>> group is on a tight schedule, I think that it would be better to not
>>>> include property paths in this round of standardization, than
>>>> including them in their current form.
>>>>
>>>> Thank you very much for considering my comments.
>>>> - jorge
>>>>
>>>> On Thu, Dec 2, 2010 at 8:20 AM, Andy Seaborne
>>>> <andy.seaborne@epimorphics.com>  wrote:
>>>>>
>>>>> Jorge,
>>>>>
>>>>> Thank you very much for your comments.
>>>>>
>>>>> The working group considered a number of factors in designing the property
>>>>> path features. In addition to the points you raise, the WG also included
>>>>> consideration that, while this working group is not adding a path datatype
>>>>> (needed to inquire about any path matched later in the query), nor the
>>>>> specific case of access to path length, the WG should leave open as many
>>>>> possibilities here for future work. Another factor in the design is the
>>>>> relationship of some property path expressions to triple pattern forms.
>>>>>
>>>>> Although not specifying returning the path length of a match, nor specifying
>>>>> returning the matched path itself, the WG felt that, on balance, the design
>>>>> in the working draft gave maximum scope for any later standardization work.
>>>>> The issue of path length particularly was considered as a feature for this
>>>>> round of work but, when considered against all the other work items the WG
>>>>> has taken on, it didn't make the final list of work items. This led to the
>>>>> conclusion that counting path possibilities, not a "there exists" condition,
>>>>> was the better choice for this round of standardization. Adding access to
>>>>> the path matched is better served if all paths are considered.
>>>>>
>>>>> Another consideration was the relationship of property paths and existing
>>>>> queries using triple patterns.
>>>>>
>>>>> { ?x :p{2} ?y }
>>>>>
>>>>> and
>>>>>
>>>>> { ?x :p ?Z . ?Z :p ?y }, with ?Z projected away.
>>>>>
>>>>> The WG decided to make these equivalent, including in terms of numbers of
>>>>> solutions. This gives the semantics of many path forms in terms of SPARQL
>>>>> graph pattern operators. This was felt to be intuitive and to utilize the
>>>>> capabilities of query engines: rather than requiring yet another mechanism,
>>>>> the equivalence means that join-technology (for example) can be used to
>>>>> solve the pattern.
>>>>>
>>>>> This then leaves the issue of cycles in the "+" operator. The design is one
>>>>> in which the cycles in the "+" operator are handled by traversing a directed
>>>>> edge (triple in the data) once. This will be explained in the final version
>>>>> of the query specification - there is a placeholder for it in the current
>>>>> editors' working draft. The current working draft has been clarified to use
>>>>> "multiset-union" for the union in the ArbitraryLengthPath definition.
>>>>>
>>>>> This overall design is a tradeoff of implementation, future possibilities,
>>>>> and equivalence of patterns on graphs. The WG is aware that corner cases
>>>>> can arise where different intuitions are not compatible. On balance, the
>>>>> WG feels that the current design is most suitable for this round of
>>>>> standardization.
>>>>>
>>>>> Again, thank you for your helpful comments.
>>>>>
>>>>> We would be grateful if you would acknowledge that your comment has been
>>>>> answered by sending a reply to this mailing list.
>>>>>
>>>>> Andy
>>>>> on behalf of the SPARQL Working Group.
>>>>>
>>>
>>
>
>
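The numbers quoted in this thread (60, 20 and 40 for the friend network; 6 and 2 for the shopping basket) can be reproduced with a small simulation. The following is only a sketch in plain Python, not SPARQL and not any engine's implementation. It assumes the draft rule described above, where "+" never reuses an edge within one evaluation, and it hard-codes the two data sets from the messages. The names trail_ends, objects, FRIENDS and BASKET are made up for this illustration.

```python
from typing import Iterator

Triple = tuple[str, str, object]


def trail_ends(graph: list[Triple], start: str, pred: str) -> Iterator[object]:
    """Yield the end node of every non-empty walk from `start` along `pred`
    edges in which no edge is used twice (the assumed draft rule for "+").
    An end node is yielded once per walk, so duplicates are kept."""
    def walk(node, used):
        for triple in graph:
            s, p, o = triple
            if s == node and p == pred and triple not in used:
                yield o
                yield from walk(o, used | {triple})
    return walk(start, frozenset())


def objects(graph: list[Triple], subject: object, pred: str) -> list[object]:
    """All objects of triples matching (subject, pred, ?o)."""
    return [o for s, p, o in graph if s == subject and p == pred]


# Jorge's friend network.
FRIENDS = [
    (":me", ":friend", ":f1"),
    (":me", ":friend", ":f2"),
    (":f1", ":friend", ":f2"),
    (":f1", ":age", 20),
    (":f2", ":age", 20),
]
friends = list(trail_ends(FRIENDS, ":me", ":friend"))  # one entry per path

# :me (:friend)+/:age ?A, one solution per path
ages_per_path = [a for f in friends for a in objects(FRIENDS, f, ":age")]
sum_per_path = sum(ages_per_path)

# DISTINCT applied to the ?A values themselves
sum_distinct_ages = sum(set(ages_per_path))

# DISTINCT applied to ?F in a sub-query, then joined with :age (Matt's query)
sum_distinct_friends = sum(
    a for f in set(friends) for a in objects(FRIENDS, f, ":age")
)

# Andy's shopping basket (D1.ttl).
BASKET = [
    (":order", ":contains", ":thing1"),
    (":order", ":contains", ":compound1"),
    (":thing1", ":unitOf", ":item1"),
    (":thing2", ":unitOf", ":item2"),
    (":thing3", ":unitOf", ":item1"),
    (":item2", ":price", 2),
    (":item1", ":price", 2),
    (":compound1", ":contains", ":thing2"),
    (":compound1", ":contains", ":thing3"),
]
# :order :contains+/:unitOf/:price ?itemPrice
prices = [
    price
    for thing in trail_ends(BASKET, ":order", ":contains")
    for item in objects(BASKET, thing, ":unitOf")
    for price in objects(BASKET, item, ":price")
]
order_cost = sum(prices)
order_cost_distinct = sum(set(prices))

print(sum_per_path, sum_distinct_ages, sum_distinct_friends)
print(order_cost, order_cost_distinct)
```

In this sketch, de-duplicating the literal values collapses both sums, while de-duplicating the ?F nodes before looking up :age gives 40 for the friend network, which is the effect of the sub-query with SELECT DISTINCT ?F.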
Received on Wednesday, 26 January 2011 17:57:34 UTC