Re: ISSUE-29: MINUS / NOT EXISTS from Andy Seaborne on 2010-02-03 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Wed, 03 Feb 2010 16:00:45 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: Lee Feigenbaum <lee@thefigtrees.net>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4B699DAD.5060109@talis.com>

On 03/02/2010 3:38 PM, Steve Harris wrote:
> On 3 Feb 2010, at 14:53, Andy Seaborne wrote:
>>
>> On 03/02/2010 1:42 PM, Lee Feigenbaum wrote:
>>>
>>>> ISSUE-29
>>>> Should negation be done via a binary operator on subqueries, a binary
>>>> operator within graph patterns, or a filter+subquery? ??
>>>> This is MINUS / NOT EXISTS issue.
>>>
>>> This might be a good place to start with a test case or three. Andy, do
>>> you happen to have any cases that illustrate the difference here?
>>>
>>> (I know that personally, my preference here is pretty much "no
>>> preference".)
>>
>> You gathered:
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009JulSep/0030.html
>>
>>
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009JulSep/0137.html
>>
>> shows that one can be used to implement the other in the majority of
>> cases (it's the nested optionals case we had in SPARQL 1.0 in a
>> different guise) but we don't have a definition of MINUS to accurately
>> compare. What's the cardinality in the results?
>>
>> So it's not about implementation as I understand the problem space today.
>>
>> I find it clearer to read and explain the EXISTS form - someone
>> (Simon?) said he'd asked the same of an SQL expert and EXISTS is
>> better understood by application writers than MINUS there.
>
> Wasn't that more to do with the syntax, than the semantics? I don't
> think anyone is suggesting a different syntax (from EXISTS) for minus.

Not as I remember the discussion - it was about the effect of the 
operators (i.e. semantics).

>
>> The additional condition in MINUS ("dom(μ) and dom(μ') are disjoint")
>> concerns me because I don't know what's being "MINUSed" - it's not a
>> regular set/multiset operation and certainly not like SQL MINUS which
>> is set difference.
>
> Cannot the definitions be made compatible? Either by making the corner
> cases where results differ errors, or by tweaking the definition of the
> minus operator?
>
> I'm strongly opposed to having something that looks like a binding set
> operator really be a macro for a bizarre kind of filter, or whatever it is.

NOT EXISTS is very close to OPTIONAL/filter(!bound).

It differs in that OPTIONAL/!bound can introduce artificial duplicates. 
And additional unnecessary work.

> Example of why it matters:
>
> Take: A NOT EXISTS { B }
>
> If NOT EXISTS is a binding set operator, I can bind A and B, and apply
> the logical operation across them (whatever it is) using whatever
> strategy is appropriate. Local, distributed, whatever.
>
> With the Filter definition I have to bind A, and iterate through the
> rows, checking the pattern in B rejecting rows that match in B. On
> clustered and distributed systems (which obviously I care about) this is
> very inefficient.
 >
> Now, in some cases it will be possible to tell by inspection that the
> Filter defn. has an equivalent set operation, and perform that set
> operation instead as an optimisation, but it seems like bad decision to
> require that inspection for no obvious reason.

It's most cases that are equivalent (for example, all BGP+Filter in RHS 
can be done) which si why I said it's not an implementation issue as I 
see it.

I can see how to do the NOT EXISTS with parallelism based on streaming 
in a cluster where I can't for MINUS because MINUS says "for all μ' in 
Ω2". That's a characteristic of my implementation - I don't have fully 
materialized intermediate results at all times (although I hope to find 
time to make it more sophisticated and switch strategies based on 
expected intermediate counts).

 Andy

>
> - Steve
>

Received on Wednesday, 3 February 2010 16:01:14 UTC