Re: ISSUE-139: uniform descriptions and implementations of constraint components

On 9/06/2016 12:12, Peter F. Patel-Schneider wrote:
> So what is the cost of executing something like
>    SELECT $this
>    WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } }
> vs
>    SELECT ?this
>    WHERE { FILTER NOT EXISTS { $this $predicate ?value
>                                FILTER ( sameTerm(?value,$hasValue) ) } }
> with $this, $predicate, and $hasValue pre-bound according to SHACL.
>
> Well, for starters it's not possible to tell because the definition of
> pre-binding is still broken in SHACL.  However, let's assume that it is
> something like joining a solution set to the body of the select.  That works
> for this example even though it diverges from the description of pre-binding
> in SHACL.
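>
> One way to approximate this "join a solution set to the body" reading
> in plain SPARQL is a VALUES block; ex:alice, ex:status, and "active"
> below are made-up stand-ins for the pre-bound terms:
>
>    PREFIX ex: <http://example.org/>
>    SELECT ?this
>    WHERE { VALUES (?this ?predicate ?hasValue) {
>              (ex:alice ex:status "active") }
>            FILTER NOT EXISTS { ?this ?predicate ?value
>                                FILTER ( sameTerm(?value,?hasValue) ) } }
>
> The two are not exactly equivalent, but the approximation is good
> enough for comparing costs.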
>
> As well, the outer part is the same, so let's just look at the inner part.
>
> The first turns out to be defined as
>    BGP(<something>,<something>,<something>)
> where the three somethings are the pre-bindings.  The cost of this is just the
> cost of a fully-specified lookup in the triple store.
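>
> With the stand-in terms from above, this is in effect a single ground
> probe of the store, as if evaluating
>
>    PREFIX ex: <http://example.org/>
>    ASK { ex:alice ex:status "active" }
>
> once per focus node.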
>
> But what is the cost of such a lookup?  The cost could range widely even up to
> the cost of iterating over all triples.  But what are the likely costs?  If
> the triple store is fully indexed then this will roughly be the cost of
> initiating a triple access, the cost of running through the index, and maybe
> the cost of getting the triple.  (This last might not be necessary, depending
> on exactly how the index works.)  Even these costs can vary wildly, but it is
> likely that the constant costs will be quite high.  If the triple store is not
> fully indexed, for example, if only subject-predicate pairs are indexed, the
> cost is going to be the same as the case below.
>
> The second turns out to be defined as
>    u = BGP(<something>,<something>,?value)
>    { u | ?value and <something> are the same RDF term }
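>
> Instantiated with the same stand-in terms, this amounts to scanning
> the objects of one subject/predicate pair and filtering them, roughly:
>
>    PREFIX ex: <http://example.org/>
>    SELECT ?value
>    WHERE { ex:alice ex:status ?value
>            FILTER ( sameTerm(?value,"active") ) }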
>
> What is the cost of evaluating this?  This is even harder to determine.  It is
> an easy optimization to turn this into the same as the previous lookup, in
> which case there is no difference.
>
> If this optimization is not done, then the cost is likely going to be the cost
> of a partially specified lookup and then iterating through the values.  The
> lookup is likely to be a bit cheaper than the cost of a fully specified
> lookup.  Each iteration is almost certainly going to be cheaper than the
> lookup, and likely to be much cheaper, so if the number of values is not large
> then the total cost is going to be close to the cost of a single lookup.
>
> So how many values can one expect?  That depends on the data.  In some cases
> there can be a lot of values, but the average number is very likely to be small.
>
> So in the end the difference, even in the unoptimized case, is likely to be
> quite modest, certainly nowhere near a factor of n / log n, for n the size of
> the triple store.

To be clear, I did not mean N to be the size of the triple store. It 
would be the size of the set of value nodes, e.g. the number of objects 
of a subject/predicate combination.

But basically you are saying there are lots of unknowns - it depends on 
the data, on the available indices, etc. I maintain my position that the 
difference is most likely N vs. log(N), because I assume most databases 
will have SP->O indices.
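
To put rough numbers on that: with N = 1,000,000 objects under a single 
subject/predicate pair, the scan-and-filter variant touches on the order 
of 10^6 entries, while a lookup through an SP->O index needs on the 
order of log2(10^6), i.e. about 20, comparisons.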

Anyway, you are the one who is suggesting that we should delete the 
ability to specify optimized queries for each of the three cases. So the 
burden is upon you to provide convincing evidence that for every 
possible constraint component, the difference between a generic query 
and a specialized query is negligible. I cannot see how you can possibly 
provide such evidence. There are far more complicated use cases "out 
there" than the trivial sh:hasValue scenario.

Holger



>
> Peter F. Patel-Schneider
> Nuance Communications
>
>
>
>
> On 06/06/2016 04:45 PM, Holger Knublauch wrote:
>> On 6/06/2016 22:14, Peter F. Patel-Schneider wrote:
>>> As far as I can tell, there are not going to be any significant inefficiencies
>>> in a single-implementation setup.  Even if the boilerplate solution is the
>>> only possibility, implementations of constraint components come down to
>>> starting out with the boilerplate and adding to it the code that implements
>>> the constraint component for property constraints.
>>>
>>> There are, admittedly, some potential inefficiencies in the boilerplate
>>> solution as the boilerplate is not modifiable.  For example, sh:hasValue will
>>> look something like
>>>
>>> SELECT $this ...
>>> WHERE { FILTER NOT EXISTS { [boilerplate]
>>>                                FILTER ( sameTerm(?value,$hasValue) ) } }
>>>
>>> If the SPARQL implementation cannot optimize out the query followed by a
>>> simple filter, then the above query will run slower than
>>>
>>> SELECT $this ...
>>> WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } }
>> I think you have contradicted yourself in this email. Yes, these
>> inefficiencies do exist and they are significant. The boilerplate solution
>> would first need to iterate over all potential values of the property, i.e.
>> have O(N) performance plus the overhead of a FILTER clause, while the direct
>> query has O(1) or O(log N) performance via a direct database lookup. A
>> crippled SHACL that doesn't allow users to benefit from database optimizations
>> will fail in the marketplace, and vendors will provide all kinds of native
>> extensions to work around the limits of the standard.
>>
>> Even if there were a mechanism for defining a single query for every case and
>> every constraint component (which I doubt), we would still require a mechanism
>> to overload them for these optimizations. So I would be OK with having
>> sh:defaultValidator as long as sh:propertyValidator remains in place.
>>
>> Holger
>>

Received on Thursday, 9 June 2016 03:03:49 UTC