Re: ISSUE-139: uniform descriptions and implementations of constraint components from Peter F. Patel-Schneider on 2016-06-09 (public-data-shapes-wg@w3.org from June 2016)

From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
Date: Wed, 8 Jun 2016 19:12:12 -0700
To: Holger Knublauch <holger@topquadrant.com>, public-data-shapes-wg <public-data-shapes-wg@w3.org>
Message-ID: <d5e959c9-479e-8201-5a36-37a17aac6e52@gmail.com>
So what is the cost of executing something like
  SELECT $this
  WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } }
vs
  SELECT ?this
  WHERE { FILTER NOT EXISTS { $this $predicate ?value
                              FILTER ( sameTerm(?value,$hasValue) ) } }
with $this, $predicate, and $hasValue pre-bound according to SHACL.

Well, for starters it's not possible to tell because the definition of
pre-binding is still broken in SHACL.  However, let's assume that it is
something like joining a solution set to the body of the select.  That works
for this example even though it diverges from the description of pre-binding
in SHACL.

As well, the outer part is the same, so let's just look at the inner part.

The first turns out to be defined as
  BGP(<something>,<something>,<something>)
where the three somethings are the pre-bindings.  The cost of this is just the
cost of a fully-specified lookup in the triple store.

But what is the cost of such a lookup?  The cost could range widely even up to
the cost of iterating over all triples.  But what are the likely costs?  If
the triple store is fully indexed then this will roughly be the cost of
initiating a triple access, the cost of running through the index, and maybe
the cost of getting the triple.  (This last might not be necessary, depending
on exactly how the index works.)  Even these costs can vary wildly, but it is
likely that the constant costs will be quite high.  If the triple store is not
fulling indexed, for example, if only subject-predicate pairs are indexed, the
cost is going to be the same as the case below.

The second turns out to be defined as
  u = BGP(<something>,<something>,?value)
  { u | ?value and <something> are the same RDF term }

What is the cost of evaluating this?  This is even harder to determine.  It is
an easy optimization to turn this into the same as the previous lookup, in
which case there is no difference.

If this optimization is not done, then the cost is likely going to be the cost
of a partially specified lookup and then iterating through the values.  The
lookup is likely to be a bit cheaper than the cost of a fully specified
lookup.  Each iteration is almost certainly going to be cheaper than the
lookup, and likely to be much cheaper, so if the number of values is not large
then the total cost is going to be close to the cost of a single lookup.

So how many values can one expect?  That depends on the data.  In some cases
there can be a lot of values, but the average number is very likely to be small.

So in the end the difference, even in the unoptimized case, is likely to be
quite modest, certainly nowhere near a factor of n / log n, for n the size of
the triple store.

Peter F. Patel-Schneider
Nuance Commmunications




On 06/06/2016 04:45 PM, Holger Knublauch wrote:
> On 6/06/2016 22:14, Peter F. Patel-Schneider wrote:
>> As far as I can tell, there are not going to be any significant inefficiencies
>> in a single-implementation setup.  Even if the boilerplate solution is the
>> only possibility implementations of constraint components come down to
>> starting out with the boilerplate and adding to it the code that implements
>> the constraint component for property constraints.
>>
>> There are, admittedly, some potential inefficiencies in the boilerplate
>> solution as the boilerplate is not modifiable.  For example, sh:hasValue will
>> look something like
>>
>> SELECT $this ...
>> WHERE { FILTER NOT EXISTS { [boilerplate]
>>                               FILTER ( sameTerm(?value,$hasValue) ) } }
>>
>> If the SPARQL implementation cannot optimize out the query followed by a
>> simple filter then the above query will run slower than
>>
>> SELECT $this ...
>> WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } }
> 
> I think you have contradicted yourself in this email. Yes, these
> inefficiencies do exist and they are significant. The boilerplate solution
> would first need to iterate over all potential values of the property, i.e.
> have O(n) performance plus the overhead of a FILTER clause, while the direct
> query has O(1) or O(log(N)) performance via a direct database lookup. A
> crippled SHACL that doesn't allow users to benefit from database optimizations
> will fail on the marketplace, and vendors will provide all kinds of native
> extensions to work around the limits of the standard.
> 
> Even if there was a mechanism for defining a single query for every case and
> every constraint component (which I doubt), then we still require a mechanism
> to overload them for these optimizations. So, I would be OK to having
> sh:defaultValidator as long as sh:propertyValidator remains in place.
> 
> Holger
>
Received on Thursday, 9 June 2016 02:12:43 UTC