Re: ISSUE-139: uniform descriptions and implementations of constraint components

On 06/08/2016 08:03 PM, Holger Knublauch wrote:
[...]
>> So in the end the difference, even in the unoptimized case, is likely to be
>> quite modest, certainly nowhere near a factor of n / log n, for n the size of
>> the triple store.
> 
> To be clear, I did not mean N to be the size of the triple store. It would be
> the size of the set of value nodes, e.g. the number of objects of a
> subject/predicate combination.
> 
> But basically you are saying there are lots of unknowns - it depends on the
> data, on the available indices etc. I maintain my position that the difference
> is most likely N to log(N), because I assume most databases will have SP->O
> indices.

So we are comparing N with log(N) for an N that is very likely to be small, often 1.
At that scale the difference between N and log(N) is not significant, so the
constant factors are going to dominate.  Which constants?  That depends on how
the triple store is designed, but there is likely to be a large constant that is
the same for both approaches, so the overall difference is likely to be small.
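To make the point concrete, here is a minimal sketch (all names and data are invented for illustration) comparing an O(N) scan with an O(log N) search over a set of value nodes: for the small N typical of a subject/predicate pair, the two do almost the same amount of work, so fixed per-lookup overhead dominates either way.

```python
from bisect import bisect_left

def linear_contains(values, target):
    # O(N) scan over the value nodes
    return any(v == target for v in values)

def binary_contains(sorted_values, target):
    # O(log N) binary search over the same (sorted) value nodes
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target

# With N = 3 the two strategies differ by at most a couple of comparisons.
values = ["ex:a", "ex:b", "ex:c"]
assert linear_contains(values, "ex:b") and binary_contains(values, "ex:b")
```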

And this is for the case where the SPARQL implementation does not perform an
easy optimization.  (I don't have much faith that most SPARQL implementations
actually perform it, but implementing the current design of SHACL is going to
require significant additions to SPARQL, so adding this optimization to a query
optimizer is likely to be easy in comparison.)
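The sh:hasValue case under discussion can be sketched as follows (a toy model, not any actual implementation; the dictionary stands in for an SP->O index): the generic strategy enumerates all value nodes and tests each, while the specialized query is a single direct lookup, which is exactly the rewrite an optimizer could perform.

```python
# Hypothetical SP->O index: (subject, predicate) -> set of objects
spo_index = {
    ("ex:alice", "ex:knows"): {"ex:bob", "ex:carol"},
}

def has_value_generic(subject, predicate, required):
    # Generic strategy: enumerate every value node, then test each one.
    value_nodes = spo_index.get((subject, predicate), set())
    return any(v == required for v in value_nodes)

def has_value_specialized(subject, predicate, required):
    # Specialized strategy: ask directly whether the triple exists,
    # the form a query optimizer could rewrite the generic query into.
    return required in spo_index.get((subject, predicate), set())

assert has_value_generic("ex:alice", "ex:knows", "ex:bob")
assert has_value_specialized("ex:alice", "ex:knows", "ex:bob")
```

Both strategies touch the same index entry; the difference is only in how many of its members are inspected.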

> Anyway, you are the one who is suggesting that we should delete the ability to
> specify optimized queries for each of the three cases. So the burden is upon
> you to provide convincing evidence that for every possible constraint
> component, the difference between a generic query and a specialized query is
> negligible. I cannot see how you can possibly provide such evidence. There are
> far more complicating use cases "out there" than the trivial sh:hasValue
> scenario.

What other cases?  Any constraint component that requires iterating over the
value nodes is going to incur the same cost for this iteration.  The only
difference is whether the SPARQL implementation can optimize the boilerplate so
that it doesn't perform both the forward and inverse lookups.  This is another
easy optimization.
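A sketch of that second optimization (again with invented names and a toy index): the boilerplate tries both a forward lookup and an inverse lookup for the value nodes, but when the shape declares only a forward path the inverse branch is statically empty and can be pruned before any data access.

```python
# Hypothetical indices: the shape in this example declares only a forward path.
forward_index = {("ex:alice", "ex:knows"): {"ex:bob"}}
inverse_index = {}  # no inverse paths declared

def value_nodes_naive(subject, predicate):
    # Boilerplate as written: always evaluates both branches.
    return (forward_index.get((subject, predicate), set())
            | inverse_index.get((predicate, subject), set()))

def value_nodes_optimized(subject, predicate):
    # Optimized: skip the inverse branch when it is known to be empty.
    nodes = forward_index.get((subject, predicate), set())
    if inverse_index:
        nodes = nodes | inverse_index.get((predicate, subject), set())
    return nodes

assert value_nodes_naive("ex:alice", "ex:knows") == {"ex:bob"}
assert value_nodes_optimized("ex:alice", "ex:knows") == {"ex:bob"}
```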

The only core constraint components that don't need to iterate over the value
nodes are sh:not, sh:and, and sh:or, and these don't themselves do any data
access.  That leaves sh:hasValue as the only core constraint component where
the actual data access differs.  I also don't know of any non-core constraint
component that doesn't need to iterate over the value nodes.

> Holger

peter

Received on Thursday, 9 June 2016 11:08:04 UTC