Re: ISSUE-139: uniform descriptions and implementations of constraint components

I don't want to get into long email discussions about whether O(n) is 
comparable to O(log(n)) and under what circumstances. As computer 
scientists we hopefully agree on the maths, and heuristics beyond that are 
of course speculative.

I had proposed a variation of the boilerplate code that would address 
this problem, because it would allow the query to use patterns like

     $this $PATH $hasValue .

https://www.w3.org/2014/data-shapes/wiki/Proposals#Topic_2:_Boilerplate_Macro_injection_for_SPARQL
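
To illustrate the difference for sh:hasValue, here is a rough sketch of 
the two query forms. This is only an illustration, not the exact text of 
the wiki proposal; $this, $PATH and $hasValue are placeholders that the 
engine is assumed to substitute or pre-bind as usual.

     # (a) With pattern injection: the path is written directly into the
     #     query, so a single SP->O lookup can answer the question.
     SELECT $this
     WHERE {
         FILTER NOT EXISTS { $this $PATH $hasValue . }
     }

     # (b) With the boilerplate only prepended: the query enumerates every
     #     value node of the path and then tests each one, unless the
     #     optimizer notices that ?value is only compared to a constant.
     SELECT $this
     WHERE {
         FILTER NOT EXISTS {
             $this $PATH ?value .
             FILTER (?value = $hasValue)
         }
     }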

This would resolve the performance difference and generally offer better 
expressivity than only inserting the boilerplate at the beginning of the 
query. I hope we can move forward on that basis.

Holger


On 9/06/2016 21:07, Peter F. Patel-Schneider wrote:
> On 06/08/2016 08:03 PM, Holger Knublauch wrote:
> [...]
>>> So in the end the difference, even in the unoptimized case, is likely to be
>>> quite modest, certainly nowhere near a factor of n / log n, for n the size of
>>> the triple store.
>> To be clear, I did not mean N to be the size of the triple store. It would be
>> the size of the set of value nodes, e.g. the number of objects of a
>> subject/predicate combination.
>>
>> But basically you are saying there are lots of unknowns - it depends on the
>> data, on the available indices etc. I maintain my position that the difference
>> is most likely N to log(N), because I assume most databases will have SP->O
>> indices.
> So N vs log(N) for N that is very likely to be a small number, often 1.  Here
> the difference between N and log(N) is not significant so the constants are
> going to dominate.  Which constants are going to dominate?  It depends on how
> the triple store is designed, but it is likely that there will be a large
> constant that is the same for both, so the difference is likely to be small.
>
> And this is for the case that the SPARQL implementation does not do an easy
> optimization.  (I don't have much faith that most SPARQL implementations
> actually do this easy optimization, but implementing the current design of
> SHACL is going to require significant additions to SPARQL so adding this
> optimization to a query optimizer is likely going to be easy in comparison.)
>
>> Anyway, you are the one who is suggesting that we should delete the ability to
>> specify optimized queries for each of the three cases. So the burden is upon
>> you to provide convincing evidence that for every possible constraint
>> component, the difference between a generic query and a specialized query is
>> negligible. I cannot see how you can possibly provide such evidence. There are
>> far more complicating use cases "out there" than the trivial sh:hasValue
>> scenario.
> What other cases?  Any constraint component that requires iterating over the
> value nodes is going to have the same cost for this iteration.  The only
> difference is going to be whether the SPARQL implementation can optimize the
> boilerplate so that it doesn't do both the forward and inverse lookups.  This
> is another easy optimization.
>
> The only other core constraint components that don't need to iterate over the
> value nodes are sh:not, sh:and, and sh:or, but these don't themselves do any
> data access.  That leaves sh:hasValue as the only core constraint component
> where the actual data access is different.  I also don't know of any non-core
> constraint component that doesn't need to iterate over the value nodes.
>
>> Holger
> peter

Received on Friday, 17 June 2016 01:03:18 UTC