- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Wed, 8 Jun 2016 19:12:12 -0700
- To: Holger Knublauch <holger@topquadrant.com>, public-data-shapes-wg <public-data-shapes-wg@w3.org>
So what is the cost of executing something like SELECT $this WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } } vs SELECT ?this WHERE { FILTER NOT EXISTS { $this $predicate ?value FILTER ( sameTerm(?value,$hasValue) ) } } with $this, $predicate, and $hasValue pre-bound according to SHACL. Well, for starters it's not possible to tell because the definition of pre-binding is still broken in SHACL. However, let's assume that it is something like joining a solution set to the body of the select. That works for this example even though it diverges from the description of pre-binding in SHACL. As well, the outer part is the same, so let's just look at the inner part. The first turns out to be defined as BGP(<something>,<something>,<something>) where the three somethings are the pre-bindings. The cost of this is just the cost of a fully-specified lookup in the triple store. But what is the cost of such a lookup? The cost could range widely even up to the cost of iterating over all triples. But what are the likely costs? If the triple store is fully indexed then this will roughly be the cost of initiating a triple access, the cost of running through the index, and maybe the cost of getting the triple. (This last might not be necessary, depending on exactly how the index works.) Even these costs can vary wildly, but it is likely that the constant costs will be quite high. If the triple store is not fulling indexed, for example, if only subject-predicate pairs are indexed, the cost is going to be the same as the case below. The second turns out to be defined as u = BGP(<something>,<something>,?value) { u | ?value and <something> are the same RDF term } What is the cost of evaluating this? This is even harder to determine. It is an easy optimization to turn this into the same as the previous lookup, in which case there is no difference. If this optimization is not done, then the cost is likely going to be the cost of a partially specified lookup and then iterating through the values. The lookup is likely to be a bit cheaper than the cost of a fully specified lookup. Each iteration is almost certainly going to be cheaper than the lookup, and likely to be much cheaper, so if the number of values is not large then the total cost is going to be close to the cost of a single lookup. So how many values can one expect? That depends on the data. In some cases there can be a lot of values, but the average number is very likely to be small. So in the end the difference, even in the unoptimized case, is likely to be quite modest, certainly nowhere near a factor of n / log n, for n the size of the triple store. Peter F. Patel-Schneider Nuance Commmunications On 06/06/2016 04:45 PM, Holger Knublauch wrote: > On 6/06/2016 22:14, Peter F. Patel-Schneider wrote: >> As far as I can tell, there are not going to be any significant inefficiencies >> in a single-implementation setup. Even if the boilerplate solution is the >> only possibility implementations of constraint components come down to >> starting out with the boilerplate and adding to it the code that implements >> the constraint component for property constraints. >> >> There are, admittedly, some potential inefficiencies in the boilerplate >> solution as the boilerplate is not modifiable. For example, sh:hasValue will >> look something like >> >> SELECT $this ... >> WHERE { FILTER NOT EXISTS { [boilerplate] >> FILTER ( sameTerm(?value,$hasValue) ) } } >> >> If the SPARQL implementation cannot optimize out the query followed by a >> simple filter then the above query will run slower than >> >> SELECT $this ... >> WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } } > > I think you have contradicted yourself in this email. Yes, these > inefficiencies do exist and they are significant. The boilerplate solution > would first need to iterate over all potential values of the property, i.e. > have O(n) performance plus the overhead of a FILTER clause, while the direct > query has O(1) or O(log(N)) performance via a direct database lookup. A > crippled SHACL that doesn't allow users to benefit from database optimizations > will fail on the marketplace, and vendors will provide all kinds of native > extensions to work around the limits of the standard. > > Even if there was a mechanism for defining a single query for every case and > every constraint component (which I doubt), then we still require a mechanism > to overload them for these optimizations. So, I would be OK to having > sh:defaultValidator as long as sh:propertyValidator remains in place. > > Holger >
Received on Thursday, 9 June 2016 02:12:43 UTC