Re: ISSUE-139: uniform descriptions and implementations of constraint components from Holger Knublauch on 2016-06-07 (public-data-shapes-wg@w3.org from June 2016)

From: Holger Knublauch <holger@topquadrant.com>
Date: Tue, 7 Jun 2016 16:24:24 +1000
To: public-data-shapes-wg <public-data-shapes-wg@w3.org>
Message-ID: <e042542f-0e10-05c2-b675-a78c28100414@topquadrant.com>
On 7/06/2016 16:02, Dimitris Kontokostas wrote:
>
>
> On Tue, Jun 7, 2016 at 2:45 AM, Holger Knublauch 
> <holger@topquadrant.com> wrote:
>
>     On 6/06/2016 22:14, Peter F. Patel-Schneider wrote:
>
>         As far as I can tell, there are not going to be any
>         significant inefficiencies
>         in a single-implementation setup.  Even if the boilerplate
>         solution is the
>         only possibility implementations of constraint components come
>         down to
>         starting out with the boilerplate and adding to it the code
>         that implements
>         the constraint component for property constraints.
>
>         There are, admittedly, some potential inefficiencies in the
>         boilerplate
>         solution as the boilerplate is not modifiable.  For example,
>         sh:hasValue will
>         look something like
>
>         SELECT $this ...
>         WHERE { FILTER NOT EXISTS { [boilerplate]
>                                       FILTER ( sameTerm($this,$hasValue) ) } }
>
>         If the SPARQL implementation cannot optimize out the query
>         followed by a
>         simple filter then the above query will run slower than
>
>         SELECT $this ...
>         WHERE { FILTER NOT EXISTS { $this $predicate $hasValue } }
>
>
>     I think you have contradicted yourself in this email. Yes, these
>     inefficiencies do exist and they are significant. The boilerplate
>     solution would first need to iterate over all potential values of
>     the property, i.e. have O(n) performance plus the overhead of a
>     FILTER clause, while the direct query has O(1) or O(log(N))
>     performance via a direct database lookup. A crippled SHACL that
>     doesn't allow users to benefit from database optimizations will
>     fail on the marketplace, and vendors will provide all kinds of
>     native extensions to work around the limits of the standard.
>
>     Even if there was a mechanism for defining a single query for
>     every case and every constraint component (which I doubt), then we
>     still require a mechanism to overload them for these
>     optimizations. So, I would be OK to having sh:defaultValidator as
>     long as sh:propertyValidator remains in place.
>
>
> Personally I would take it a small step further to achieve further 
> optimizations. i.e. have a sh:defaultValidator and then zero or more 
> sh:filteredValidators.
> A filtered validator would override the default validator based on
> 1) context (as we do already)
> 2) parameter values (e.g. for sh:minCount = 1)
> 3) platform specific information (e.g. sparql engine, sparql version etc)
>
> This is already supported in RDFUnit (mainly #2 now) and it is defined 
> with an ASK query like "ASK { FILTER ($minCount = 1)}" / "ASK { FILTER 
> ($minCount > 1)}"

Sounds very good to me. I guess what we would need is a new property on 
the SPARQL validators that points to zero or one such precondition. As you 
state, these could be ASK queries, assuming we combine them with an 
integer for ordering (otherwise the preconditions would all need to be 
completely disjoint). This would be a generic solution for things like 
vendor-specific optimizations. So maybe

ex:MyValidator
     a sh:SPARQLAskValidator ;
     sh:ask "... the actual query ..." ;
     sh:order 3 ;
     sh:filter "... return true if applicable ..." .

(using sh:order would allow an engine to start with the most likely 
match first, and would make the filter logic simpler).
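
A minimal sketch of how an engine might pick a validator under this 
scheme, written in Python rather than SPARQL so that it is 
self-contained. The Validator class, the select_validator function, and 
the use of plain Python callables as stand-ins for the ASK preconditions 
are all assumptions for illustration; none of them come from a SHACL 
draft. The two preconditions mirror the $minCount = 1 and $minCount > 1 
examples quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

# Hypothetical stand-in for parameter bindings such as {"minCount": 1}.
Bindings = Dict[str, int]


@dataclass
class Validator:
    name: str
    order: int                           # plays the role of sh:order
    applies: Callable[[Bindings], bool]  # plays the role of the sh:filter ASK query


def select_validator(
    candidates: Iterable[Validator],
    bindings: Bindings,
    default: Optional[Validator] = None,
) -> Optional[Validator]:
    """Return the first candidate, by ascending order, whose precondition holds."""
    for validator in sorted(candidates, key=lambda v: v.order):
        if validator.applies(bindings):
            return validator
    # No precondition matched: fall back to the default validator, if any.
    return default


default = Validator("default", order=0, applies=lambda b: True)
candidates = [
    Validator("minCountGeneral", order=2, applies=lambda b: b.get("minCount", 0) > 1),
    Validator("minCountOne", order=1, applies=lambda b: b.get("minCount") == 1),
]

chosen = select_validator(candidates, {"minCount": 1}, default)
print(chosen.name)
```

Because candidates are tried in ascending order and the first match wins, 
the preconditions in this sketch do not have to be disjoint.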

Do you have a sense of how complex these preconditions would become? 
And would they need to operate on the data graph or the shapes graph? 
The latter may make a performance difference, as the selection would 
then need to be executed only once per shapes graph. Yet I believe 
access to the data graph may be needed. In some cases a query may simply 
be of the form ASK { FILTER bound(?productX) }, which does not even 
require a lookup on any graph. I also expect queries to look slightly 
different depending on the type of database.
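
To illustrate the "once per shapes graph" point: if a precondition 
depends only on the shape's parameters, the result of the selection can 
be cached and reused for every focus node. The sketch below is in Python 
and rests on that assumption; CachedSelection and choose are hypothetical 
names, and choose stands in for whatever evaluates the preconditions.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, int]
Key = Tuple[str, Tuple[Tuple[str, int], ...]]


class CachedSelection:
    """Memoise validator selection per (component, parameter values)."""

    def __init__(self, choose: Callable[[str, Params], str]) -> None:
        self._choose = choose
        self._cache: Dict[Key, str] = {}
        self.evaluations = 0  # how often the preconditions were actually run

    def get(self, component: str, params: Params) -> str:
        # Sort the parameters so that equal bindings give the same cache key.
        key = (component, tuple(sorted(params.items())))
        if key not in self._cache:
            self.evaluations += 1
            self._cache[key] = self._choose(component, params)
        return self._cache[key]


def choose(component: str, params: Params) -> str:
    # Hypothetical precondition logic that looks only at shape parameters.
    return "minCountOne" if params.get("minCount") == 1 else "default"


selection = CachedSelection(choose)
for _focus_node in range(1000):  # many focus nodes, one shape
    selection.get("sh:minCount", {"minCount": 1})
print(selection.evaluations)
```

A precondition that has to inspect the data graph could not be cached 
this way and would have to be re-evaluated whenever the data changes.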

Holger
Received on Tuesday, 7 June 2016 06:24:57 UTC