Re: ISSUE-139: implementing (core) constraint components universally

Hi Peter,

thanks for the discussion - this is an important topic and worth 
drilling into.

On 4/06/2016 0:12, Peter F. Patel-Schneider wrote:
> My original message in this thread is mostly concerned with how to implement
> constraint components universally.  There is also a short preamble on how
> one can best describe constraint components.  I'm going to defend both
> of these points separately, but I'm going to start with the implementation
> point as that is the bulk of my original message.
>
>
> Right now constraint components have up to three different implementations -
> one when they occur in a property constraint, one when they occur in an
> inverse property constraint, and one when they occur in a node constraint.
> This means that there are up to three different pieces of code for each
> constraint component, each (hopefully) implementing the same functionality.
> I view this as a poor setup - three different pieces of code that have to be
> written and thus three places where the bugs can be introduced.

I fully agree.

>
> Having a single implementation of each constraint component would actually
> reduce development costs.  Ideally, this single implementation would be as
> simple as the ask validators that implement many constraint components.
> Consider, for example, sh:minCount whose implementation should be very
> little more than "HAVING ( COUNT (DISTINCT ?value) < ?minCount )".

Yes, if this were possible then this would be ideal.

>    However,
> I can't figure out how to do this nicely because of limitations in SPARQL,
> hence the solution with boilerplate.

That is exactly the conclusion I have reached as well. I also remember 
long phone discussions with Arthur in November: he too questioned why we 
cannot combine all these cases, but he did not come up with a better 
solution either. If all three of us cannot find one, then maybe there is 
none.

>    However, even the boilerplate solution
> has only one implementation of each constraint component, and here one is
> definitely better than three and also better than two.

The boilerplate solution that you have described is already covered in 
the spec. See section 6.5.2 on ASK validators, which enumerates these 
boilerplate snippets as "templates":

http://w3c.github.io/data-shapes/shacl/#SPARQLAskValidator
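
For illustration, such an ASK validator is just a single query over the 
pre-bound ?value (and $this) variables, with true meaning "the value 
conforms". A dash:hasMaxLength-style definition would look roughly like 
this (a sketch only - I am writing sh:ask for the query property, the 
exact spelling in dash.ttl may differ):

     dash:hasMaxLength
         a sh:SPARQLAskValidator ;
         sh:message "Value has more than {$maxLength} characters" ;
         sh:ask """
             ASK {
                 FILTER (STRLEN(str(?value)) <= $maxLength) .
             }
             """ .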

Users already have the choice to just specify a single ASK query to 
cover all three cases. In my current implementation this technique is 
used for a large number of constraint components. To reproduce, open the 
attached copy of dash.ttl and run this query:

PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT *
WHERE {
     ?cc a sh:ConstraintComponent .
     OPTIONAL {
         ?cc sh:nodeValidator ?nodeValidator
     }
     OPTIONAL {
         ?cc sh:propertyValidator ?propValidator
     }
     OPTIONAL {
         ?cc sh:inversePropertyValidator ?invValidator
     }
}

Results:

14 constraint components currently use ASK queries:

sh:ClassConstraintComponent          dash:hasClass        dash:hasClass        dash:hasClass
sh:ClassInConstraintComponent        dash:hasClassIn      dash:hasClassIn      dash:hasClassIn
sh:DatatypeConstraintComponent       dash:hasDatatype     dash:hasDatatype
sh:DatatypeInConstraintComponent     dash:hasDatatypeIn   dash:hasDatatypeIn
sh:InConstraintComponent             dash:isIn            dash:isIn            dash:isIn
sh:MaxExclusiveConstraintComponent   dash:hasMaxExclusive dash:hasMaxExclusive
sh:MaxInclusiveConstraintComponent   dash:hasMaxInclusive dash:hasMaxInclusive
sh:MaxLengthConstraintComponent      dash:hasMaxLength    dash:hasMaxLength    dash:hasMaxLength
sh:MinExclusiveConstraintComponent   dash:hasMinExclusive dash:hasMinExclusive
sh:MinInclusiveConstraintComponent   dash:hasMinInclusive dash:hasMinInclusive
sh:MinLengthConstraintComponent      dash:hasMinLength    dash:hasMinLength    dash:hasMinLength
sh:NodeKindConstraintComponent       dash:hasNodeKind     dash:hasNodeKind     dash:hasNodeKind
sh:PatternConstraintComponent        dash:hasPattern      dash:hasPattern      dash:hasPattern
sh:StemConstraintComponent           dash:hasStem         dash:hasStem         dash:hasStem

The remaining 15 are heterogeneous and do not easily fit into that 
scheme:

- sh:DisjointConstraintComponent
- sh:LessThanConstraintComponent
- sh:LessThanOrEqualsConstraintComponent
These look like they could be turned into ASK queries (a sketch follows 
below), so please count them in the category above.
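
For example, a universal ASK validator for sh:lessThan could plausibly 
be written against the pre-bound ?value variable, something like this 
untested sketch:

     ASK {
         FILTER NOT EXISTS {
             $this $lessThan ?otherValue .
             FILTER (!(?value < ?otherValue)) .
         }
     }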

- sh:AndConstraintComponent
- sh:NotConstraintComponent
- sh:OrConstraintComponent
- sh:ShapeConstraintComponent
These have SELECT queries because the ASK schema (currently) does not 
support handling of the ?failure variable. I cannot tell yet how common 
?failure handling will be and whether we need to come up with a 
different syntax for it. The problem is that ASK can only return true or 
false, but not a third value, and there is no "exception" reporting in 
SPARQL.
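
To illustrate, the current SELECT-based validators signal a failure by 
projecting a ?failure variable. Roughly, for sh:not (a sketch modelled 
on the pattern in dash.ttl; details may differ):

     SELECT $this ($this AS ?subject) $predicate ?value ?failure
     WHERE {
         $this $predicate ?value .
         BIND (sh:hasShape(?value, $not, $shapesGraph) AS ?hasShape) .
         BIND (!bound(?hasShape) AS ?failure) .
         FILTER (?failure || ?hasShape) .
     }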

- sh:ClosedConstraintComponent
This uses a SELECT query because the result variable ?predicate is 
different each time and needs to be computed as part of the WHERE clause.

- sh:EqualsConstraintComponent
This is a SELECT query because it requires two branches in a UNION, 
roughly as sketched below. The boilerplate would not work, IMHO.
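
The property validator has to check both directions (again just a sketch 
of the query shape, not the exact text in dash.ttl):

     SELECT $this ($this AS ?subject) $predicate ?value
     WHERE {
         {
             $this $predicate ?value .
             FILTER NOT EXISTS { $this $equals ?value }
         }
         UNION
         {
             $this $equals ?value .
             FILTER NOT EXISTS { $this $predicate ?value }
         }
     }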

- sh:HasValueConstraintComponent
This does not fit into the ASK schema. While theoretically it would be 
possible to use

     ASK { FILTER sameTerm(?value, $hasValue) }

for node constraints, this would be prohibitively slow for the 
predicate-based constraints. The query in those cases looks very 
different:

     SELECT $this ($this AS ?object) $predicate
     WHERE {
         FILTER NOT EXISTS { $hasValue $predicate $this }
     }

Furthermore, this is an existential FILTER that does not follow the 
"usual" pattern.

- sh:MaxCountConstraintComponent
- sh:MinCountConstraintComponent
- sh:QualifiedMaxCountConstraintComponent
- sh:QualifiedMinCountConstraintComponent
These use yet another pattern, where there is either a HAVING clause 
with an aggregation or a nested query, as sketched below.
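
A rough sketch of the HAVING-based pattern for sh:minCount (assuming, as 
elsewhere, that $predicate and $minCount are pre-bound; the actual query 
in dash.ttl may differ in details):

     SELECT $this ($this AS ?subject) $predicate (COUNT(DISTINCT ?value) AS ?count)
     WHERE {
         OPTIONAL {
             $this $predicate ?value .
         }
     }
     GROUP BY $this $predicate
     HAVING (COUNT(DISTINCT ?value) < $minCount)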

- sh:UniqueLangConstraintComponent
This is a SELECT query because ?lang is also returned so that it can be 
used in the sh:message. Also, there should only be one validation result 
per ?lang, so the query needs to be a SELECT DISTINCT (see the sketch 
below).
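
Roughly (a sketch mirroring the inverse-property variant quoted further 
below, with the direction of $predicate flipped):

     SELECT DISTINCT $this ($this AS ?subject) $predicate ?lang
     WHERE {
         {
             FILTER ($uniqueLang) .
         }
         $this $predicate ?value .
         BIND (lang(?value) AS ?lang) .
         FILTER (bound(?lang) && ?lang != "") .
         FILTER EXISTS {
             $this $predicate ?otherValue .
             FILTER (?otherValue != ?value && ?lang = lang(?otherValue)) .
         }
     }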

*So among the 12 constraint components that are currently not covered by 
ASKs, there are already 6 different design patterns.*

And we have not even started to look into extensions. Whatever further 
generalization we come up with will almost certainly restrict the 
expressivity of SHACL to a subset of SPARQL, and that would be a 
show-stopper.

And then we have not even started to look into other extension languages 
like JavaScript... The current infrastructure is set up so that each 
case can have multiple validators, in multiple languages. A 
JavaScript-based implementation will likely not use SPARQL but instead 
have completely different code paths to walk the objects being validated.

Having thought about all these topics for many months now, my conclusion 
is that we will continue to need the flexibility of multiple validators 
for the different cases. In a large number of cases a single ASK query 
will cover all three of them, and in many others people will only need 
to develop one query for node constraints and one for path-based 
constraints. I am convinced that this will be acceptable (assuming you 
agree we should support paths - your own proposal had them).


>
>
> Describing all constraint components in a similar fashion is also desirable
> to describing them differently.  Right now some constraint components, e.g.,
> sh:class, are described using the notion of value nodes but others, e.g.,
> sh:minCount, are described using focus nodes and predicates even when the
> effect is the same as value nodes.

You keep bringing up the same first paragraphs in the spec :) These are 
usually just editorial left-overs from the days when the spec only 
covered one direction. They are easy to fix once spotted:

https://github.com/w3c/data-shapes/commit/66525d7f3f784822806f5d74e54818206d01d6ef

Can you find any more such examples? They are just editorial mistakes.

>    Regularizing the way that constraint
> components are described would reduce the number of ways that errors can
> creep into the document and also reduce the cognitive load on readers of the
> document.  Describing constraint components in terms of value nodes also
> better shows the commonalities amongst them.
>
> It is currently possible to have a constraint component that works
> completely differently when it is in a node constraint from when it is in a
> property constraint and from when it is in an inverse property constraint.
> Using the notion of value nodes produces a force against this divergence.

Agreed.

>
>
> These issues arise from having all constraint components sit inside the
> three different kinds of constraints and having each constraint component
> being responsible for its own determination of value nodes.  There are
> different approaches to SHACL that would eliminate these issues.  ShEx has a
> single property-crossing construct and all other constructs in triple
> expressions are not concerned with properties.  OWL has several
> property-crossing constructs but most constructs in OWL work on individual
> value nodes.  My refactored SHACL syntax has a single property-crossing
> construct and all constructs work on sets of value nodes.

I have explained above why there are differences in the queries, and why 
these differences matter (e.g. in the case of sh:hasValue). While I 
share your desire to further generalize and clean up the language, there 
is a point where this becomes impractical or restricts the expressive 
power of what customers will want to do. And looking back at many years 
of working with customers and SPIN, the only thing we can predict is 
that we cannot predict the variety of use cases. We need to design the 
language to cater for this flexibility, not make premature assumptions 
based on the limited set of examples that happen to be in the Core 
Vocabulary.

Thanks,
Holger


>
> peter
>
>
> On 06/02/2016 10:13 PM, Holger Knublauch wrote:
>> Could you help me understand why we should do this? All I am seeing is that
>> this would add complexity to the language, add development costs for these
>> additional cases, increase our burden to specify and write test cases for all
>> these scenarios, for the "benefit" that people can apply entirely useless
>> constructs such as minCount with node constraints or datatypes for subjects
>> which can never be literals.
>>
>> Furthermore, deleting the concept of sh:context makes it impossible for tools
>> to determine under which conditions a constraint component should be offered.
>> The forms that I have implemented would display every constraint property on
>> every case - node constraints, property constraints, inverse property
>> constraints. This is not user friendly!
>>
>> Finally, every extension developer is forced to specify SPARQL queries for all
>> cases, even if they make no sense (like most of the cases below). Some of the
>> queries that you have written up are completely different from their other
>> variations. How can you be sure that the same generalization is sensible for
>> every possible future extension?
>>
>> As a random example consider one of the original Use cases: specifying a
>> primary key. These are only ever meant to be used for properties, neither
>> inverses nor node constraints nor paths.
>>
>> https://www.w3.org/TR/shacl-ucr/#uc25-primary-keys-with-uri-patterns
>>
>> I must be missing something, but this is a massive step backwards and a
>> serious risk to the success of SHACL. There is nothing broken right now with
>> the context mechanism. Why change it?
>>
>> Thanks,
>> Holger
>>
>>
>> On 3/06/2016 7:19, Peter F. Patel-Schneider wrote:
>>> To think about how a constraint component works universally, it is
>>> sufficient to think about value nodes, which are already defined at the
>>> beginning of Section 4.
>>>
>>> So, sh:hasValue is then just that a value node is the given node and
>>> sh:equals is just that the set of value nodes is the same as the set of
>>> values for the focus node for the other property and sh:closed is just that
>>> every value node has no values for disallowed properties and sh:minCount is
>>> just that there are at least n value nodes.
>>>
>>>
>>> Looking at https://github.com/TopQuadrant/shacl the changes to permit core
>>> constraint components to be used universally appear to be as follows:
>>>
>>> 1/ Ensure that sh:context has all three relevant values for each constraint
>>> component.  (Of course then sh:context becomes irrelevant and can be
>>> removed.)
>>>
>>> 2/ For the constraint component for:
>>>
>>> sh:closed add
>>>     sh:propertyValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Predicate {?unallowed} is not allowed on {?subject} (closed
>>> shape)" ;
>>>         sh:sparql """
>>>          SELECT ?this (?val AS ?subject) ?unallowed ?object
>>>          WHERE {
>>>              {
>>>                  FILTER ($closed) .
>>>              }
>>>              $this $predicate ?val .
>>>              ?val ?unallowed ?object .
>>>              FILTER (NOT EXISTS {
>>>                  GRAPH $shapesGraph {
>>>                      $currentShape sh:property/sh:predicate ?unallowed .
>>>                  }
>>>              } && (!bound($ignoredProperties) || NOT EXISTS {
>>>                  GRAPH $shapesGraph {
>>>                      $ignoredProperties rdf:rest*/rdf:first ?unallowed .
>>>                  }
>>>              }))
>>>          }
>>> """ ;
>>> Similar for inverse property constraint.
>>> sh:closed should also be implementable using the simple form (like
>>> sh:datatype and sh:minExclusive are).
>>>
>>> sh:datatype    add dash:hasDatatype as a value for sh:inversePropertyValidator
>>> sh:datatypeIn    add dash:hasDatatypeIn as a value for
>>> sh:inversePropertyValidator
>>>
>>> sh:hasValue    add
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Node is not value {$hasValue}" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE {
>>>              FILTER { NOT sameTerm($this,$hasValue) }
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:disjoint add
>>>     sh:inversePropertyValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Inverse of property must not share any values with
>>> {$disjoint}" ;
>>>         sh:sparql """
>>>          SELECT $this ($this AS ?object) $predicate ?subject
>>>          WHERE {
>>>              ?subject $predicate $this .
>>>              ?subject $disjoint $this  .
>>>          }
>>>          """ ;
>>>       ] ;
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Node must not be a value of {$disjoint}" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE {
>>>              $this $disjoint ?this .
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:equals add
>>>     sh:inversePropertyValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Inverse of property must have same values as {$equals}" ;
>>>         sh:sparql """
>>>          SELECT $this ($this AS ?object) $predicate ?subject
>>>          WHERE {
>>>              {
>>>                  ?subject $predicate $this .
>>>                  FILTER NOT EXISTS {
>>>                      ?subject $equals $this  .
>>>                  }
>>>              }
>>>              UNION
>>>              {
>>>                  ?subject $equals $this .
>>>                  FILTER NOT EXISTS {
>>>                      ?subject $predicate $this .
>>>                  }
>>>              }
>>>          }
>>>          """ ;
>>>       ] ;
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Node must be a value of {$equals}" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE {
>>>              FILTER NOT EXISTS { $this $disjoint $this }
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:lessThan add
>>>     sh:InversePropertyValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Inverse property value is not < value of {$lessThan}" ;
>>>         sh:sparql """
>>>          SELECT $this ($this AS ?object) $predicate ?subject
>>>          WHERE {
>>>              ?subject $predicate $this  .
>>>                $this $lessThan ?object2  .
>>>              FILTER (!(?subject < ?object2)) .
>>>          }
>>>          """ ;
>>>       ] ;
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Node is not < value of {$lessThan}" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE {
>>>              $this $lessThan ?object2 .
>>>              FILTER (!(?this < ?object2)) .
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:lessThanOrEquals similar
>>>
>>> sh:minCount add
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Node is precisely one value, not {$minCount}" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE {
>>>              FILTER ( 1 >= $minCount) .
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:maxCount similar
>>>
>>> sh:maxExclusive    add dash:hasMaxExclusive as a value for
>>> sh:inversePropertyValidator
>>>
>>> sh:maxInclusive    add dash:hasMaxInclusive as a value for
>>> sh:inversePropertyValidator
>>>
>>> sh:minExclusive    add dash:hasMinExclusive as a value for
>>> sh:inversePropertyValidator
>>>
>>> sh:minInclusive    add dash:hasMinInclusive as a value for
>>> sh:inversePropertyValidator
>>>
>>> sh:uniqueLang add
>>>     sh:inversePropertyValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "Language {?lang} used more than once" ;
>>>         sh:sparql """
>>>          SELECT DISTINCT $this ($this AS ?object) $predicate ?lang
>>>          WHERE {
>>>              {
>>>                  FILTER ($uniqueLang) .
>>>              }
>>>              ?value $predicate $this .
>>>              BIND (lang(?value) AS ?lang) .
>>>              FILTER (bound(?lang) && ?lang != \"\") .
>>>              FILTER EXISTS {
>>>                  $this $predicate ?otherValue .
>>>                  FILTER (?otherValue != ?value && ?lang = lang(?otherValue)) .
>>>              }
>>>          }
>>>          """ ;
>>>       ] ;
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:message "A language used more than once on node" ;
>>>         sh:sparql """
>>>          SELECT $this
>>>          WHERE { FILTER ( 1 = 0 )
>>>          }
>>>          """ ;
>>>       ] ;
>>>
>>> sh:qualifiedMinCount add
>>>     sh:nodeValidator [
>>>         rdf:type sh:SPARQLSelectValidator ;
>>>         sh:sparql """
>>>          SELECT $this ($this AS ?subject) $predicate ?count ?failure
>>>          WHERE {
>>>              BIND (sh:hasShape(?subject, $valueShape, $shapesGraph) AS
>>> ?hasShape) .
>>>              BIND (!bound(?hasShape) AS ?failure) .
>>>              FILTER IF(?failure, true, ?count > IF(?hasShape,1,0))
>>>          }
>>> """ ;
>>>       ] ;
>>>
>>> sh:qualifiedMaxCount similar
>>>
>>>
>>> Note that none of these are difficult to do, particularly when looking at
>>> the another validator for the same component.  This should be true for any
>>> constraint component that can be described as working on the value nodes.  I
>>> think that all constraint components should be describable this way.
>>>
>>>
>>> peter
>>>
>>

Received on Saturday, 4 June 2016 00:15:24 UTC