Re: update to SHACL-SPARQL (ISSUE-62)

(Sorry for the long email, this requires detailed examination)

On 6/1/2015 21:14, Peter F. Patel-Schneider wrote:
>
> On 05/31/2015 11:12 PM, Holger Knublauch wrote:
>> On 5/31/2015 0:26, Peter F. Patel-Schneider wrote:
>>>
>>> To make selection by expression work well, the translations to SPARQL
>>> need to be adjusted to make most of them binding.  I have done so for
>>> https://www.w3.org/2014/data-shapes/wiki/Shacl-sparql
>>>
>>> As I also state in
>>> https://www.w3.org/2014/data-shapes/wiki/Shacl-sparql it is possible to
>>> do away with the need for binding by adding some SPARQL that binds
>>> against all nodes in an RDF graph.
>> The latter is basically what I suggest too, only that I would make this
>> fact more explicit. In my suggested design, people would associate their
>> shapes with rdfs:Resource to state that they apply to all resources. (We
>> can discuss whether this means everything with an rdf:type statement or
>> every subject in the graph, but that doesn't matter too much). This
>> ensures that ?this will always be bound, but doesn't prescribe how an
>> engine implements this: if the constraint is already binding then it
>> doesn't need to add the ?this a rdfs:Resource clause prior to execution,
>> leading to exactly the same situation as in your approach.
> Using rdfs:Resource either requires a dependence on RDFS reasoning or a
> special case for this class.  I'm fine with the former, but that might not
> go well with the working group.  I'm not fine with the latter.

If I understand your suggestion correctly, your approach would inject 
code to iterate over all instances of rdfs:Resource to make sure that 
?this is bound, so you seem to rely on the same special case that I do.

A question we need to answer: if a scope does not produce any bindings, 
what should happen?
1) Iterate over all resources in the graph
2) Iterate over all resources that have an rdf:type
3) Iterate over all subjects in the graph

(I would favor option 2 but could live with 3 too)
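
For illustration only (the variable names are mine, and none of this is 
draft text), the clauses an engine could inject for each option might 
look roughly like:

     # Option 1: every IRI or blank node mentioned in the graph
     # (as a subject or as a non-literal object)
     { ?this ?p1 ?o1 } UNION { ?s2 ?p2 ?this . FILTER (!isLiteral(?this)) }

     # Option 2: every resource that has an rdf:type triple
     ?this rdf:type ?anyType .

     # Option 3: every subject in the graph
     ?this ?anyPredicate ?anyObject .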

>
>> However, I believe my approach is cleaner because it handles every case
>> consistently, without having to specify some algorithms that explain in
>> detail how to determine whether a SPARQL query is already binding, and
>> how to inject a binding clause otherwise. It is also more consistent for
>> the case when you invoke the engine to validate a single resource - it
>> would simply walk up the class hierarchy to collect all relevant
>> constraints and wouldn't need to look into some global constraint objects
>> outside of the tree.
> As I indicate above, I don't view this approach as clean, unless you depend
> on RDFS reasoning.   Without RDFS reasoning or having a special case for
> rdfs:Resource, even if you use the approach of walking up rdfs:subClassOf
> links you will only get classes that have an explicit ancestor of
> rdfs:Resource and nodes that have an rdf:type link.

IMHO the situation is a historically inherited mess produced by RDFS and 
OWL. It was a mistake to allow named classes that have no explicit named 
superclass. This always forced some type of inferencing upon everyone, 
and did so completely unnecessarily. Anyway, it's a situation we need to 
live with now.

I believe walking the rdfs:subClassOf triples will also work when a 
class isn't explicitly rooted in rdfs:Resource (e.g. because it only 
roots in owl:Thing and the owl:Thing rdfs:subClassOf rdfs:Resource 
triple isn't there). The validateNode operation [1] currently states

(?focusNode rdf:type/rdfs:subClassOf* ?type)

but this could be generalized to ensure that rdfs:Resource is always 
added to the set of ?types, and possibly owl:Thing for any owl:Class.
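
A rough sketch of such a generalization - the UNION branch and the 
comment are my own illustration, not current spec text - could be:

     { ?focusNode rdf:type/rdfs:subClassOf* ?type }
     UNION
     { BIND (rdfs:Resource AS ?type) }
     # plus, if we want it, a third branch that binds owl:Thing
     # whenever ?focusNode has a type that is an owl:Class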

>
>> Overall I really wonder what use cases would not be covered by my design
>> but yours... We had discussed before that other communities may define
>> their own shape selectors anyway.
> All scoping can be done inside global constraints, so there is no need for
> scoping by expression.  The differences then are mostly stylistic---if you
> want a constraint on nodes that have a filler of :verified for :status, do
> you want to do
>
> [ rdf:type sh:Shape;
>    sh:scope [ sh:property :status ;
>               sh:hasValue :verified ] ;
>    sh:constraint ... ]
>
> or
>
> [ rdf:type sh:Shape;
>    sh:classScope rdfs:Resource ;
>    sh:constraint [sh:or ( [ sh:not [ sh:property :status ;
>                                      sh:hasValue :verified ] ]
>                            ... ) ] ]

It is good to look at examples. The second option above is obviously not 
what we are talking about here, but let's look at a few situations:

ex:MyShape
     sh:scopeShape [
         sh:property [
             sh:predicate ex:status ;
             sh:hasValue ex:verified ;
         ]
     ] ;
     sh:constraint ... something that applies to all things with status=verified ...

The sh:sparql of sh:hasValue is currently basically

     FILTER NOT EXISTS { ?this ?predicate ?hasValue } .

which doesn't bind ?this because it only succeeds for focus nodes that 
do *not* have the given value. However, we could theoretically add an 
inverse expression, e.g.

sh:AbstractHasValuePropertyConstraint
     sh:sparqlSelector """
         ?this ?predicate ?hasValue
     """ .

which could then be injected into the generated SPARQL query to make 
sure that ?this is bound. The obvious downside is that we would be 
forcing every SPARQL template to carry two SPARQL queries, with the 
selector only used in certain (rare?) cases. I don't see how we could 
sell this idea to users!
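
To make this concrete, the scope query generated for the example above 
might then look roughly like this (only a sketch of this rejected idea, 
with ?predicate and ?hasValue substituted from the scope shape; not 
spec text):

     PREFIX ex: <http://example.org/ns#>
     SELECT DISTINCT ?this
     WHERE {
         # injected from sh:sparqlSelector to bind ?this
         ?this ex:status ex:verified .
     }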


Another example:

ex:MyShape
     sh:scopeShape [
         sh:constraint [
             a ex:NamespaceConstraint ;
             ex:namespace "http://example.org/ns#" ;
         ]
     ] ;
     sh:constraint ... something that applies to all resources from the given namespace ...

The sh:sparql behind ex:NamespaceConstraint would be something like

     FILTER STRSTARTS(str(?this), ?namespace) .

for which it is basically impossible to find an inverse selector 
expression, so you would need to fall back to option 1), 2) or 3) 
above, e.g. generate

     ?this a ?anyType .
     FILTER STRSTARTS(str(?this), ?namespace) .

We would end up with an inconsistency: *certain* queries would fall 
back to our 1/2/3 policy, while others don't need the fallback and may 
return completely different bindings for ?this - even resources that 
are not mentioned in the graph at all.
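
As a hypothetical illustration of the latter (ex:localName and the 
constructed IRIs are invented for this example), a scope constraint 
whose sh:sparql is already binding could produce ?this values that 
never appear as nodes in the graph:

     # already binding, so no 1/2/3 fallback clause gets injected
     ?s ex:localName ?name .
     BIND (IRI(CONCAT("http://example.org/ns#", ?name)) AS ?this)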

My preferred design is:

ex:MyShape
     sh:scopeClass rdfs:Resource ;
     sh:filterShape [   # was: sh:scopeShape
         sh:constraint [
             a ex:NamespaceConstraint ;
             ex:namespace "http://example.org/ns#" ;
         ]
     ] ;
     sh:constraint ... something that applies to all resources from the given namespace ...

i.e. enforce the consistent policy that sh:filterShape can only be used 
together with another selector such as sh:scopeClass. Again, smart 
engines can still easily optimize this check away, and they should 
certainly optimize the common cases so that the hasValue matching is 
done first and the rdf:type check last, but that detail is out of scope 
for the spec.
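
For reference, a straightforward (un-optimized) query an engine might 
generate for the example above, using option 2 from earlier, could look 
like this - again just a sketch, not spec text:

     PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
     SELECT DISTINCT ?this
     WHERE {
         # from sh:scopeClass rdfs:Resource (option 2: anything with an rdf:type)
         ?this rdf:type ?anyType .
         # from the filter shape's ex:NamespaceConstraint
         FILTER STRSTARTS(str(?this), "http://example.org/ns#") .
     }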

I believe we will have a much easier job writing the spec and 
explaining the situation to users if we separate scopes from filters. 
Plus, this design covers basically the same use cases as yours - it's 
arguably more a discussion about syntax than about fundamental 
differences.

Holger

[1] http://w3c.github.io/data-shapes/shacl/#operation-validateNode
