Re: SHACL target extension from Holger Knublauch on 2020-06-05 (public-shacl@w3.org from June 2020)

From: Holger Knublauch <holger@topquadrant.com>
Date: Fri, 5 Jun 2020 12:30:11 +1000
To: public-shacl@w3.org
Message-ID: <b32b813b-1db2-d60f-97c7-39f53b70a11d@topquadrant.com>
Hi Vladimir,

from a specification point of view I see no show stoppers to introducing 
such a mechanism. I would however introduce a new property instead of 
sh:target, because the meaning of sh:target would otherwise be 
overloaded and it is possible for targets to also be sh:NodeShapes in 
which case the result will be very surprising. So, IMHO it should be 
something like sh:targetShape (or the earlier, verbose 
sh:targetNodesConforming).

 From a practical point of view, I remain very nervous about performance 
implications. It will be too easy for users to produce some really 
inefficient scenarios where any implementation almost certainly must 
iterate over all nodes in the whole graph.E.g. sh:targetShape [ 
sh:datatype xsd:string ] requires walking through all existing objects 
in the graph, likewise something with sh:languageIn or sh:pattern.

If we offer such a feature then we may invite disappointment from users, 
and statements such as "SHACL is slow". Sometimes less is more. Note 
that any sh:targetShape statement means that even a simple check such as 
"is node N in the target of S" requires iterating over all 
sh:targetShapes each time. This can be very expensive.

The implementation cost of this feature is significant, because it 
requires the implementation of an "inverse validation" algorithm.  
Validation starts with a focus node and returns a result. The inverse 
would start with the shape and has to discover the valid focus nodes. 
For example, in the case of sh:targetShape [ sh:class X ; sh:property [ 
sh:path p ; sh:hasValue Z ] ] an algorithm has the choice between first 
looping over all instances of X and then checking if they have Z or vice 
versa. Yes, it's an opportunity for developing interesting algorithms, 
and such an inverse validation algorithm would be beneficial and 
interesting for many use cases anyway. I personally can at the moment 
not commit time for such an algorithm so I would, in order to fulfill 
such a spec, introduce a painfully slow brute-force algorithm. Other 
implementers may be in the same boat, raising the bar for implementers 
significantly.

Meanwhile, SPARQL-based targets already exist, and give users control 
over how efficient the implementation will be able to understand them. 
For example, such a target could just be "SELECT ?this WHERE { ?this 
ex:nationality ex:Norwagian }" and any off-the-shelf SPARQL engine can 
be used to evaluate that.

So while I agree with the use case, and the fact that this might be more 
direct than sh:filterShape (which has its own problems), I am quite 
nervous that we are over-promising here.

Do you guys already have implementations of such inverse validation 
algorithms?

---

Here is another thought: looking through the Core constraint types, I 
guess most of them are hard to execute in the inverse order: 
sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, 
sh:pattern, sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, 
and also the XYcount ones all basically require walking through all 
subjects and objects in the graph. However, the following are quite easy 
to revert:

- sh:class (= sh:targetClass)
- sh:hasValue
- sh:in

So what if we simply introduce a new target type sh:targetHasValue V 
where the targets can be identified by a direct look-up. For example

ex:KiwiShape
     sh:targetHasValue [
         sh:path ex:nationality ;
         sh:hasValue ex:NewZealand ;
     ] ; ...

which amounts to asking ?this ex:nationality ex:NewZealand which is 
super fast and covers both sh:hasValue and (to lesser extent) sh:in use 
cases. In fact, such a thing can be easily expressed as a SHACL-SPARQL 
target type already, and the syntax could be

ex:KiwiShape
     sh:target [
         a dash:HasValueTarget ;
         dash:predicate ex:nationality ;
         dash:value ex:NewZealand ;
     ] ; ...

and the underlying SPARQL query would be

SELECT ?this
WHERE {
     ?this $predicate $value .
}

This wouldn't cover all use cases mentioned here, but at least covers 
the hasValue scenario, and nothing new needs to be implemented or added 
to the spec.

Holger


On 4/06/2020 19:31, Vladimir Alexiev wrote:
> Hi everyone! (This email is formatted as markdown)
>
> I have 2 objections to earlier proposals:
> - According to 
> https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>   `sh:filterShape` is always used with `$this` as seed and `sh:nodes` 
> as generator.
>   So I don't think it can be used for our case.
> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a 
> disconnected manner
>   (the former with just marker classes, the latter to carry the actual 
> target shape)
>
> I thought more about what Holger called `sh:targetNodesConforming`, 
> and I think what we need already exists: target by `NodeShape`.
> So I think we only need to add a new subsection of 
> https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.
>
> > Separating sh:AllSubjects and sh:AllObjects separately would offer 
> more flexibility too
>
> Both subjects and objects are Nodes in the graph.
> I think `NodeShape` already gives us enough flexibility to select one 
> or the other
> (there are 2 related examples below: selecting by IRI pattern, and 
> selecting langString literals).
> Just like we don't have distinct `SubjectNodeShape` vs `ObjectNodeShape`,
> I don't think we need such distinction for targeting either.
>
> Below is a proposal for such new subsection, please comment.
>
> # NodeShape Targets
>
> Sometimes it is useful to find nodes by shape, and then validate them 
> using another shape.
> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>
> ```
> ex:MyNodeShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     <NodeShape constructs for target>
>   ];
>   <NodeShape constructs for validation>
> .
> ```
>
> In the following subsections we show several examples of this design.
>
> ## Target by Property and Object
>
> Norwegians must have one norwegianID:
>
> ```
> ex:NorwegianShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>   ];
>   sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
> .
> ```
>
> ## Target Namespace Instances
>
> All instances in a given namespace must have a certain shape:
>
> ```
> ex:CompanyShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:nodeKind sh:IRI;
>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>   ];
>   sh:class ex:Company;
>   sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" 
> "enterprise")];
> .
> ```
>
> ## Target All langStrings
>
> All langStrings must have one of a predefind set of languages:
>
> ```
> ex:langStringShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:datatype rdf:langString;
>   ];
>   sh:languageIn ("en" "bg");
> .
> ```
>
> ## Target By Cardinality
>
> Let's say a person Steve is very popular, so everyone who knows at 
> least three people must know Steve:
> ```
> ex:Personshape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:property [sh:path foaf:knows; sh:minCount 3];
>   ];
>   sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
> .
> ```
>
> ## Semantic Type Discrimination
>
> In some datasets, instances are not discriminated by `rdf:type` alone, 
> but also by other traits.
> Often more than one check needs to be performed.
>
> Eg in Geonames, all instances have type `gn:Feature`, and are further 
> discriminated by `gn:featureCode`.
> That's a 2-level classification of some 650 codes that includes 
> everything from continents to mountains to pipelines to hotels.
>
> Imagine that you're interested only in countries and top-level 
> administrative divisions (states, provinces and the like).
> - A bunch of codes correspond to the concept "country"
> - Countries have `gn:countryCode`
> - Only the code `gn:ADM1` corresponds to top-level administrative 
> divisions
> - Administrative divisions have `gn:parentCountry`
> (This does not describe all Geonames fields, only the ones that we need.)
>
> ```
> gn:Feature a sh:NodeShape, rdf:Class;
>   # implicit: sh:targetClass gn:Feature;
>   sh:property [sh:path gn:name;         sh:datatype xsd:string; 
> sh:minCount 1; sh:maxCount 1];
>   sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI; 
> sh:minCount 1; sh:maxCount 1];
>   sh:property [sh:path gn:featureCode;  sh:nodeKind sh:IRI; 
> sh:minCount 1; sh:maxCount 1];
> .
>
> ex:CountryShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:class gn:Feature;
>     sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD 
> gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>   ];
>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; 
> sh:minCount 1; sh:maxCount 1];
> .
>
> ex:ADM1Shape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:class gn:Feature;
>     sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>   ];
>   sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; 
> sh:minCount 1; sh:maxCount 1];
> .
> ```
>
> ## Targeting and Reference Shapes
>
> In the last example we stated that `gn:parentCountry` must point to 
> something that satisfies `ex:CountryShape`.
> This means that every time we validate `ex:ADM1Shape`, we need to 
> validate its country (together with the country-specific properties).
> So the validation of ADM1 must recurse into validation of Country.
>
> This is not always convenient since it's hard to control this 
> recursive process.
> Furthermore, if Country referred back to `ex:ADM1Shape` of its 
> regions, we'd have a recursive shape and the result would be undefined.
>
> It may therefore be more convenient to check only the **existence** of 
> Country from ADM1,
> and depend that some other process will check the validity of Country.
> We could do it like this:
>
> ```
> ex:CountryReferenceShape a sh:NodeShape;
>   sh:class gn:Feature;
>   sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD 
> gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
> .
>
> ex:CountryShape a sh:NodeShape;
>   sh:target ex:CountryReferenceShape;
>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; 
> sh:minCount 1; sh:maxCount 1];
> .
>
> ex:ADM1ReferenceShape a sh:NodeShape;
>   sh:class gn:Feature;
>   sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
> .
>
> ex:ADM1Shape a sh:NodeShape;
>   sh:target ex:ADM1ReferenceShape;
>   sh:property [sh:path gn:parentCountry; sh:node 
> ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
> .
> ```
>
> The significant change is in the last line: ADM1 checks 
> `ex:CountryReferenceShape` rather than `ex:CountryShape`.
> And we reuse `ex:CountryReferenceShape` as both:
> - Existence check in `ex:ADM1Shape`
> - Targeting shape in `ex:CountryShape`
>
> ## Politicians and Parties
>
> Let's say every Party has at least one Politician,
> every Politician belongs to exactly one Party (ok, that is unrealistic),
> politicians are defined by a combination of `rdf:type` and `dc:type`,
> and both Parties and Politicians adhere to one of two politics 
> (Democrat vs Republican).
>
> If we model this with two shapes that refer to each other, we'd have 
> recursive shapes.
> So again we use two shapes for every entity:
> - A "smaller" ReferenceShape that just checks existence in terms of 
> "semantic type discrimination"
> - A "bigger" Shape that checks all other properties of the instance, 
> and uses the ReferenceShape for targeting
>
> This eliminates the recursion.
>
> ```
> ex:PoliticianReferenceShape a sh:NodeShape;
>   sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>   sh:property [sh:path dc:type; sh:hasValue "politician"];
> .
> ex:PoliticianShape a sh:NodeShape;
>   sh:target ex:PoliticianReferenceShape;
>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>   sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; 
> sh:minCount 1; sh:maxCount 1];
> .
> ex:PartyReference a sh:NodeShape;
>   sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>   sh:property [sh:path dc:type; sh:hasValue "political party"];
> .
> ex:PartyShape a sh:NodeShape;
>   sh:target ex:PartyReferenceShape;
>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>   sh:property [sh:path ex:politician; sh:node 
> ex:PoliticianReferenceShape; sh:minCount 1];
> .
> ```
Received on Friday, 5 June 2020 02:30:31 UTC