Re: SHACL target extension from Håvard Ottestad on 2020-06-05 (public-shacl@w3.org from June 2020)

From: Håvard Ottestad <hmottestad@gmail.com>
Date: Fri, 5 Jun 2020 10:45:39 +0200
To: Holger Knublauch <holger@topquadrant.com>
Cc: Public Shacl W3C <public-shacl@w3.org>
Message-Id: <BF410851-E9DA-47CF-AE72-9D78C0F442C5@gmail.com>
For us it’s about being able to provide the flexibility of SPARQL-based targets without the performance penalties. Shape-based targets would at worst be as slow as SPARQL-based targets, but the best-case performance is miles ahead.
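
As a rough illustration of the two styles, here is the Norwegian example from further down the thread written both ways. This is only a sketch: the shape names ex:NorwegianShapeA and ex:NorwegianShapeB are made up for this comparison, the ex: ontology is assumed to declare its prefix via sh:declare, and the second form uses the syntax proposed in this thread, which is not part of any spec.

```
# SPARQL-based target (SHACL-AF): the target is expressed as a query string
ex:NorwegianShapeA a sh:NodeShape;
  sh:target [a sh:SPARQLTarget;
    sh:prefixes ex: ;
    sh:select """
      SELECT ?this
      WHERE { ?this ex:nationality ex:Norway }
    """;
  ];
  sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
.

# Shape-based target (proposed): the target is itself a SHACL shape
ex:NorwegianShapeB a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
  ];
  sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
.
```

Both select the same focus nodes. The difference is only in what the engine gets to inspect: a query string in the first case, ordinary SHACL constructs in the second.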

Same goes for SPARQL-based constraints.

Håvard

> On 5 Jun 2020, at 10:32, Holger Knublauch <holger@topquadrant.com> wrote:
> 
> 
> 
> 
> On 5/06/2020 18:00, Håvard Ottestad wrote:
>> We are planning on generating a single SPARQL query for that case. We haven't started working on this yet. Our plan is to have two approaches: we analyze the transaction and estimate the cost of a "transactional" validation and a "full SPARQL" validation and run whichever is faster for that particular transaction.
>> 
>> Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or sh:targetClass rdfs:Resource for that matter? How would these be optimized better than a target shape representing "all objects that match a regex pattern"?
> Yes that would be just as slow. I am not advocating that. Just wanted to point out that there are risks of offering features that are too overpowering and thus very difficult for us implementers to cover properly.
> 
> Holger
> 
> 
> 
>> 
>> Håvard
>> 
>> On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch <holger@topquadrant.com> wrote:
>>> 
>>> On 5/06/2020 17:02, Håvard Ottestad wrote:
>>>> >Ok, does this apply to the case where you have a target shape and want 
>>>> to find all nodes in the graph that conform?
>>>> 
>>>> Yes. All those are trivial as target shapes. 
>>>> 
>>>> For the example below the data added by the user becomes the starting point
>>> Ok, that's of course easier because you already have a small subset of nodes. But then we are not talking about the same use case. What happens if you need to run a full validation of the whole graph? E.g. someone puts a sh:pattern on rdfs:label and there are, quite realistically, millions of labels in the database?
>>> 
>>> Holger
>>> 
>>> 
>>> 
>>>> for the validation. A target is either added in this transaction, in which case we retrieve all its foaf:age paths and validate those. Or a path is added to an existing target, in which case we have a node to start on (the subject of the path).
>>>> 
>>>> ex:CompanyShape a sh:NodeShape;
>>>>   sh:target [a sh:NodeShape;
>>>>     sh:nodeKind sh:IRI;
>>>>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>>>>   ]; 
>>>>   sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
>>>> .
>>>> 
>>>> Håvard
>>>> 
>>>> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch <holger@topquadrant.com> wrote:
>>>>> 
>>>>> On 5/06/2020 15:49, Håvard Ottestad wrote:
>>>>> > Hi,
>>>>> >
>>>>> > Just a quick response, performance-wise.
>>>>> >
>>>>> > SPARQL targets are very slow because the RDF4J ShaclSail can’t analyze a transaction to decide what to validate. Shape based targets on the other hand can be used to generate a validation plan that utilizes the changeset of the transaction to only validate a small subset of the data.
>>>>> The design of named SPARQL targets means that if a name gets established 
>>>>> (e.g. as a de-facto standard) then an engine may hard-code it. However, 
>>>>> the SPARQL remains as a fallback.
>>>>> >
>>>>> > The more complex the target shape the larger this subset becomes and the more data needs to be considered.
>>>>> >
>>>>> > For us sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang are actually trivial to validate when used in single predicate path shapes.
>>>>> 
>>>>> Ok, does this apply to the case where you have a target shape and want 
>>>>> to find all nodes in the graph that conform?
>>>>> 
>>>>> Holger
>>>>> 
>>>>> 
>>>>> >
>>>>> > We are currently supporting sh:hasValue, sh:or, sh:and, sh:property and sh:path as long as the effective path is a single predicate (so no nested sh:property).
>>>>> >
>>>>> > Håvard
>>>>> >
>>>>> >> On 5 Jun 2020, at 04:30, Holger Knublauch <holger@topquadrant.com> wrote:
>>>>> >>
>>>>> >> Hi Vladimir,
>>>>> >>
>>>>> >> from a specification point of view I see no show stoppers to introducing such a mechanism. I would however introduce a new property instead of sh:target, because the meaning of sh:target would otherwise be overloaded and it is possible for targets to also be sh:NodeShapes in which case the result will be very surprising. So, IMHO it should be something like sh:targetShape (or the earlier, verbose sh:targetNodesConforming).
>>>>> >>
>>>>> >> From a practical point of view, I remain very nervous about performance implications. It will be too easy for users to produce some really inefficient scenarios where any implementation almost certainly must iterate over all nodes in the whole graph. E.g. sh:targetShape [ sh:datatype xsd:string ] requires walking through all existing objects in the graph, likewise something with sh:languageIn or sh:pattern.
>>>>> >>
>>>>> >> If we offer such a feature then we may invite disappointment from users, and statements such as "SHACL is slow". Sometimes less is more. Note that any sh:targetShape statement means that even a simple check such as "is node N in the target of S" requires iterating over all sh:targetShapes each time. This can be very expensive.
>>>>> >>
>>>>> >> The implementation cost of this feature is significant, because it requires the implementation of an "inverse validation" algorithm.  Validation starts with a focus node and returns a result. The inverse would start with the shape and has to discover the valid focus nodes. For example, in the case of sh:targetShape [ sh:class X ; sh:property [ sh:path p ; sh:hasValue Z ] ] an algorithm has the choice between first looping over all instances of X and then checking if they have Z or vice versa. Yes, it's an opportunity for developing interesting algorithms, and such an inverse validation algorithm would be beneficial and interesting for many use cases anyway. I personally can at the moment not commit time for such an algorithm so I would, in order to fulfill such a spec, introduce a painfully slow brute-force algorithm. Other implementers may be in the same boat, raising the bar for implementers significantly.
>>>>> >>
>>>>> >> Meanwhile, SPARQL-based targets already exist, and give users control over how efficiently an implementation can evaluate them. For example, such a target could just be "SELECT ?this WHERE { ?this ex:nationality ex:Norwegian }" and any off-the-shelf SPARQL engine can be used to evaluate that.
>>>>> >>
>>>>> >> So while I agree with the use case, and the fact that this might be more direct than sh:filterShape (which has its own problems), I am quite nervous that we are over-promising here.
>>>>> >>
>>>>> >> Do you guys already have implementations of such inverse validation algorithms?
>>>>> >>
>>>>> >> ---
>>>>> >>
>>>>> >> Here is another thought: looking through the Core constraint types, I guess most of them are hard to execute in the inverse order: sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, and also the XYcount ones all basically require walking through all subjects and objects in the graph. However, the following are quite easy to invert:
>>>>> >>
>>>>> >> - sh:class (= sh:targetClass)
>>>>> >> - sh:hasValue
>>>>> >> - sh:in
>>>>> >>
>>>>> >> So what if we simply introduce a new target type sh:targetHasValue V where the targets can be identified by a direct look-up. For example
>>>>> >>
>>>>> >> ex:KiwiShape
>>>>> >>      sh:targetHasValue [
>>>>> >>          sh:path ex:nationality ;
>>>>> >>          sh:hasValue ex:NewZealand ;
>>>>> >>      ] ; ...
>>>>> >>
>>>>> >> which amounts to asking ?this ex:nationality ex:NewZealand which is super fast and covers both sh:hasValue and (to a lesser extent) sh:in use cases. In fact, such a thing can be easily expressed as a SHACL-SPARQL target type already, and the syntax could be
>>>>> >>
>>>>> >> ex:KiwiShape
>>>>> >>      sh:target [
>>>>> >>          a dash:HasValueTarget ;
>>>>> >>          dash:predicate ex:nationality ;
>>>>> >>          dash:value ex:NewZealand ;
>>>>> >>      ] ; ...
>>>>> >>
>>>>> >> and the underlying SPARQL query would be
>>>>> >>
>>>>> >> SELECT ?this
>>>>> >> WHERE {
>>>>> >>      ?this $predicate $value .
>>>>> >> }
>>>>> >>
>>>>> >> This wouldn't cover all use cases mentioned here, but at least covers the hasValue scenario, and nothing new needs to be implemented or added to the spec.
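>>>>> >>
>>>>> >> A declaration of such a target type might look roughly like the following. This is only a sketch using the standard sh:SPARQLTargetType mechanism from SHACL-AF, not the actual DASH definition; the parameter constraints shown are assumptions:
>>>>> >>
>>>>> >> ```
>>>>> >> dash:HasValueTarget
>>>>> >>      a sh:SPARQLTargetType ;
>>>>> >>      rdfs:subClassOf sh:Target ;
>>>>> >>      # each parameter becomes a pre-bound variable named after the local name of its path
>>>>> >>      sh:parameter [ sh:path dash:predicate ; sh:nodeKind sh:IRI ] ;
>>>>> >>      sh:parameter [ sh:path dash:value ] ;
>>>>> >>      sh:select """
>>>>> >>          SELECT ?this
>>>>> >>          WHERE {
>>>>> >>              ?this $predicate $value .
>>>>> >>          }
>>>>> >>          """ .
>>>>> >> ```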
>>>>> >>
>>>>> >> Holger
>>>>> >>
>>>>> >>
>>>>> >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>>>>> >>> Hi everyone! (This email is formatted as markdown)
>>>>> >>>
>>>>> >>> I have 2 objections to earlier proposals:
>>>>> >>> - According to https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>>>>> >>>    `sh:filterShape` is always used with `$this` as seed and `sh:nodes` as generator.
>>>>> >>>    So I don't think it can be used for our case.
>>>>> >>> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a disconnected manner
>>>>> >>>    (the former with just marker classes, the latter to carry the actual target shape)
>>>>> >>>
>>>>> >>> I thought more about what Holger called `sh:targetNodesConforming`, and I think what we need already exists: target by `NodeShape`.
>>>>> >>> So I think we only need to add a new subsection of https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.
>>>>> >>>
>>>>> >>>> Keeping sh:AllSubjects and sh:AllObjects separate would offer more flexibility too
>>>>> >>> Both subjects and objects are Nodes in the graph.
>>>>> >>> I think `NodeShape` already gives us enough flexibility to select one or the other
>>>>> >>> (there are 2 related examples below: selecting by IRI pattern, and selecting langString literals).
>>>>> >>> Just like we don't have distinct `SubjectNodeShape` vs `ObjectNodeShape`,
>>>>> >>> I don't think we need such distinction for targeting either.
>>>>> >>>
>>>>> >>> Below is a proposal for such new subsection, please comment.
>>>>> >>>
>>>>> >>> # NodeShape Targets
>>>>> >>>
>>>>> >>> Sometimes it is useful to find nodes by shape, and then validate them using another shape.
>>>>> >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:MyNodeShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      <NodeShape constructs for target>
>>>>> >>>    ];
>>>>> >>>    <NodeShape constructs for validation>
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> In the following subsections we show several examples of this design.
>>>>> >>>
>>>>> >>> ## Target by Property and Object
>>>>> >>>
>>>>> >>> Norwegians must have one norwegianID:
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:NorwegianShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>>>>> >>>    ];
>>>>> >>>    sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ## Target Namespace Instances
>>>>> >>>
>>>>> >>> All instances in a given namespace must have a certain shape:
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:CompanyShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:nodeKind sh:IRI;
>>>>> >>>      sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>>>>> >>>    ];
>>>>> >>>    sh:class ex:Company;
>>>>> >>>    sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" "enterprise")];
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ## Target All langStrings
>>>>> >>>
>>>>> >>> All langStrings must have one of a predefined set of languages:
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:langStringShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:datatype rdf:langString;
>>>>> >>>    ];
>>>>> >>>    sh:languageIn ("en" "bg");
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ## Target By Cardinality
>>>>> >>>
>>>>> >>> Let's say a person Steve is very popular, so everyone who knows at least three people must know Steve:
>>>>> >>> ```
>>>>> >>> ex:PersonShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:property [sh:path foaf:knows; sh:minCount 3];
>>>>> >>>    ];
>>>>> >>>    sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ## Semantic Type Discrimination
>>>>> >>>
>>>>> >>> In some datasets, instances are not discriminated by `rdf:type` alone, but also by other traits.
>>>>> >>> Often more than one check needs to be performed.
>>>>> >>>
>>>>> >>> Eg in Geonames, all instances have type `gn:Feature`, and are further discriminated by `gn:featureCode`.
>>>>> >>> That's a 2-level classification of some 650 codes that includes everything from continents to mountains to pipelines to hotels.
>>>>> >>>
>>>>> >>> Imagine that you're interested only in countries and top-level administrative divisions (states, provinces and the like).
>>>>> >>> - A bunch of codes correspond to the concept "country"
>>>>> >>> - Countries have `gn:countryCode`
>>>>> >>> - Only the code `gn:ADM1` corresponds to top-level administrative divisions
>>>>> >>> - Administrative divisions have `gn:parentCountry`
>>>>> >>> (This does not describe all Geonames fields, only the ones that we need.)
>>>>> >>>
>>>>> >>> ```
>>>>> >>> gn:Feature a sh:NodeShape, rdfs:Class;
>>>>> >>>    # implicit: sh:targetClass gn:Feature;
>>>>> >>>    sh:property [sh:path gn:name;         sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>>>>> >>>    sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>>>>> >>>    sh:property [sh:path gn:featureCode;  sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>>
>>>>> >>> ex:CountryShape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:class gn:Feature;
>>>>> >>>      sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>>>>> >>>    ];
>>>>> >>>    sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>>
>>>>> >>> ex:ADM1Shape a sh:NodeShape;
>>>>> >>>    sh:target [a sh:NodeShape;
>>>>> >>>      sh:class gn:Feature;
>>>>> >>>      sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>>>>> >>>    ];
>>>>> >>>    sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ## Targeting and Reference Shapes
>>>>> >>>
>>>>> >>> In the last example we stated that `gn:parentCountry` must point to something that satisfies `ex:CountryShape`.
>>>>> >>> This means that every time we validate `ex:ADM1Shape`, we need to validate its country (together with the country-specific properties).
>>>>> >>> So the validation of ADM1 must recurse into validation of Country.
>>>>> >>>
>>>>> >>> This is not always convenient since it's hard to control this recursive process.
>>>>> >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its regions, we'd have a recursive shape and the result would be undefined.
>>>>> >>>
>>>>> >>> It may therefore be more convenient to check only the **existence** of Country from ADM1,
>>>>> >>> and rely on some other process to check the validity of Country.
>>>>> >>> We could do it like this:
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:CountryReferenceShape a sh:NodeShape;
>>>>> >>>    sh:class gn:Feature;
>>>>> >>>    sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>>>>> >>> .
>>>>> >>>
>>>>> >>> ex:CountryShape a sh:NodeShape;
>>>>> >>>    sh:target ex:CountryReferenceShape;
>>>>> >>>    sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>>
>>>>> >>> ex:ADM1ReferenceShape a sh:NodeShape;
>>>>> >>>    sh:class gn:Feature;
>>>>> >>>    sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>>>>> >>> .
>>>>> >>>
>>>>> >>> ex:ADM1Shape a sh:NodeShape;
>>>>> >>>    sh:target ex:ADM1ReferenceShape;
>>>>> >>>    sh:property [sh:path gn:parentCountry; sh:node ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>> ```
>>>>> >>>
>>>>> >>> The significant change is in the last line: ADM1 checks `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>>>>> >>> And we reuse `ex:CountryReferenceShape` as both:
>>>>> >>> - Existence check in `ex:ADM1Shape`
>>>>> >>> - Targeting shape in `ex:CountryShape`
>>>>> >>>
>>>>> >>> ## Politicians and Parties
>>>>> >>>
>>>>> >>> Let's say every Party has at least one Politician,
>>>>> >>> every Politician belongs to exactly one Party (ok, that is unrealistic),
>>>>> >>> politicians are defined by a combination of `rdf:type` and `dc:type`,
>>>>> >>> and both Parties and Politicians adhere to one of two politics (Democrat vs Republican).
>>>>> >>>
>>>>> >>> If we model this with two shapes that refer to each other, we'd have recursive shapes.
>>>>> >>> So again we use two shapes for every entity:
>>>>> >>> - A "smaller" ReferenceShape that just checks existence in terms of "semantic type discrimination"
>>>>> >>> - A "bigger" Shape that checks all other properties of the instance, and uses the ReferenceShape for targeting
>>>>> >>>
>>>>> >>> This eliminates the recursion.
>>>>> >>>
>>>>> >>> ```
>>>>> >>> ex:PoliticianReferenceShape a sh:NodeShape;
>>>>> >>>    sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>>>>> >>>    sh:property [sh:path dc:type; sh:hasValue "politician"];
>>>>> >>> .
>>>>> >>> ex:PoliticianShape a sh:NodeShape;
>>>>> >>>    sh:target ex:PoliticianReferenceShape;
>>>>> >>>    sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>>>>> >>>    sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>>>>> >>> .
>>>>> >>> ex:PartyReferenceShape a sh:NodeShape;
>>>>> >>>    sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>>>>> >>>    sh:property [sh:path dc:type; sh:hasValue "political party"];
>>>>> >>> .
>>>>> >>> ex:PartyShape a sh:NodeShape;
>>>>> >>>    sh:target ex:PartyReferenceShape;
>>>>> >>>    sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>>>>> >>>    sh:property [sh:path ex:politician; sh:node ex:PoliticianReferenceShape; sh:minCount 1];
>>>>> >>> .
>>>>> >>> ```
Received on Friday, 5 June 2020 08:45:57 UTC