Re: SHACL target extension from Holger Knublauch on 2020-06-05 (public-shacl@w3.org from June 2020)

From: Holger Knublauch <holger@topquadrant.com>
Date: Fri, 5 Jun 2020 18:32:27 +1000
To: Håvard Ottestad <hmottestad@gmail.com>
Cc: Public Shacl W3C <public-shacl@w3.org>
Message-ID: <16bfce22-0106-1d92-1d6b-6cc2ec01faa0@topquadrant.com>
On 5/06/2020 18:00, Håvard Ottestad wrote:
> We are planning on generating a single SPARQL query for that case. We 
> haven't started working on this yet. Our plan is to have two 
> approaches: we analyze the transaction and estimate the cost of a 
> "transactional" validation and a "full SPARQL" validation and run 
> whichever is faster for that particular transaction.
>
> Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or 
> sh:targetClass rdfs:Resource for that matter? How would these be 
> optimized better than a target shape representing "all objects that 
> match a regex pattern"?

Yes, that would be just as slow, and I am not advocating it. I just wanted
to point out the risk of offering features that are too powerful and thus
very difficult for us implementers to cover properly.

Holger


>
> Håvard
>
> On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch 
> <holger@topquadrant.com> wrote:
>
>
>     On 5/06/2020 17:02, Håvard Ottestad wrote:
>>     > Ok, does this apply to the case where you have a target shape
>>     > and want to find all nodes in the graph that conform?
>>
>>     Yes. All those are trivial as target shapes.
>>
>>     For the example below the data added by the user becomes the
>>     starting point
>
>     Ok, that's of course easier because you already have a small
>     subset of nodes. But then we are not talking about the same use
>     case. What happens if you need to run the full validation of the
>     full graph? E.g. someone puts a sh:pattern on rdfs:label and there
>     are (which is realistic) millions of labels in the database?
>
>     Holger
>
>
>>     for the validation. A target is either added in this transaction,
>>     in which case we retrieve all its foaf:age paths and validate
>>     those. Or a path is added to an existing target, in which case we
>>     have a node to start on (the subject of the path).
>>
>>     ex:CompanyShape a sh:NodeShape;
>>       sh:target [a sh:NodeShape;
>>         sh:nodeKind sh:IRI;
>>         sh:pattern
>>     "^https://company-graph.ontotext.com/resource/company/";
>>       ];
>>       sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
>>     .
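A minimal sketch of the transactional approach described above, in Python with plain tuples rather than RDF4J (this is not the ShaclSail implementation). It assumes IRIs are strings and literals are hypothetical (lexical, datatype) pairs; all function names are made up for illustration. Because this target shape only inspects the node itself (IRI plus pattern), any matching node mentioned in the added triples can serve as a starting point:

```python
import re

FOAF_AGE = "http://xmlns.com/foaf/0.1/age"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# Pattern copied from the target shape in ex:CompanyShape.
TARGET_PATTERN = re.compile("^https://company-graph.ontotext.com/resource/company/")


def is_target(node):
    """sh:nodeKind sh:IRI plus sh:pattern (IRIs are str, literals are tuples)."""
    return isinstance(node, str) and TARGET_PATTERN.search(node) is not None


def focus_nodes(added):
    """Target nodes mentioned as subject or object in the added triples."""
    nodes = set()
    for s, _p, o in added:
        nodes.update(n for n in (s, o) if is_target(n))
    return nodes


def validate_transaction(graph, added):
    """Check sh:datatype xsd:integer on foaf:age, only for touched targets."""
    full = graph | added
    violations = set()
    for node in focus_nodes(added):
        for s, p, o in full:
            if s == node and p == FOAF_AGE:
                if not (isinstance(o, tuple) and o[1] == XSD_INTEGER):
                    violations.add((node, o))
    return violations
```

Nodes that match the pattern but are not mentioned in the transaction are never visited, which is where the saving over a full-graph run would come from.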
>>
>>     Håvard
>>
>>     On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch
>>     <holger@topquadrant.com> wrote:
>>
>>
>>         On 5/06/2020 15:49, Håvard Ottestad wrote:
>>         > Hi,
>>         >
>>         > Just a quick response, performance-wise.
>>         >
>>         > SPARQL targets are very slow because the RDF4J ShaclSail
>>         can’t analyze a transaction to decide what to validate. Shape
>>         based targets on the other hand can be used to generate a
>>         validation plan that utilizes the changeset of the
>>         transaction to only validate a small subset of the data.
>>
>>         The design of named SPARQL targets means that if a name gets
>>         established
>>         (e.g. as a de-facto standard) then an engine may hard-code
>>         it. However,
>>         the SPARQL remains as a fallback.
>>         >
>>         > The more complex the target shape the larger this subset
>>         becomes and the more data needs to be considered.
>>         >
>>         > For us sh:datatype, sh:nodeKind, sh:minExclusive etc,
>>         sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang
>>         are actually trivial to validate when used in single
>>         predicate path shapes.
>>
>>         Ok, does this apply to the case where you have a target shape
>>         and want
>>         to find all nodes in the graph that conform?
>>
>>         Holger
>>
>>
>>         >
>>         > We are currently supporting sh:hasValue, sh:or, sh:and,
>>         sh:property and sh:path as long as the effective path is a
>>         single predicate (so no nested sh:property).
>>         >
>>         > Håvard
>>         >
>>         >> On 5 Jun 2020, at 04:30, Holger Knublauch
>>         <holger@topquadrant.com> wrote:
>>         >>
>>         >> Hi Vladimir,
>>         >>
>>         >> from a specification point of view I see no show stoppers
>>         to introducing such a mechanism. I would however introduce a
>>         new property instead of sh:target, because the meaning of
>>         sh:target would otherwise be overloaded and it is possible
>>         for targets to also be sh:NodeShapes in which case the result
>>         will be very surprising. So, IMHO it should be something like
>>         sh:targetShape (or the earlier, verbose
>>         sh:targetNodesConforming).
>>         >>
>>         >>  From a practical point of view, I remain very nervous
>>         about performance implications. It will be too easy for users
>>         to produce some really inefficient scenarios where any
>>         implementation almost certainly must iterate over all nodes
>>         in the whole graph. E.g. sh:targetShape [ sh:datatype
>>         xsd:string ] requires walking through all existing objects in
>>         the graph, likewise something with sh:languageIn or sh:pattern.
>>         >>
>>         >> If we offer such a feature then we may invite
>>         disappointment from users, and statements such as "SHACL is
>>         slow". Sometimes less is more. Note that any sh:targetShape
>>         statement means that even a simple check such as "is node N
>>         in the target of S" requires iterating over all
>>         sh:targetShapes each time. This can be very expensive.
>>         >>
>>         >> The implementation cost of this feature is significant,
>>         because it requires the implementation of an "inverse
>>         validation" algorithm.  Validation starts with a focus node
>>         and returns a result. The inverse would start with the shape
>>         and has to discover the valid focus nodes. For example, in
>>         the case of sh:targetShape [ sh:class X ; sh:property [
>>         sh:path p ; sh:hasValue Z ] ] an algorithm has the choice
>>         between first looping over all instances of X and then
>>         checking if they have Z or vice versa. Yes, it's an
>>         opportunity for developing interesting algorithms, and such
>>         an inverse validation algorithm would be beneficial and
>>         interesting for many use cases anyway. I personally can at
>>         the moment not commit time for such an algorithm so I would,
>>         in order to fulfill such a spec, introduce a painfully slow
>>         brute-force algorithm. Other implementers may be in the same
>>         boat, raising the bar for implementers significantly.
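A minimal sketch of the ordering choice mentioned above for sh:targetShape [ sh:class X ; sh:property [ sh:path p ; sh:hasValue Z ] ], in Python over plain string triples. It is an illustration under assumptions, not an existing implementation: the index layout and function names are hypothetical, and subclass inference for sh:class is ignored. The idea is to look up both candidate sets and start from whichever is smaller:

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_index(triples):
    """Index subjects by (predicate, object) so look-ups avoid a graph scan."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index


def find_targets(index, cls, path, value):
    """Nodes that are instances of `cls` and have `value` for `path`."""
    instances = index.get((RDF_TYPE, cls), set())
    holders = index.get((path, value), set())
    # Loop over the smaller candidate set and probe the larger one.
    small, large = sorted((instances, holders), key=len)
    return {node for node in small if node in large}
```

Constraints such as sh:datatype or sh:pattern have no comparable index to start from, which is why they fall back to walking all nodes.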
>>         >>
>>         >> Meanwhile, SPARQL-based targets already exist, and give
>>         users control over how efficiently an implementation will be
>>         able to evaluate them. For example, such a target could
>>         just be "SELECT ?this WHERE { ?this ex:nationality
>>         ex:Norwegian }" and any off-the-shelf SPARQL engine can be
>>         used to evaluate that.
>>         >>
>>         >> So while I agree with the use case, and the fact that this
>>         might be more direct than sh:filterShape (which has its own
>>         problems), I am quite nervous that we are over-promising here.
>>         >>
>>         >> Do you guys already have implementations of such inverse
>>         validation algorithms?
>>         >>
>>         >> ---
>>         >>
>>         >> Here is another thought: looking through the Core
>>         constraint types, I guess most of them are hard to execute in
>>         the inverse order: sh:datatype, sh:nodeKind, sh:minExclusive
>>         etc, sh:minLength etc, sh:pattern, sh:languageIn,
>>         sh:uniqueLang, sh:lessThan etc, sh:closed, and also the
>>         XYcount ones all basically require walking through all
>>         subjects and objects in the graph. However, the following are
>>         quite easy to invert:
>>         >>
>>         >> - sh:class (= sh:targetClass)
>>         >> - sh:hasValue
>>         >> - sh:in
>>         >>
>>         >> So what if we simply introduce a new target type
>>         sh:targetHasValue V where the targets can be identified by a
>>         direct look-up. For example
>>         >>
>>         >> ex:KiwiShape
>>         >>      sh:targetHasValue [
>>         >>          sh:path ex:nationality ;
>>         >>          sh:hasValue ex:NewZealand ;
>>         >>      ] ; ...
>>         >>
>>         >> which amounts to asking ?this ex:nationality ex:NewZealand
>>         which is super fast and covers both sh:hasValue and (to
>>         lesser extent) sh:in use cases. In fact, such a thing can be
>>         easily expressed as a SHACL-SPARQL target type already, and
>>         the syntax could be
>>         >>
>>         >> ex:KiwiShape
>>         >>      sh:target [
>>         >>          a dash:HasValueTarget ;
>>         >>          dash:predicate ex:nationality ;
>>         >>          dash:value ex:NewZealand ;
>>         >>      ] ; ...
>>         >>
>>         >> and the underlying SPARQL query would be
>>         >>
>>         >> SELECT ?this
>>         >> WHERE {
>>         >>      ?this $predicate $value .
>>         >> }
>>         >>
>>         >> This wouldn't cover all use cases mentioned here, but at
>>         least covers the hasValue scenario, and nothing new needs to
>>         be implemented or added to the spec.
>>         >>
>>         >> Holger
>>         >>
>>         >>
>>         >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>>         >>> Hi everyone! (This email is formatted as markdown)
>>         >>>
>>         >>> I have 2 objections to earlier proposals:
>>         >>> - According to
>>         https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>>         >>>    `sh:filterShape` is always used with `$this` as seed
>>         and `sh:nodes` as generator.
>>         >>>    So I don't think it can be used for our case.
>>         >>> - It seems wrong to me to use `sh:target` and
>>         `sh:filterShape` in a disconnected manner
>>         >>>    (the former with just marker classes, the latter to
>>         carry the actual target shape)
>>         >>>
>>         >>> I thought more about what Holger called
>>         `sh:targetNodesConforming`, and I think what we need already
>>         exists: target by `NodeShape`.
>>         >>> So I think we only need to add a new subsection of
>>         https://www.w3.org/TR/shacl-af/#targets but no new classes or
>>         properties.
>>         >>>
>>         >>>> Separating sh:AllSubjects and sh:AllObjects separately
>>         would offer more flexibility too
>>         >>> Both subjects and objects are Nodes in the graph.
>>         >>> I think `NodeShape` already gives us enough flexibility
>>         to select one or the other
>>         >>> (there are 2 related examples below: selecting by IRI
>>         pattern, and selecting langString literals).
>>         >>> Just like we don't have distinct `SubjectNodeShape` vs
>>         `ObjectNodeShape`,
>>         >>> I don't think we need such distinction for targeting either.
>>         >>>
>>         >>> Below is a proposal for such new subsection, please comment.
>>         >>>
>>         >>> # NodeShape Targets
>>         >>>
>>         >>> Sometimes it is useful to find nodes by shape, and then
>>         validate them using another shape.
>>         >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>>         >>>
>>         >>> ```
>>         >>> ex:MyNodeShape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      <NodeShape constructs for target>
>>         >>>    ];
>>         >>>    <NodeShape constructs for validation>
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> In the following subsections we show several examples of
>>         this design.
>>         >>>
>>         >>> ## Target by Property and Object
>>         >>>
>>         >>> Norwegians must have one norwegianID:
>>         >>>
>>         >>> ```
>>         >>> ex:NorwegianShape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:property [sh:path ex:nationality; sh:hasValue
>>         ex:Norway];
>>         >>>    ];
>>         >>>    sh:property [sh:path ex:norwegianID; sh:minCount 1;
>>         sh:maxCount 1];
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> ## Target Namespace Instances
>>         >>>
>>         >>> All instances in a given namespace must have a certain shape:
>>         >>>
>>         >>> ```
>>         >>> ex:CompanyShape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:nodeKind sh:IRI;
>>         >>>      sh:pattern
>>         "^https://company-graph.ontotext.com/resource/company/";
>>         >>>    ];
>>         >>>    sh:class ex:Company;
>>         >>>    sh:property [sh:path dc:type; sh:in ("conglomerate"
>>         "collective" "enterprise")];
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> ## Target All langStrings
>>         >>>
>>         >>> All langStrings must have one of a predefined set of
>>         languages:
>>         >>>
>>         >>> ```
>>         >>> ex:langStringShape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:datatype rdf:langString;
>>         >>>    ];
>>         >>>    sh:languageIn ("en" "bg");
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> ## Target By Cardinality
>>         >>>
>>         >>> Let's say a person Steve is very popular, so everyone who
>>         knows at least three people must know Steve:
>>         >>> ```
>>         >>> ex:Personshape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:property [sh:path foaf:knows; sh:minCount 3];
>>         >>>    ];
>>         >>>    sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>>         >>> .
>>         >>> ```
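A minimal sketch of how the cardinality example above splits into a targeting step and a validation step, in Python over plain string triples. The function name and the prefixed-name strings are assumptions for illustration only. Note that the targeting step has to count foaf:knows values for every subject that has any, so it is one of the cases with no direct look-up:

```python
from collections import defaultdict

FOAF_KNOWS = "foaf:knows"
STEVE = "ex:Steve"


def validate_popular_steve(triples, min_known=3):
    """Return nodes that know at least `min_known` people but not ex:Steve."""
    known = defaultdict(set)
    for s, p, o in triples:
        if p == FOAF_KNOWS:
            known[s].add(o)
    # Targeting step: sh:minCount counts distinct value nodes of foaf:knows.
    targets = {node for node, people in known.items() if len(people) >= min_known}
    # Validation step: sh:hasValue ex:Steve on the same path.
    return {node for node in targets if STEVE not in known[node]}
```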
>>         >>>
>>         >>> ## Semantic Type Discrimination
>>         >>>
>>         >>> In some datasets, instances are not discriminated by
>>         `rdf:type` alone, but also by other traits.
>>         >>> Often more than one check needs to be performed.
>>         >>>
>>         >>> E.g. in Geonames, all instances have type `gn:Feature`, and
>>         are further discriminated by `gn:featureCode`.
>>         >>> That's a 2-level classification of some 650 codes that
>>         includes everything from continents to mountains to pipelines
>>         to hotels.
>>         >>>
>>         >>> Imagine that you're interested only in countries and
>>         top-level administrative divisions (states, provinces and the
>>         like).
>>         >>> - A bunch of codes correspond to the concept "country"
>>         >>> - Countries have `gn:countryCode`
>>         >>> - Only the code `gn:ADM1` corresponds to top-level
>>         administrative divisions
>>         >>> - Administrative divisions have `gn:parentCountry`
>>         >>> (This does not describe all Geonames fields, only the
>>         ones that we need.)
>>         >>>
>>         >>> ```
>>         >>> gn:Feature a sh:NodeShape, rdfs:Class;
>>         >>>    # implicit: sh:targetClass gn:Feature;
>>         >>>    sh:property [sh:path gn:name;  sh:datatype xsd:string;
>>         sh:minCount 1; sh:maxCount 1];
>>         >>>    sh:property [sh:path gn:featureClass; sh:nodeKind
>>         sh:IRI; sh:minCount 1; sh:maxCount 1];
>>         >>>    sh:property [sh:path gn:featureCode; sh:nodeKind
>>         sh:IRI; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>>
>>         >>> ex:CountryShape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:class gn:Feature;
>>         >>>      sh:property [sh:path gn:featureCode; sh:in
>>         (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR
>>         gn:A.PCLF)];
>>         >>>    ];
>>         >>>    sh:property [sh:path gn:countryCode; sh:datatype
>>         xsd:string; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>>
>>         >>> ex:ADM1Shape a sh:NodeShape;
>>         >>>    sh:target [a sh:NodeShape;
>>         >>>      sh:class gn:Feature;
>>         >>>      sh:property [sh:path gn:featureCode; sh:hasValue
>>         gn:ADM1];
>>         >>>    ];
>>         >>>    sh:property [sh:path gn:parentCountry; sh:node
>>         ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> ## Targeting and Reference Shapes
>>         >>>
>>         >>> In the last example we stated that `gn:parentCountry`
>>         must point to something that satisfies `ex:CountryShape`.
>>         >>> This means that every time we validate `ex:ADM1Shape`, we
>>         need to validate its country (together with the
>>         country-specific properties).
>>         >>> So the validation of ADM1 must recurse into validation of
>>         Country.
>>         >>>
>>         >>> This is not always convenient since it's hard to control
>>         this recursive process.
>>         >>> Furthermore, if Country referred back to `ex:ADM1Shape`
>>         of its regions, we'd have a recursive shape and the result
>>         would be undefined.
>>         >>>
>>         >>> It may therefore be more convenient to check only the
>>         **existence** of Country from ADM1,
>>         >>> and rely on some other process to check the
>>         validity of Country.
>>         >>> We could do it like this:
>>         >>>
>>         >>> ```
>>         >>> ex:CountryReferenceShape a sh:NodeShape;
>>         >>>    sh:class gn:Feature;
>>         >>>    sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI
>>         gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>>         >>> .
>>         >>>
>>         >>> ex:CountryShape a sh:NodeShape;
>>         >>>    sh:target ex:CountryReferenceShape;
>>         >>>    sh:property [sh:path gn:countryCode; sh:datatype
>>         xsd:string; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>>
>>         >>> ex:ADM1ReferenceShape a sh:NodeShape;
>>         >>>    sh:class gn:Feature;
>>         >>>    sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>>         >>> .
>>         >>>
>>         >>> ex:ADM1Shape a sh:NodeShape;
>>         >>>    sh:target ex:ADM1ReferenceShape;
>>         >>>    sh:property [sh:path gn:parentCountry; sh:node
>>         ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>> ```
>>         >>>
>>         >>> The significant change is in the last line: ADM1 checks
>>         `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>>         >>> And we reuse `ex:CountryReferenceShape` as both:
>>         >>> - Existence check in `ex:ADM1Shape`
>>         >>> - Targeting shape in `ex:CountryShape`
>>         >>>
>>         >>> ## Politicians and Parties
>>         >>>
>>         >>> Let's say every Party has at least one Politician,
>>         >>> every Politician belongs to exactly one Party (ok, that
>>         is unrealistic),
>>         >>> politicians are defined by a combination of `rdf:type`
>>         and `dc:type`,
>>         >>> and both Parties and Politicians adhere to one of two
>>         politics (Democrat vs Republican).
>>         >>>
>>         >>> If we model this with two shapes that refer to each
>>         other, we'd have recursive shapes.
>>         >>> So again we use two shapes for every entity:
>>         >>> - A "smaller" ReferenceShape that just checks existence
>>         in terms of "semantic type discrimination"
>>         >>> - A "bigger" Shape that checks all other properties of
>>         the instance, and uses the ReferenceShape for targeting
>>         >>>
>>         >>> This eliminates the recursion.
>>         >>>
>>         >>> ```
>>         >>> ex:PoliticianReferenceShape a sh:NodeShape;
>>         >>>    sh:property [sh:path rdf:type; sh:in (foaf:Person
>>         dbo:Person)];
>>         >>>    sh:property [sh:path dc:type; sh:hasValue "politician"];
>>         >>> .
>>         >>> ex:PoliticianShape a sh:NodeShape;
>>         >>>    sh:target ex:PoliticianReferenceShape;
>>         >>>    sh:property [sh:path ex:politics; sh:in ("Democrat"
>>         "Republican")];
>>         >>>    sh:property [sh:path ex:party; sh:node
>>         ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>>         >>> .
>>         >>> ex:PartyReferenceShape a sh:NodeShape;
>>         >>>    sh:property [sh:path rdf:type; sh:hasValue
>>         foaf:Organization];
>>         >>>    sh:property [sh:path dc:type; sh:hasValue "political
>>         party"];
>>         >>> .
>>         >>> ex:PartyShape a sh:NodeShape;
>>         >>>    sh:target ex:PartyReferenceShape;
>>         >>>    sh:property [sh:path ex:politics; sh:in ("Democrat"
>>         "Republican")];
>>         >>>    sh:property [sh:path ex:politician; sh:node
>>         ex:PoliticianReferenceShape; sh:minCount 1];
>>         >>> .
>>         >>> ```
>>
Received on Friday, 5 June 2020 08:32:48 UTC