Re: SHACL target extension from Holger Knublauch on 2020-06-05 (public-shacl@w3.org from June 2020)

From: Holger Knublauch <holger@topquadrant.com>
Date: Fri, 5 Jun 2020 17:31:12 +1000
To: public-shacl@w3.org
Message-ID: <755ff25e-03e8-60db-9ecc-f2ee24a4e149@topquadrant.com>
On 5/06/2020 17:02, Håvard Ottestad wrote:
> >Ok, does this apply to the case where you have a target shape and want
> to find all nodes in the graph that conform?
>
> Yes. All those are trivial as target shapes.
>
> For the example below the data added by the user becomes the starting 
> point

Ok, that's of course easier because you already have a small subset of 
node. But then we are not talking about the same use case. What happens 
if you need to run the full validation of the full graph? E.g. someone 
puts a sh:pattern on rdfs:label and there are (which is realistic) 
millions of labels in the database?

Holger


> for the validation. A target is either added in this transaction, in 
> which case we retrieve all its foaf:age paths and validate those. Or a 
> path is added to an existing target, in which case we have a node to 
> start on (the subject of the path).
>
> ex:CompanyShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:nodeKind sh:IRI;
>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>   ];
>   sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
> .
>
> Håvard
>
> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch 
> <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>
>
>     On 5/06/2020 15:49, Håvard Ottestad wrote:
>     > Hi,
>     >
>     > Just a quick response performance wise.
>     >
>     > SPARQL targets are very slow because the RDF4J ShaclSail can’t
>     analyze a transaction to decide what to validate. Shape based
>     targets on the other hand can be used to generate a validation
>     plan that utilizes the changeset of the transaction to only
>     validate a small subset of the data.
>     The design of named SPARQL targets means that if a name gets
>     established
>     (e.g. as a de-facto standard) then an engine may hard-code it.
>     However,
>     the SPARQL remains as a fallback.
>     >
>     > The more complex the target shape the larger this subset becomes
>     and the more data needs to be considered.
>     >
>     > For us sh:datatype, sh:nodeKind, sh:minExclusive etc,
>     sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang are
>     actually trivial to validate when used in single predicate path
>     shapes.
>
>     Ok, does this apply to the case where you have a target shape and
>     want
>     to find all nodes in the graph that conform?
>
>     Holger
>
>
>     >
>     > We are currently supporting sh:hasValue, sh:or, sh:and,
>     sh:property and sh:path as long as the effective path is a single
>     predicate (so no nested sh:property).
>     >
>     > Håvard
>     >
>     >> On 5 Jun 2020, at 04:30, Holger Knublauch
>     <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>     >>
>     >> Hi Vladimir,
>     >>
>     >> from a specification point of view I see no show stoppers to
>     introducing such a mechanism. I would however introduce a new
>     property instead of sh:target, because the meaning of sh:target
>     would otherwise be overloaded and it is possible for targets to
>     also be sh:NodeShapes in which case the result will be very
>     surprising. So, IMHO it should be something like sh:targetShape
>     (or the earlier, verbose sh:targetNodesConforming).
>     >>
>     >>  From a practical point of view, I remain very nervous about
>     performance implications. It will be too easy for users to produce
>     some really inefficient scenarios where any implementation almost
>     certainly must iterate over all nodes in the whole graph.E.g.
>     sh:targetShape [ sh:datatype xsd:string ] requires walking through
>     all existing objects in the graph, likewise something with
>     sh:languageIn or sh:pattern.
>     >>
>     >> If we offer such a feature then we may invite disappointment
>     from users, and statements such as "SHACL is slow". Sometimes less
>     is more. Note that any sh:targetShape statement means that even a
>     simple check such as "is node N in the target of S" requires
>     iterating over all sh:targetShapes each time. This can be very
>     expensive.
>     >>
>     >> The implementation cost of this feature is significant, because
>     it requires the implementation of an "inverse validation"
>     algorithm.  Validation starts with a focus node and returns a
>     result. The inverse would start with the shape and has to discover
>     the valid focus nodes. For example, in the case of sh:targetShape
>     [ sh:class X ; sh:property [ sh:path p ; sh:hasValue Z ] ] an
>     algorithm has the choice between first looping over all instances
>     of X and then checking if they have Z or vice versa. Yes, it's an
>     opportunity for developing interesting algorithms, and such an
>     inverse validation algorithm would be beneficial and interesting
>     for many use cases anyway. I personally can at the moment not
>     commit time for such an algorithm so I would, in order to fulfill
>     such a spec, introduce a painfully slow brute-force algorithm.
>     Other implementers may be in the same boat, raising the bar for
>     implementers significantly.
>     >>
>     >> Meanwhile, SPARQL-based targets already exist, and give users
>     control over how efficient the implementation will be able to
>     understand them. For example, such a target could just be "SELECT
>     ?this WHERE { ?this ex:nationality ex:Norwagian }" and any
>     off-the-shelf SPARQL engine can be used to evaluate that.
>     >>
>     >> So while I agree with the use case, and the fact that this
>     might be more direct than sh:filterShape (which has its own
>     problems), I am quite nervous that we are over-promising here.
>     >>
>     >> Do you guys already have implementations of such inverse
>     validation algorithms?
>     >>
>     >> ---
>     >>
>     >> Here is another thought: looking through the Core constraint
>     types, I guess most of them are hard to execute in the inverse
>     order: sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength
>     etc, sh:pattern, sh:languageIn, sh:uniqueLang, sh:lessThan etc,
>     sh:closed, and also the XYcount ones all basically require walking
>     through all subjects and objects in the graph. However, the
>     following are quite easy to revert:
>     >>
>     >> - sh:class (= sh:targetClass)
>     >> - sh:hasValue
>     >> - sh:in
>     >>
>     >> So what if we simply introduce a new target type
>     sh:targetHasValue V where the targets can be identified by a
>     direct look-up. For example
>     >>
>     >> ex:KiwiShape
>     >>      sh:targetHasValue [
>     >>          sh:path ex:nationality ;
>     >>          sh:hasValue ex:NewZealand ;
>     >>      ] ; ...
>     >>
>     >> which amounts to asking ?this ex:nationality ex:NewZealand
>     which is super fast and covers both sh:hasValue and (to lesser
>     extent) sh:in use cases. In fact, such a thing can be easily
>     expressed as a SHACL-SPARQL target type already, and the syntax
>     could be
>     >>
>     >> ex:KiwiShape
>     >>      sh:target [
>     >>          a dash:HasValueTarget ;
>     >>          dash:predicate ex:nationality ;
>     >>          dash:value ex:NewZealand ;
>     >>      ] ; ...
>     >>
>     >> and the underlying SPARQL query would be
>     >>
>     >> SELECT ?this
>     >> WHERE {
>     >>      ?this $predicate $value .
>     >> }
>     >>
>     >> This wouldn't cover all use cases mentioned here, but at least
>     covers the hasValue scenario, and nothing new needs to be
>     implemented or added to the spec.
>     >>
>     >> Holger
>     >>
>     >>
>     >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>     >>> Hi everyone! (This email is formatted as markdown)
>     >>>
>     >>> I have 2 objections to earlier proposals:
>     >>> - According to
>     https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>     >>>    `sh:filterShape` is always used with `$this` as seed and
>     `sh:nodes` as generator.
>     >>>    So I don't think it can be used for our case.
>     >>> - It seems wrong to me to use `sh:target` and `sh:filterShape`
>     in a disconnected manner
>     >>>    (the former with just marker classes, the latter to carry
>     the actual target shape)
>     >>>
>     >>> I thought more about what Holger called
>     `sh:targetNodesConforming`, and I think what we need already
>     exists: target by `NodeShape`.
>     >>> So I think we only need to add a new subsection of
>     https://www.w3.org/TR/shacl-af/#targets but no new classes or
>     properties.
>     >>>
>     >>>> Separating sh:AllSubjects and sh:AllObjects separately would
>     offer more flexibility too
>     >>> Both subjects and objects are Nodes in the graph.
>     >>> I think `NodeShape` already gives us enough flexibility to
>     select one or the other
>     >>> (there are 2 related examples below: selecting by IRI pattern,
>     and selecting langString literals).
>     >>> Just like we don't have distinct `SubjectNodeShape` vs
>     `ObjectNodeShape`,
>     >>> I don't think we need such distinction for targeting either.
>     >>>
>     >>> Below is a proposal for such new subsection, please comment.
>     >>>
>     >>> # NodeShape Targets
>     >>>
>     >>> Sometimes it is useful to find nodes by shape, and then
>     validate them using another shape.
>     >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>     >>>
>     >>> ```
>     >>> ex:MyNodeShape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      <NodeShape constructs for target>
>     >>>    ];
>     >>>    <NodeShape constructs for validation>
>     >>> .
>     >>> ```
>     >>>
>     >>> In the following subsections we show several examples of this
>     design.
>     >>>
>     >>> ## Target by Property and Object
>     >>>
>     >>> Norwegians must have one norwegianID:
>     >>>
>     >>> ```
>     >>> ex:NorwegianShape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>     >>>    ];
>     >>>    sh:property [sh:path ex:norwegianID; sh:minCount 1;
>     sh:maxCount 1];
>     >>> .
>     >>> ```
>     >>>
>     >>> ## Target Namespace Instances
>     >>>
>     >>> All instances in a given namespace must have a certain shape:
>     >>>
>     >>> ```
>     >>> ex:CompanyShape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:nodeKind sh:IRI;
>     >>>      sh:pattern
>     "^https://company-graph.ontotext.com/resource/company/";
>     >>>    ];
>     >>>    sh:class ex:Company;
>     >>>    sh:property [sh:path dc:type; sh:in ("conglomerate"
>     "collective" "enterprise")];
>     >>> .
>     >>> ```
>     >>>
>     >>> ## Target All langStrings
>     >>>
>     >>> All langStrings must have one of a predefind set of languages:
>     >>>
>     >>> ```
>     >>> ex:langStringShape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:datatype rdf:langString;
>     >>>    ];
>     >>>    sh:languageIn ("en" "bg");
>     >>> .
>     >>> ```
>     >>>
>     >>> ## Target By Cardinality
>     >>>
>     >>> Let's say a person Steve is very popular, so everyone who
>     knows at least three people must know Steve:
>     >>> ```
>     >>> ex:Personshape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:property [sh:path foaf:knows; sh:minCount 3];
>     >>>    ];
>     >>>    sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>     >>> .
>     >>> ```
>     >>>
>     >>> ## Semantic Type Discrimination
>     >>>
>     >>> In some datasets, instances are not discriminated by
>     `rdf:type` alone, but also by other traits.
>     >>> Often more than one check needs to be performed.
>     >>>
>     >>> Eg in Geonames, all instances have type `gn:Feature`, and are
>     further discriminated by `gn:featureCode`.
>     >>> That's a 2-level classification of some 650 codes that
>     includes everything from continents to mountains to pipelines to
>     hotels.
>     >>>
>     >>> Imagine that you're interested only in countries and top-level
>     administrative divisions (states, provinces and the like).
>     >>> - A bunch of codes correspond to the concept "country"
>     >>> - Countries have `gn:countryCode`
>     >>> - Only the code `gn:ADM1` corresponds to top-level
>     administrative divisions
>     >>> - Administrative divisions have `gn:parentCountry`
>     >>> (This does not describe all Geonames fields, only the ones
>     that we need.)
>     >>>
>     >>> ```
>     >>> gn:Feature a sh:NodeShape, rdf:Class;
>     >>>    # implicit: sh:targetClass gn:Feature;
>     >>>    sh:property [sh:path gn:name;  sh:datatype xsd:string;
>     sh:minCount 1; sh:maxCount 1];
>     >>>    sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI;
>     sh:minCount 1; sh:maxCount 1];
>     >>>    sh:property [sh:path gn:featureCode; sh:nodeKind sh:IRI;
>     sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>>
>     >>> ex:CountryShape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:class gn:Feature;
>     >>>      sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI
>     gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>     >>>    ];
>     >>>    sh:property [sh:path gn:countryCode; sh:datatype
>     xsd:string; sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>>
>     >>> ex:ADM1Shape a sh:NodeShape;
>     >>>    sh:target [a sh:NodeShape;
>     >>>      sh:class gn:Feature;
>     >>>      sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>     >>>    ];
>     >>>    sh:property [sh:path gn:parentCountry; sh:node
>     ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>> ```
>     >>>
>     >>> ## Targeting and Reference Shapes
>     >>>
>     >>> In the last example we stated that `gn:parentCountry` must
>     point to something that satisfies `ex:CountryShape`.
>     >>> This means that every time we validate `ex:ADM1Shape`, we need
>     to validate its country (together with the country-specific
>     properties).
>     >>> So the validation of ADM1 must recurse into validation of Country.
>     >>>
>     >>> This is not always convenient since it's hard to control this
>     recursive process.
>     >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its
>     regions, we'd have a recursive shape and the result would be
>     undefined.
>     >>>
>     >>> It may therefore be more convenient to check only the
>     **existence** of Country from ADM1,
>     >>> and depend that some other process will check the validity of
>     Country.
>     >>> We could do it like this:
>     >>>
>     >>> ```
>     >>> ex:CountryReferenceShape a sh:NodeShape;
>     >>>    sh:class gn:Feature;
>     >>>    sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI
>     gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>     >>> .
>     >>>
>     >>> ex:CountryShape a sh:NodeShape;
>     >>>    sh:target ex:CountryReferenceShape;
>     >>>    sh:property [sh:path gn:countryCode; sh:datatype
>     xsd:string; sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>>
>     >>> ex:ADM1ReferenceShape a sh:NodeShape;
>     >>>    sh:class gn:Feature;
>     >>>    sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>     >>> .
>     >>>
>     >>> ex:ADM1Shape a sh:NodeShape;
>     >>>    sh:target ex:ADM1ReferenceShape;
>     >>>    sh:property [sh:path gn:parentCountry; sh:node
>     ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>> ```
>     >>>
>     >>> The significant change is in the last line: ADM1 checks
>     `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>     >>> And we reuse `ex:CountryReferenceShape` as both:
>     >>> - Existence check in `ex:ADM1Shape`
>     >>> - Targeting shape in `ex:CountryShape`
>     >>>
>     >>> ## Politicians and Parties
>     >>>
>     >>> Let's say every Party has at least one Politician,
>     >>> every Politician belongs to exactly one Party (ok, that is
>     unrealistic),
>     >>> politicians are defined by a combination of `rdf:type` and
>     `dc:type`,
>     >>> and both Parties and Politicians adhere to one of two politics
>     (Democrat vs Republican).
>     >>>
>     >>> If we model this with two shapes that refer to each other,
>     we'd have recursive shapes.
>     >>> So again we use two shapes for every entity:
>     >>> - A "smaller" ReferenceShape that just checks existence in
>     terms of "semantic type discrimination"
>     >>> - A "bigger" Shape that checks all other properties of the
>     instance, and uses the ReferenceShape for targeting
>     >>>
>     >>> This eliminates the recursion.
>     >>>
>     >>> ```
>     >>> ex:PoliticianReferenceShape a sh:NodeShape;
>     >>>    sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>     >>>    sh:property [sh:path dc:type; sh:hasValue "politician"];
>     >>> .
>     >>> ex:PoliticianShape a sh:NodeShape;
>     >>>    sh:target ex:PoliticianReferenceShape;
>     >>>    sh:property [sh:path ex:politics; sh:in ("Democrat"
>     "Republican")];
>     >>>    sh:property [sh:path ex:party; sh:node
>     ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>     >>> .
>     >>> ex:PartyReference a sh:NodeShape;
>     >>>    sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>     >>>    sh:property [sh:path dc:type; sh:hasValue "political party"];
>     >>> .
>     >>> ex:PartyShape a sh:NodeShape;
>     >>>    sh:target ex:PartyReferenceShape;
>     >>>    sh:property [sh:path ex:politics; sh:in ("Democrat"
>     "Republican")];
>     >>>    sh:property [sh:path ex:politician; sh:node
>     ex:PoliticianReferenceShape; sh:minCount 1];
>     >>> .
>     >>> ```
>
Received on Friday, 5 June 2020 07:31:32 UTC