Re: SHACL target extension from Håvard Ottestad on 2020-06-05 (public-shacl@w3.org from June 2020)

From: Håvard Ottestad <hmottestad@gmail.com>
Date: Fri, 5 Jun 2020 10:00:26 +0200
To: Holger Knublauch <holger@topquadrant.com>
Cc: Public Shacl W3C <public-shacl@w3.org>
Message-ID: <CAEKmdN0Wto40Bt=UUYey8knt962Gs0X-580_k4=_Xs6Ftdu50g@mail.gmail.com>
We are planning on generating a single SPARQL query for that case. We
haven't started working on this yet. Our plan is to have two approaches: we
analyze the transaction and estimate the cost of a "transactional"
validation and a "full SPARQL" validation and run whichever is faster for
that particular transaction.
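
A minimal sketch of that selection step, purely for illustration. The cost
model, the `Estimate` fields and the strategy names are hypothetical and are
not taken from the ShaclSail code:

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    changed_statements: int  # statements added or removed by the transaction
    graph_statements: int    # statements in the whole store
    lookups_per_change: int  # assumed store lookups needed per changed statement


def choose_strategy(e: Estimate) -> str:
    # Hypothetical cost model: transactional validation pays per changed
    # statement, full validation pays one pass over the whole graph.
    transactional_cost = e.changed_statements * e.lookups_per_change
    full_cost = e.graph_statements
    return "transactional" if transactional_cost <= full_cost else "full-sparql"
```

With these assumptions a small transaction against a large store picks the
transactional plan, while a bulk load that touches a large share of the store
falls back to the single SPARQL query.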

Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or
sh:targetClass rdfs:Resource for that matter? How would these be optimized
better than a target shape representing "all objects that match a regex
pattern"?

Håvard

On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch <holger@topquadrant.com>
wrote:

>
> On 5/06/2020 17:02, Håvard Ottestad wrote:
>
> >Ok, does this apply to the case where you have a target shape and want
> to find all nodes in the graph that conform?
>
> Yes. All those are trivial as target shapes.
>
> For the example below the data added by the user becomes the starting
> point
>
> Ok, that's of course easier because you already have a small subset of
> nodes. But then we are not talking about the same use case. What happens if
> you need to run the full validation of the full graph? E.g. someone puts a
> sh:pattern on rdfs:label and there are (which is realistic) millions of
> labels in the database?
>
> Holger
>
>
> for the validation. A target is either added in this transaction, in which
> case we retrieve all its foaf:age paths and validate those. Or a path is
> added to an existing target, in which case we have a node to start on (the
> subject of the path).
>
> ex:CompanyShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:nodeKind sh:IRI;
>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>   ];
>   sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
> .
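
The two cases described above (a target added in the transaction, or a
foaf:age path added to an existing target) can be sketched roughly as
follows. This is an illustration under simplifying assumptions: triples are
plain string tuples, only subjects are considered as targets, and
`focus_nodes` is a made-up helper name:

```python
import re

FOAF_AGE = "http://xmlns.com/foaf/0.1/age"
# Same pattern as in ex:CompanyShape; nodes are modelled as IRI strings.
COMPANY = re.compile(r"^https://company-graph.ontotext.com/resource/company/")


def is_target(node: str) -> bool:
    # Stand-in for the target shape: an IRI matching the pattern.
    return COMPANY.search(node) is not None


def focus_nodes(added, existing):
    """Return the nodes to validate, given sets of (s, p, o) triples."""
    existing_subjects = {s for s, _, _ in existing}
    nodes = set()
    for s, p, _ in added:
        if not is_target(s):
            continue
        if s not in existing_subjects:
            nodes.add(s)  # case 1: the target itself is new in this transaction
        elif p == FOAF_AGE:
            nodes.add(s)  # case 2: a foaf:age path was added to an existing target
    return nodes
```

Only the statements of the transaction are iterated; the existing data is
consulted by lookup, never scanned.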
>
> Håvard
>
> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch <holger@topquadrant.com>
> wrote:
>
>>
>> On 5/06/2020 15:49, Håvard Ottestad wrote:
>> > Hi,
>> >
>> > Just a quick response performance wise.
>> >
>> > SPARQL targets are very slow because the RDF4J ShaclSail can’t analyze
>> a transaction to decide what to validate. Shape based targets on the other
>> hand can be used to generate a validation plan that utilizes the changeset
>> of the transaction to only validate a small subset of the data.
>>
>> The design of named SPARQL targets means that if a name gets established
>> (e.g. as a de-facto standard) then an engine may hard-code it. However,
>> the SPARQL remains as a fallback.
>> >
>> > The more complex the target shape the larger this subset becomes and
>> the more data needs to be considered.
>> >
>> > For us sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc,
>> sh:pattern, sh:languageIn, sh:uniqueLang are actually trivial to validate
>> when used in single predicate path shapes.
>>
>> Ok, does this apply to the case where you have a target shape and want
>> to find all nodes in the graph that conform?
>>
>> Holger
>>
>>
>> >
>> > We are currently supporting sh:hasValue, sh:or, sh:and, sh:property and
>> sh:path as long as the effective path is a single predicate (so no nested
>> sh:property).
>> >
>> > Håvard
>> >
>> >> On 5 Jun 2020, at 04:30, Holger Knublauch <holger@topquadrant.com>
>> wrote:
>> >>
>> >> Hi Vladimir,
>> >>
>> >> from a specification point of view I see no show stoppers to
>> introducing such a mechanism. I would however introduce a new property
>> instead of sh:target, because the meaning of sh:target would otherwise be
>> overloaded and it is possible for targets to also be sh:NodeShapes in which
>> case the result will be very surprising. So, IMHO it should be something
>> like sh:targetShape (or the earlier, verbose sh:targetNodesConforming).
>> >>
>> >>  From a practical point of view, I remain very nervous about
>> performance implications. It will be too easy for users to produce some
>> really inefficient scenarios where any implementation almost certainly must
>> iterate over all nodes in the whole graph. E.g. sh:targetShape [ sh:datatype
>> xsd:string ] requires walking through all existing objects in the graph,
>> likewise something with sh:languageIn or sh:pattern.
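
To make the cost concrete, here is a rough sketch of the brute-force
evaluation of such a target. The data model is an assumption for the
example: IRIs are strings, literals are (lexical form, datatype) tuples, and
`brute_force_targets` is a hypothetical name:

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def is_xsd_string(node) -> bool:
    # Stand-in for the target shape [ sh:datatype xsd:string ].
    return isinstance(node, tuple) and node[1] == XSD_STRING


def brute_force_targets(triples, conforms):
    """Collect every subject and object, then test each one against the shape."""
    nodes = set()
    for s, _, o in triples:
        nodes.add(s)
        nodes.add(o)
    return {n for n in nodes if conforms(n)}
```

The work grows with the size of the graph regardless of how few nodes end up
conforming.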
>> >>
>> >> If we offer such a feature then we may invite disappointment from
>> users, and statements such as "SHACL is slow". Sometimes less is more. Note
>> that any sh:targetShape statement means that even a simple check such as
>> "is node N in the target of S" requires iterating over all sh:targetShapes
>> each time. This can be very expensive.
>> >>
>> >> The implementation cost of this feature is significant, because it
>> requires the implementation of an "inverse validation" algorithm.
>> Validation starts with a focus node and returns a result. The inverse would
>> start with the shape and has to discover the valid focus nodes. For
>> example, in the case of sh:targetShape [ sh:class X ; sh:property [ sh:path
>> p ; sh:hasValue Z ] ] an algorithm has the choice between first looping
>> over all instances of X and then checking if they have Z or vice versa.
>> Yes, it's an opportunity for developing interesting algorithms, and such an
>> inverse validation algorithm would be beneficial and interesting for many
>> use cases anyway. I personally can at the moment not commit time for such
>> an algorithm so I would, in order to fulfill such a spec, introduce a
>> painfully slow brute-force algorithm. Other implementers may be in the same
>> boat, raising the bar for implementers significantly.
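
The choice of evaluation order mentioned above can be sketched as follows.
This assumes a (predicate, object) index is available, ignores subclass
inference for sh:class, and uses made-up helper names:

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_index(triples):
    """Map (predicate, object) to the set of subjects having that statement."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index


def targets(index, cls, path, value):
    """Nodes conforming to [ sh:class cls ; sh:property [ sh:path path ; sh:hasValue value ] ]."""
    instances = index.get((RDF_TYPE, cls), set())
    holders = index.get((path, value), set())
    # Loop over the smaller candidate set and probe the larger one.
    small, large = sorted((instances, holders), key=len)
    return {node for node in small if node in large}
```

Both orders give the same answer; the only difference is how many candidates
are visited, which is why the order matters for performance.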
>> >>
>> >> Meanwhile, SPARQL-based targets already exist, and give users control
>> over how efficiently the implementation can evaluate them. For
>> example, such a target could just be "SELECT ?this WHERE { ?this
>> ex:nationality ex:Norwegian }" and any off-the-shelf SPARQL engine can be
>> used to evaluate that.
>> >>
>> >> So while I agree with the use case, and the fact that this might be
>> more direct than sh:filterShape (which has its own problems), I am quite
>> nervous that we are over-promising here.
>> >>
>> >> Do you guys already have implementations of such inverse validation
>> algorithms?
>> >>
>> >> ---
>> >>
>> >> Here is another thought: looking through the Core constraint types, I
>> guess most of them are hard to execute in the inverse order: sh:datatype,
>> sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern,
>> sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, and also the
>> XYcount ones all basically require walking through all subjects and objects
>> in the graph. However, the following are quite easy to invert:
>> >>
>> >> - sh:class (= sh:targetClass)
>> >> - sh:hasValue
>> >> - sh:in
>> >>
>> >> So what if we simply introduce a new target type sh:targetHasValue V
>> where the targets can be identified by a direct look-up. For example
>> >>
>> >> ex:KiwiShape
>> >>      sh:targetHasValue [
>> >>          sh:path ex:nationality ;
>> >>          sh:hasValue ex:NewZealand ;
>> >>      ] ; ...
>> >>
>> >> which amounts to asking ?this ex:nationality ex:NewZealand which is
>> super fast and covers both sh:hasValue and (to lesser extent) sh:in use
>> cases. In fact, such a thing can be easily expressed as a SHACL-SPARQL
>> target type already, and the syntax could be
>> >>
>> >> ex:KiwiShape
>> >>      sh:target [
>> >>          a dash:HasValueTarget ;
>> >>          dash:predicate ex:nationality ;
>> >>          dash:value ex:NewZealand ;
>> >>      ] ; ...
>> >>
>> >> and the underlying SPARQL query would be
>> >>
>> >> SELECT ?this
>> >> WHERE {
>> >>      ?this $predicate $value .
>> >> }
>> >>
>> >> This wouldn't cover all use cases mentioned here, but at least covers
>> the hasValue scenario, and nothing new needs to be implemented or added to
>> the spec.
>> >>
>> >> Holger
>> >>
>> >>
>> >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>> >>> Hi everyone! (This email is formatted as markdown)
>> >>>
>> >>> I have 2 objections to earlier proposals:
>> >>> - According to
>> https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>> >>>    `sh:filterShape` is always used with `$this` as seed and
>> `sh:nodes` as generator.
>> >>>    So I don't think it can be used for our case.
>> >>> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a
>> disconnected manner
>> >>>    (the former with just marker classes, the latter to carry the
>> actual target shape)
>> >>>
>> >>> I thought more about what Holger called `sh:targetNodesConforming`,
>> and I think what we need already exists: target by `NodeShape`.
>> >>> So I think we only need to add a new subsection of
>> https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.
>> >>>
>> >>>> Separating sh:AllSubjects and sh:AllObjects separately would offer
>> more flexibility too
>> >>> Both subjects and objects are Nodes in the graph.
>> >>> I think `NodeShape` already gives us enough flexibility to select one
>> or the other
>> >>> (there are 2 related examples below: selecting by IRI pattern, and
>> selecting langString literals).
>> >>> Just like we don't have distinct `SubjectNodeShape` vs
>> `ObjectNodeShape`,
>> >>> I don't think we need such distinction for targeting either.
>> >>>
>> >>> Below is a proposal for such new subsection, please comment.
>> >>>
>> >>> # NodeShape Targets
>> >>>
>> >>> Sometimes it is useful to find nodes by shape, and then validate them
>> using another shape.
>> >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>> >>>
>> >>> ```
>> >>> ex:MyNodeShape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      <NodeShape constructs for target>
>> >>>    ];
>> >>>    <NodeShape constructs for validation>
>> >>> .
>> >>> ```
>> >>>
>> >>> In the following subsections we show several examples of this design.
>> >>>
>> >>> ## Target by Property and Object
>> >>>
>> >>> Norwegians must have one norwegianID:
>> >>>
>> >>> ```
>> >>> ex:NorwegianShape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>> >>>    ];
>> >>>    sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target Namespace Instances
>> >>>
>> >>> All instances in a given namespace must have a certain shape:
>> >>>
>> >>> ```
>> >>> ex:CompanyShape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:nodeKind sh:IRI;
>> >>>      sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>> >>>    ];
>> >>>    sh:class ex:Company;
>> >>>    sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" "enterprise")];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target All langStrings
>> >>>
>> >>> All langStrings must have one of a predefined set of languages:
>> >>>
>> >>> ```
>> >>> ex:langStringShape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:datatype rdf:langString;
>> >>>    ];
>> >>>    sh:languageIn ("en" "bg");
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target By Cardinality
>> >>>
>> >>> Let's say a person Steve is very popular, so everyone who knows at
>> least three people must know Steve:
>> >>> ```
>> >>> ex:Personshape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:property [sh:path foaf:knows; sh:minCount 3];
>> >>>    ];
>> >>>    sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Semantic Type Discrimination
>> >>>
>> >>> In some datasets, instances are not discriminated by `rdf:type`
>> alone, but also by other traits.
>> >>> Often more than one check needs to be performed.
>> >>>
>> >>> Eg in Geonames, all instances have type `gn:Feature`, and are further
>> discriminated by `gn:featureCode`.
>> >>> That's a 2-level classification of some 650 codes that includes
>> everything from continents to mountains to pipelines to hotels.
>> >>>
>> >>> Imagine that you're interested only in countries and top-level
>> administrative divisions (states, provinces and the like).
>> >>> - A bunch of codes correspond to the concept "country"
>> >>> - Countries have `gn:countryCode`
>> >>> - Only the code `gn:ADM1` corresponds to top-level administrative
>> divisions
>> >>> - Administrative divisions have `gn:parentCountry`
>> >>> (This does not describe all Geonames fields, only the ones that we
>> need.)
>> >>>
>> >>> ```
>> >>> gn:Feature a sh:NodeShape, rdfs:Class;
>> >>>    # implicit: sh:targetClass gn:Feature;
>> >>>    sh:property [sh:path gn:name;         sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>>    sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI;     sh:minCount 1; sh:maxCount 1];
>> >>>    sh:property [sh:path gn:featureCode;  sh:nodeKind sh:IRI;     sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:class gn:Feature;
>> >>>      sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>>    ];
>> >>>    sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>    sh:target [a sh:NodeShape;
>> >>>      sh:class gn:Feature;
>> >>>      sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>>    ];
>> >>>    sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Targeting and Reference Shapes
>> >>>
>> >>> In the last example we stated that `gn:parentCountry` must point to
>> something that satisfies `ex:CountryShape`.
>> >>> This means that every time we validate `ex:ADM1Shape`, we need to
>> validate its country (together with the country-specific properties).
>> >>> So the validation of ADM1 must recurse into validation of Country.
>> >>>
>> >>> This is not always convenient since it's hard to control this
>> recursive process.
>> >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its
>> regions, we'd have a recursive shape and the result would be undefined.
>> >>>
>> >>> It may therefore be more convenient to check only the **existence**
>> of Country from ADM1,
>> >>> and rely on some other process to check the validity of Country.
>> >>> We could do it like this:
>> >>>
>> >>> ```
>> >>> ex:CountryReferenceShape a sh:NodeShape;
>> >>>    sh:class gn:Feature;
>> >>>    sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>    sh:target ex:CountryReferenceShape;
>> >>>    sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1ReferenceShape a sh:NodeShape;
>> >>>    sh:class gn:Feature;
>> >>>    sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>    sh:target ex:ADM1ReferenceShape;
>> >>>    sh:property [sh:path gn:parentCountry; sh:node ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> The significant change is in the last line: ADM1 checks
>> `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>> >>> And we reuse `ex:CountryReferenceShape` as both:
>> >>> - Existence check in `ex:ADM1Shape`
>> >>> - Targeting shape in `ex:CountryShape`
>> >>>
>> >>> ## Politicians and Parties
>> >>>
>> >>> Let's say every Party has at least one Politician,
>> >>> every Politician belongs to exactly one Party (ok, that is
>> unrealistic),
>> >>> politicians are defined by a combination of `rdf:type` and `dc:type`,
>> >>> and both Parties and Politicians adhere to one of two politics
>> (Democrat vs Republican).
>> >>>
>> >>> If we model this with two shapes that refer to each other, we'd have
>> recursive shapes.
>> >>> So again we use two shapes for every entity:
>> >>> - A "smaller" ReferenceShape that just checks existence in terms of
>> "semantic type discrimination"
>> >>> - A "bigger" Shape that checks all other properties of the instance,
>> and uses the ReferenceShape for targeting
>> >>>
>> >>> This eliminates the recursion.
>> >>>
>> >>> ```
>> >>> ex:PoliticianReferenceShape a sh:NodeShape;
>> >>>    sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>> >>>    sh:property [sh:path dc:type; sh:hasValue "politician"];
>> >>> .
>> >>> ex:PoliticianShape a sh:NodeShape;
>> >>>    sh:target ex:PoliticianReferenceShape;
>> >>>    sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>    sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ex:PartyReferenceShape a sh:NodeShape;
>> >>>    sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>> >>>    sh:property [sh:path dc:type; sh:hasValue "political party"];
>> >>> .
>> >>> ex:PartyShape a sh:NodeShape;
>> >>>    sh:target ex:PartyReferenceShape;
>> >>>    sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>    sh:property [sh:path ex:politician; sh:node ex:PoliticianReferenceShape; sh:minCount 1];
>> >>> .
>> >>> ```
>>
>
Received on Friday, 5 June 2020 08:00:53 UTC