- From: Håvard Ottestad <hmottestad@gmail.com>
- Date: Fri, 5 Jun 2020 09:02:39 +0200
- To: Holger Knublauch <holger@topquadrant.com>
- Cc: Public Shacl W3C <public-shacl@w3.org>
- Message-ID: <CAEKmdN2ekUutoLA=a0Y318=nJfkJih9WjT=2+t46hVi5jYnbAw@mail.gmail.com>
>Ok, does this apply to the case where you have a target shape and want to find all nodes in the graph that conform? Yes. All those are trivial as target shapes. For the example below the data added by the user becomes the starting point for the validation. A target is either added in this transaction, in which case we retrieve all its foaf:age paths and validate those. Or a path is added to an existing target, in which case we have a node to start on (the subject of the path). ex:CompanyShape a sh:NodeShape; sh:target [a sh:NodeShape; sh:nodeKind sh:IRI; sh:pattern "^https://company-graph.ontotext.com/resource/company/"; ]; sh:property [sh:path foaf:age; sh:datatype xsd:integer ]; . Håvard On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch <holger@topquadrant.com> wrote: > > On 5/06/2020 15:49, Håvard Ottestad wrote: > > Hi, > > > > Just a quick response performance wise. > > > > SPARQL targets are very slow because the RDF4J ShaclSail can’t analyze a > transaction to decide what to validate. Shape based targets on the other > hand can be used to generate a validation plan that utilizes the changeset > of the transaction to only validate a small subset of the data. > The design of named SPARQL targets means that if a name gets established > (e.g. as a de-facto standard) then an engine may hard-code it. However, > the SPARQL remains as a fallback. > > > > The more complex the target shape the larger this subset becomes and the > more data needs to be considered. > > > > For us sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, > sh:pattern, sh:languageIn, sh:uniqueLang are actually trivial to validate > when used in single predicate path shapes. > > Ok, does this apply to the case where you have a target shape and want > to find all nodes in the graph that conform? > > Holger > > > > > > We are currently supporting sh:hasValue, sh:or, sh:and, sh:property and > sh:path as long as the effective path is a single predicate (so no nested > sh:property). > > > > Håvard > > > >> On 5 Jun 2020, at 04:30, Holger Knublauch <holger@topquadrant.com> > wrote: > >> > >> Hi Vladimir, > >> > >> from a specification point of view I see no show stoppers to > introducing such a mechanism. I would however introduce a new property > instead of sh:target, because the meaning of sh:target would otherwise be > overloaded and it is possible for targets to also be sh:NodeShapes in which > case the result will be very surprising. So, IMHO it should be something > like sh:targetShape (or the earlier, verbose sh:targetNodesConforming). > >> > >> From a practical point of view, I remain very nervous about > performance implications. It will be too easy for users to produce some > really inefficient scenarios where any implementation almost certainly must > iterate over all nodes in the whole graph.E.g. sh:targetShape [ sh:datatype > xsd:string ] requires walking through all existing objects in the graph, > likewise something with sh:languageIn or sh:pattern. > >> > >> If we offer such a feature then we may invite disappointment from > users, and statements such as "SHACL is slow". Sometimes less is more. Note > that any sh:targetShape statement means that even a simple check such as > "is node N in the target of S" requires iterating over all sh:targetShapes > each time. This can be very expensive. > >> > >> The implementation cost of this feature is significant, because it > requires the implementation of an "inverse validation" algorithm. > Validation starts with a focus node and returns a result. The inverse would > start with the shape and has to discover the valid focus nodes. For > example, in the case of sh:targetShape [ sh:class X ; sh:property [ sh:path > p ; sh:hasValue Z ] ] an algorithm has the choice between first looping > over all instances of X and then checking if they have Z or vice versa. > Yes, it's an opportunity for developing interesting algorithms, and such an > inverse validation algorithm would be beneficial and interesting for many > use cases anyway. I personally can at the moment not commit time for such > an algorithm so I would, in order to fulfill such a spec, introduce a > painfully slow brute-force algorithm. Other implementers may be in the same > boat, raising the bar for implementers significantly. > >> > >> Meanwhile, SPARQL-based targets already exist, and give users control > over how efficient the implementation will be able to understand them. For > example, such a target could just be "SELECT ?this WHERE { ?this > ex:nationality ex:Norwagian }" and any off-the-shelf SPARQL engine can be > used to evaluate that. > >> > >> So while I agree with the use case, and the fact that this might be > more direct than sh:filterShape (which has its own problems), I am quite > nervous that we are over-promising here. > >> > >> Do you guys already have implementations of such inverse validation > algorithms? > >> > >> --- > >> > >> Here is another thought: looking through the Core constraint types, I > guess most of them are hard to execute in the inverse order: sh:datatype, > sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, > sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, and also the > XYcount ones all basically require walking through all subjects and objects > in the graph. However, the following are quite easy to revert: > >> > >> - sh:class (= sh:targetClass) > >> - sh:hasValue > >> - sh:in > >> > >> So what if we simply introduce a new target type sh:targetHasValue V > where the targets can be identified by a direct look-up. For example > >> > >> ex:KiwiShape > >> sh:targetHasValue [ > >> sh:path ex:nationality ; > >> sh:hasValue ex:NewZealand ; > >> ] ; ... > >> > >> which amounts to asking ?this ex:nationality ex:NewZealand which is > super fast and covers both sh:hasValue and (to lesser extent) sh:in use > cases. In fact, such a thing can be easily expressed as a SHACL-SPARQL > target type already, and the syntax could be > >> > >> ex:KiwiShape > >> sh:target [ > >> a dash:HasValueTarget ; > >> dash:predicate ex:nationality ; > >> dash:value ex:NewZealand ; > >> ] ; ... > >> > >> and the underlying SPARQL query would be > >> > >> SELECT ?this > >> WHERE { > >> ?this $predicate $value . > >> } > >> > >> This wouldn't cover all use cases mentioned here, but at least covers > the hasValue scenario, and nothing new needs to be implemented or added to > the spec. > >> > >> Holger > >> > >> > >>> On 4/06/2020 19:31, Vladimir Alexiev wrote: > >>> Hi everyone! (This email is formatted as markdown) > >>> > >>> I have 2 objections to earlier proposals: > >>> - According to > https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape, > >>> `sh:filterShape` is always used with `$this` as seed and `sh:nodes` > as generator. > >>> So I don't think it can be used for our case. > >>> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a > disconnected manner > >>> (the former with just marker classes, the latter to carry the > actual target shape) > >>> > >>> I thought more about what Holger called `sh:targetNodesConforming`, > and I think what we need already exists: target by `NodeShape`. > >>> So I think we only need to add a new subsection of > https://www.w3.org/TR/shacl-af/#targets but no new classes or properties. > >>> > >>>> Separating sh:AllSubjects and sh:AllObjects separately would offer > more flexibility too > >>> Both subjects and objects are Nodes in the graph. > >>> I think `NodeShape` already gives us enough flexibility to select one > or the other > >>> (there are 2 related examples below: selecting by IRI pattern, and > selecting langString literals). > >>> Just like we don't have distinct `SubjectNodeShape` vs > `ObjectNodeShape`, > >>> I don't think we need such distinction for targeting either. > >>> > >>> Below is a proposal for such new subsection, please comment. > >>> > >>> # NodeShape Targets > >>> > >>> Sometimes it is useful to find nodes by shape, and then validate them > using another shape. > >>> To do this, you can use `sh:target` that is a `sh:NodeShape`: > >>> > >>> ``` > >>> ex:MyNodeShape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> <NodeShape constructs for target> > >>> ]; > >>> <NodeShape constructs for validation> > >>> . > >>> ``` > >>> > >>> In the following subsections we show several examples of this design. > >>> > >>> ## Target by Property and Object > >>> > >>> Norwegians must have one norwegianID: > >>> > >>> ``` > >>> ex:NorwegianShape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:property [sh:path ex:nationality; sh:hasValue ex:Norway]; > >>> ]; > >>> sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1]; > >>> . > >>> ``` > >>> > >>> ## Target Namespace Instances > >>> > >>> All instances in a given namespace must have a certain shape: > >>> > >>> ``` > >>> ex:CompanyShape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:nodeKind sh:IRI; > >>> sh:pattern "^https://company-graph.ontotext.com/resource/company/ > "; > >>> ]; > >>> sh:class ex:Company; > >>> sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" > "enterprise")]; > >>> . > >>> ``` > >>> > >>> ## Target All langStrings > >>> > >>> All langStrings must have one of a predefind set of languages: > >>> > >>> ``` > >>> ex:langStringShape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:datatype rdf:langString; > >>> ]; > >>> sh:languageIn ("en" "bg"); > >>> . > >>> ``` > >>> > >>> ## Target By Cardinality > >>> > >>> Let's say a person Steve is very popular, so everyone who knows at > least three people must know Steve: > >>> ``` > >>> ex:Personshape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:property [sh:path foaf:knows; sh:minCount 3]; > >>> ]; > >>> sh:property [sh:path foaf:knows; sh:hasValue ex:Steve]; > >>> . > >>> ``` > >>> > >>> ## Semantic Type Discrimination > >>> > >>> In some datasets, instances are not discriminated by `rdf:type` alone, > but also by other traits. > >>> Often more than one check needs to be performed. > >>> > >>> Eg in Geonames, all instances have type `gn:Feature`, and are further > discriminated by `gn:featureCode`. > >>> That's a 2-level classification of some 650 codes that includes > everything from continents to mountains to pipelines to hotels. > >>> > >>> Imagine that you're interested only in countries and top-level > administrative divisions (states, provinces and the like). > >>> - A bunch of codes correspond to the concept "country" > >>> - Countries have `gn:countryCode` > >>> - Only the code `gn:ADM1` corresponds to top-level administrative > divisions > >>> - Administrative divisions have `gn:parentCountry` > >>> (This does not describe all Geonames fields, only the ones that we > need.) > >>> > >>> ``` > >>> gn:Feature a sh:NodeShape, rdf:Class; > >>> # implicit: sh:targetClass gn:Feature; > >>> sh:property [sh:path gn:name; sh:datatype xsd:string; > sh:minCount 1; sh:maxCount 1]; > >>> sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI; > sh:minCount 1; sh:maxCount 1]; > >>> sh:property [sh:path gn:featureCode; sh:nodeKind sh:IRI; > sh:minCount 1; sh:maxCount 1]; > >>> . > >>> > >>> ex:CountryShape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:class gn:Feature; > >>> sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD > gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)]; > >>> ]; > >>> sh:property [sh:path gn:countryCode; sh:datatype xsd:string; > sh:minCount 1; sh:maxCount 1]; > >>> . > >>> > >>> ex:ADM1Shape a sh:NodeShape; > >>> sh:target [a sh:NodeShape; > >>> sh:class gn:Feature; > >>> sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1]; > >>> ]; > >>> sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; > sh:minCount 1; sh:maxCount 1]; > >>> . > >>> ``` > >>> > >>> ## Targeting and Reference Shapes > >>> > >>> In the last example we stated that `gn:parentCountry` must point to > something that satisfies `ex:CountryShape`. > >>> This means that every time we validate `ex:ADM1Shape`, we need to > validate its country (together with the country-specific properties). > >>> So the validation of ADM1 must recurse into validation of Country. > >>> > >>> This is not always convenient since it's hard to control this > recursive process. > >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its > regions, we'd have a recursive shape and the result would be undefined. > >>> > >>> It may therefore be more convenient to check only the **existence** of > Country from ADM1, > >>> and depend that some other process will check the validity of Country. > >>> We could do it like this: > >>> > >>> ``` > >>> ex:CountryReferenceShape a sh:NodeShape; > >>> sh:class gn:Feature; > >>> sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD > gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)]; > >>> . > >>> > >>> ex:CountryShape a sh:NodeShape; > >>> sh:target ex:CountryReferenceShape; > >>> sh:property [sh:path gn:countryCode; sh:datatype xsd:string; > sh:minCount 1; sh:maxCount 1]; > >>> . > >>> > >>> ex:ADM1ReferenceShape a sh:NodeShape; > >>> sh:class gn:Feature; > >>> sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1]; > >>> . > >>> > >>> ex:ADM1Shape a sh:NodeShape; > >>> sh:target ex:ADM1ReferenceShape; > >>> sh:property [sh:path gn:parentCountry; sh:node > ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1]; > >>> . > >>> ``` > >>> > >>> The significant change is in the last line: ADM1 checks > `ex:CountryReferenceShape` rather than `ex:CountryShape`. > >>> And we reuse `ex:CountryReferenceShape` as both: > >>> - Existence check in `ex:ADM1Shape` > >>> - Targeting shape in `ex:CountryShape` > >>> > >>> ## Politicians and Parties > >>> > >>> Let's say every Party has at least one Politician, > >>> every Politician belongs to exactly one Party (ok, that is > unrealistic), > >>> politicians are defined by a combination of `rdf:type` and `dc:type`, > >>> and both Parties and Politicians adhere to one of two politics > (Democrat vs Republican). > >>> > >>> If we model this with two shapes that refer to each other, we'd have > recursive shapes. > >>> So again we use two shapes for every entity: > >>> - A "smaller" ReferenceShape that just checks existence in terms of > "semantic type discrimination" > >>> - A "bigger" Shape that checks all other properties of the instance, > and uses the ReferenceShape for targeting > >>> > >>> This eliminates the recursion. > >>> > >>> ``` > >>> ex:PoliticianReferenceShape a sh:NodeShape; > >>> sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)]; > >>> sh:property [sh:path dc:type; sh:hasValue "politician"]; > >>> . > >>> ex:PoliticianShape a sh:NodeShape; > >>> sh:target ex:PoliticianReferenceShape; > >>> sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")]; > >>> sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; > sh:minCount 1; sh:maxCount 1]; > >>> . > >>> ex:PartyReference a sh:NodeShape; > >>> sh:property [sh:path rdf:type; sh:hasValue foaf:Organization]; > >>> sh:property [sh:path dc:type; sh:hasValue "political party"]; > >>> . > >>> ex:PartyShape a sh:NodeShape; > >>> sh:target ex:PartyReferenceShape; > >>> sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")]; > >>> sh:property [sh:path ex:politician; sh:node > ex:PoliticianReferenceShape; sh:minCount 1]; > >>> . > >>> ``` >
Received on Friday, 5 June 2020 07:03:06 UTC