- From: Håvard Ottestad <hmottestad@gmail.com>
- Date: Fri, 5 Jun 2020 10:00:26 +0200
- To: Holger Knublauch <holger@topquadrant.com>
- Cc: Public Shacl W3C <public-shacl@w3.org>
- Message-ID: <CAEKmdN0Wto40Bt=UUYey8knt962Gs0X-580_k4=_Xs6Ftdu50g@mail.gmail.com>
We are planning on generating a single SPARQL query for that case. We haven't started working on this yet. Our plan is to have two approaches: we analyze the transaction and estimate the cost of a "transactional" validation and a "full SPARQL" validation and run whichever is faster for that particular transaction.

Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or sh:targetClass rdfs:Resource for that matter? How would these be optimized better than a target shape representing "all objects that match a regex pattern"?

Håvard

On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch <holger@topquadrant.com> wrote:

>
> On 5/06/2020 17:02, Håvard Ottestad wrote:
>
> >Ok, does this apply to the case where you have a target shape and want to find all nodes in the graph that conform?
>
> Yes. All those are trivial as target shapes.
>
> For the example below the data added by the user becomes the starting point
>
> Ok, that's of course easier because you already have a small subset of nodes. But then we are not talking about the same use case. What happens if you need to run the full validation of the full graph? E.g. someone puts a sh:pattern on rdfs:label and there are (which is realistic) millions of labels in the database?
>
> Holger
>
>
> for the validation. A target is either added in this transaction, in which case we retrieve all its foaf:age paths and validate those. Or a path is added to an existing target, in which case we have a node to start on (the subject of the path).
>
> ex:CompanyShape a sh:NodeShape;
>   sh:target [a sh:NodeShape;
>     sh:nodeKind sh:IRI;
>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>   ];
>   sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
> .
>
> Håvard
>
> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch <holger@topquadrant.com> wrote:
>
>>
>> On 5/06/2020 15:49, Håvard Ottestad wrote:
>> > Hi,
>> >
>> > Just a quick response performance wise.
>> >
>> > SPARQL targets are very slow because the RDF4J ShaclSail can’t analyze a transaction to decide what to validate. Shape based targets on the other hand can be used to generate a validation plan that utilizes the changeset of the transaction to only validate a small subset of the data.
>> The design of named SPARQL targets means that if a name gets established (e.g. as a de-facto standard) then an engine may hard-code it. However, the SPARQL remains as a fallback.
>> >
>> > The more complex the target shape the larger this subset becomes and the more data needs to be considered.
>> >
>> > For us sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang are actually trivial to validate when used in single predicate path shapes.
>>
>> Ok, does this apply to the case where you have a target shape and want to find all nodes in the graph that conform?
>>
>> Holger
>>
>>
>> >
>> > We are currently supporting sh:hasValue, sh:or, sh:and, sh:property and sh:path as long as the effective path is a single predicate (so no nested sh:property).
>> >
>> > Håvard
>> >
>> >> On 5 Jun 2020, at 04:30, Holger Knublauch <holger@topquadrant.com> wrote:
>> >>
>> >> Hi Vladimir,
>> >>
>> >> from a specification point of view I see no show stoppers to introducing such a mechanism. I would however introduce a new property instead of sh:target, because the meaning of sh:target would otherwise be overloaded and it is possible for targets to also be sh:NodeShapes in which case the result will be very surprising. So, IMHO it should be something like sh:targetShape (or the earlier, verbose sh:targetNodesConforming).
>> >>
>> >> From a practical point of view, I remain very nervous about performance implications. It will be too easy for users to produce some really inefficient scenarios where any implementation almost certainly must iterate over all nodes in the whole graph. E.g. sh:targetShape [ sh:datatype xsd:string ] requires walking through all existing objects in the graph, likewise something with sh:languageIn or sh:pattern.
>> >>
>> >> If we offer such a feature then we may invite disappointment from users, and statements such as "SHACL is slow". Sometimes less is more. Note that any sh:targetShape statement means that even a simple check such as "is node N in the target of S" requires iterating over all sh:targetShapes each time. This can be very expensive.
>> >>
>> >> The implementation cost of this feature is significant, because it requires the implementation of an "inverse validation" algorithm. Validation starts with a focus node and returns a result. The inverse would start with the shape and has to discover the valid focus nodes. For example, in the case of sh:targetShape [ sh:class X ; sh:property [ sh:path p ; sh:hasValue Z ] ] an algorithm has the choice between first looping over all instances of X and then checking if they have Z or vice versa. Yes, it's an opportunity for developing interesting algorithms, and such an inverse validation algorithm would be beneficial and interesting for many use cases anyway. I personally can at the moment not commit time for such an algorithm so I would, in order to fulfill such a spec, introduce a painfully slow brute-force algorithm. Other implementers may be in the same boat, raising the bar for implementers significantly.
>> >>
>> >> Meanwhile, SPARQL-based targets already exist, and give users control over how efficiently the implementation will be able to understand them. For example, such a target could just be "SELECT ?this WHERE { ?this ex:nationality ex:Norwegian }" and any off-the-shelf SPARQL engine can be used to evaluate that.
>> >>
>> >> So while I agree with the use case, and the fact that this might be more direct than sh:filterShape (which has its own problems), I am quite nervous that we are over-promising here.
>> >>
>> >> Do you guys already have implementations of such inverse validation algorithms?
>> >>
>> >> ---
>> >>
>> >> Here is another thought: looking through the Core constraint types, I guess most of them are hard to execute in the inverse order: sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, and also the XYcount ones all basically require walking through all subjects and objects in the graph. However, the following are quite easy to revert:
>> >>
>> >> - sh:class (= sh:targetClass)
>> >> - sh:hasValue
>> >> - sh:in
>> >>
>> >> So what if we simply introduce a new target type sh:targetHasValue V where the targets can be identified by a direct look-up. For example
>> >>
>> >> ex:KiwiShape
>> >>     sh:targetHasValue [
>> >>         sh:path ex:nationality ;
>> >>         sh:hasValue ex:NewZealand ;
>> >>     ] ; ...
>> >>
>> >> which amounts to asking ?this ex:nationality ex:NewZealand which is super fast and covers both sh:hasValue and (to lesser extent) sh:in use cases. In fact, such a thing can be easily expressed as a SHACL-SPARQL target type already, and the syntax could be
>> >>
>> >> ex:KiwiShape
>> >>     sh:target [
>> >>         a dash:HasValueTarget ;
>> >>         dash:predicate ex:nationality ;
>> >>         dash:value ex:NewZealand ;
>> >>     ] ; ...
>> >>
>> >> and the underlying SPARQL query would be
>> >>
>> >> SELECT ?this
>> >> WHERE {
>> >>     ?this $predicate $value .
>> >> }
>> >>
>> >> This wouldn't cover all use cases mentioned here, but at least covers the hasValue scenario, and nothing new needs to be implemented or added to the spec.
>> >>
>> >> Holger
>> >>
>> >>
>> >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>> >>> Hi everyone!
>> >>> (This email is formatted as markdown)
>> >>>
>> >>> I have 2 objections to earlier proposals:
>> >>> - According to https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>> >>>   `sh:filterShape` is always used with `$this` as seed and `sh:nodes` as generator.
>> >>>   So I don't think it can be used for our case.
>> >>> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a disconnected manner
>> >>>   (the former with just marker classes, the latter to carry the actual target shape)
>> >>>
>> >>> I thought more about what Holger called `sh:targetNodesConforming`, and I think what we need already exists: target by `NodeShape`.
>> >>> So I think we only need to add a new subsection of https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.
>> >>>
>> >>>> Separating sh:AllSubjects and sh:AllObjects separately would offer more flexibility too
>> >>> Both subjects and objects are Nodes in the graph.
>> >>> I think `NodeShape` already gives us enough flexibility to select one or the other
>> >>> (there are 2 related examples below: selecting by IRI pattern, and selecting langString literals).
>> >>> Just like we don't have distinct `SubjectNodeShape` vs `ObjectNodeShape`,
>> >>> I don't think we need such distinction for targeting either.
>> >>>
>> >>> Below is a proposal for such new subsection, please comment.
>> >>>
>> >>> # NodeShape Targets
>> >>>
>> >>> Sometimes it is useful to find nodes by shape, and then validate them using another shape.
>> >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>> >>>
>> >>> ```
>> >>> ex:MyNodeShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     <NodeShape constructs for target>
>> >>>   ];
>> >>>   <NodeShape constructs for validation>
>> >>> .
>> >>> ```
>> >>>
>> >>> In the following subsections we show several examples of this design.
>> >>>
>> >>> ## Target by Property and Object
>> >>>
>> >>> Norwegians must have one norwegianID:
>> >>>
>> >>> ```
>> >>> ex:NorwegianShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>> >>>   ];
>> >>>   sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target Namespace Instances
>> >>>
>> >>> All instances in a given namespace must have a certain shape:
>> >>>
>> >>> ```
>> >>> ex:CompanyShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:nodeKind sh:IRI;
>> >>>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>> >>>   ];
>> >>>   sh:class ex:Company;
>> >>>   sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" "enterprise")];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target All langStrings
>> >>>
>> >>> All langStrings must have one of a predefined set of languages:
>> >>>
>> >>> ```
>> >>> ex:langStringShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:datatype rdf:langString;
>> >>>   ];
>> >>>   sh:languageIn ("en" "bg");
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target By Cardinality
>> >>>
>> >>> Let's say a person Steve is very popular, so everyone who knows at least three people must know Steve:
>> >>> ```
>> >>> ex:Personshape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:property [sh:path foaf:knows; sh:minCount 3];
>> >>>   ];
>> >>>   sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Semantic Type Discrimination
>> >>>
>> >>> In some datasets, instances are not discriminated by `rdf:type` alone, but also by other traits.
>> >>> Often more than one check needs to be performed.
>> >>>
>> >>> Eg in Geonames, all instances have type `gn:Feature`, and are further discriminated by `gn:featureCode`.
>> >>> That's a 2-level classification of some 650 codes that includes everything from continents to mountains to pipelines to hotels.
>> >>>
>> >>> Imagine that you're interested only in countries and top-level administrative divisions (states, provinces and the like).
>> >>> - A bunch of codes correspond to the concept "country"
>> >>> - Countries have `gn:countryCode`
>> >>> - Only the code `gn:ADM1` corresponds to top-level administrative divisions
>> >>> - Administrative divisions have `gn:parentCountry`
>> >>> (This does not describe all Geonames fields, only the ones that we need.)
>> >>>
>> >>> ```
>> >>> gn:Feature a sh:NodeShape, rdfs:Class;
>> >>>   # implicit: sh:targetClass gn:Feature;
>> >>>   sh:property [sh:path gn:name; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>>   sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>>   sh:property [sh:path gn:featureCode; sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:class gn:Feature;
>> >>>     sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>>   ];
>> >>>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:class gn:Feature;
>> >>>     sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>>   ];
>> >>>   sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Targeting and Reference Shapes
>> >>>
>> >>> In the last example we stated that `gn:parentCountry` must point to something that satisfies `ex:CountryShape`.
>> >>> This means that every time we validate `ex:ADM1Shape`, we need to validate its country (together with the country-specific properties).
>> >>> So the validation of ADM1 must recurse into validation of Country.
>> >>>
>> >>> This is not always convenient since it's hard to control this recursive process.
>> >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its regions, we'd have a recursive shape and the result would be undefined.
>> >>>
>> >>> It may therefore be more convenient to check only the **existence** of Country from ADM1,
>> >>> and depend that some other process will check the validity of Country.
>> >>> We could do it like this:
>> >>>
>> >>> ```
>> >>> ex:CountryReferenceShape a sh:NodeShape;
>> >>>   sh:class gn:Feature;
>> >>>   sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>   sh:target ex:CountryReferenceShape;
>> >>>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1ReferenceShape a sh:NodeShape;
>> >>>   sh:class gn:Feature;
>> >>>   sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>   sh:target ex:ADM1ReferenceShape;
>> >>>   sh:property [sh:path gn:parentCountry; sh:node ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> The significant change is in the last line: ADM1 checks `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>> >>> And we reuse `ex:CountryReferenceShape` as both:
>> >>> - Existence check in `ex:ADM1Shape`
>> >>> - Targeting shape in `ex:CountryShape`
>> >>>
>> >>> ## Politicians and Parties
>> >>>
>> >>> Let's say every Party has at least one Politician,
>> >>> every Politician belongs to exactly one Party (ok, that is unrealistic),
>> >>> politicians are defined by a combination of `rdf:type` and `dc:type`,
>> >>> and both Parties and Politicians adhere to one of two politics (Democrat vs Republican).
>> >>>
>> >>> If we model this with two shapes that refer to each other, we'd have recursive shapes.
>> >>> So again we use two shapes for every entity:
>> >>> - A "smaller" ReferenceShape that just checks existence in terms of "semantic type discrimination"
>> >>> - A "bigger" Shape that checks all other properties of the instance, and uses the ReferenceShape for targeting
>> >>>
>> >>> This eliminates the recursion.
>> >>>
>> >>> ```
>> >>> ex:PoliticianReferenceShape a sh:NodeShape;
>> >>>   sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>> >>>   sh:property [sh:path dc:type; sh:hasValue "politician"];
>> >>> .
>> >>> ex:PoliticianShape a sh:NodeShape;
>> >>>   sh:target ex:PoliticianReferenceShape;
>> >>>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>   sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ex:PartyReferenceShape a sh:NodeShape;
>> >>>   sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>> >>>   sh:property [sh:path dc:type; sh:hasValue "political party"];
>> >>> .
>> >>> ex:PartyShape a sh:NodeShape;
>> >>>   sh:target ex:PartyReferenceShape;
>> >>>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>   sh:property [sh:path ex:politician; sh:node ex:PoliticianReferenceShape; sh:minCount 1];
>> >>> .
>> >>> ```
>> >
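
For the full-graph case, a single query for the ex:CompanyShape example quoted above might look roughly like the sketch below. This is only an illustration under stated assumptions, not what the RDF4J ShaclSail generates: the query shape is assumed, the foaf: and xsd: prefixes are bound to their usual namespace IRIs, and ill-formed xsd:integer literals are not detected.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Assumed hand translation of ex:CompanyShape; illustrative only.
SELECT DISTINCT ?this ?value
WHERE {
  # Only nodes that have a foaf:age value can violate the property shape.
  ?this foaf:age ?value .
  # Target shape: sh:nodeKind sh:IRI plus sh:pattern on the node itself.
  FILTER (isIRI(?this) && REGEX(STR(?this), "^https://company-graph.ontotext.com/resource/company/"))
  # sh:datatype xsd:integer: report non-literals and literals of another datatype.
  FILTER (!isLiteral(?value) || DATATYPE(?value) != xsd:integer)
}
```

Each result row would be one violating value. The regex still has to be applied to the subject of every foaf:age statement in the graph, which is the kind of cost discussed in the thread.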
Received on Friday, 5 June 2020 08:00:53 UTC