- From: Holger Knublauch <holger@topquadrant.com>
- Date: Fri, 5 Jun 2020 18:32:27 +1000
- To: Håvard Ottestad <hmottestad@gmail.com>
- Cc: Public Shacl W3C <public-shacl@w3.org>
- Message-ID: <16bfce22-0106-1d92-1d6b-6cc2ec01faa0@topquadrant.com>
On 5/06/2020 18:00, Håvard Ottestad wrote:
> We are planning on generating a single SPARQL query for that case. We haven't started working on this yet. Our plan is to have two approaches: we analyze the transaction and estimate the cost of a "transactional" validation and a "full SPARQL" validation and run whichever is faster for that particular transaction.
>
> Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or sh:targetClass rdfs:Resource for that matter? How would these be optimized better than a target shape representing "all objects that match a regex pattern"?

Yes, that would be just as slow. I am not advocating that. Just wanted to point out that there are risks of offering features that are too overpowering and thus very difficult for us implementers to cover properly.

Holger

>
> Håvard
>
> On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch <holger@topquadrant.com> wrote:
>
> On 5/06/2020 17:02, Håvard Ottestad wrote:
>> > Ok, does this apply to the case where you have a target shape and want to find all nodes in the graph that conform?
>>
>> Yes. All those are trivial as target shapes.
>>
>> For the example below the data added by the user becomes the starting point
>
> Ok, that's of course easier because you already have a small subset of nodes. But then we are not talking about the same use case. What happens if you need to run the full validation of the full graph? E.g. someone puts a sh:pattern on rdfs:label and there are (which is realistic) millions of labels in the database?
>
> Holger
>
>> for the validation. A target is either added in this transaction, in which case we retrieve all its foaf:age paths and validate those. Or a path is added to an existing target, in which case we have a node to start on (the subject of the path).
>>
>> ex:CompanyShape a sh:NodeShape;
>>   sh:target [a sh:NodeShape;
>>     sh:nodeKind sh:IRI;
>>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>>   ];
>>   sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
>> .
>>
>> Håvard
>>
>> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch <holger@topquadrant.com> wrote:
>>
>> On 5/06/2020 15:49, Håvard Ottestad wrote:
>> > Hi,
>> >
>> > Just a quick response performance wise.
>> >
>> > SPARQL targets are very slow because the RDF4J ShaclSail can’t analyze a transaction to decide what to validate. Shape based targets on the other hand can be used to generate a validation plan that utilizes the changeset of the transaction to only validate a small subset of the data.
>>
>> The design of named SPARQL targets means that if a name gets established (e.g. as a de-facto standard) then an engine may hard-code it. However, the SPARQL remains as a fallback.
>>
>> > The more complex the target shape the larger this subset becomes and the more data needs to be considered.
>> >
>> > For us sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang are actually trivial to validate when used in single predicate path shapes.
>>
>> Ok, does this apply to the case where you have a target shape and want to find all nodes in the graph that conform?
>>
>> Holger
>>
>> > We are currently supporting sh:hasValue, sh:or, sh:and, sh:property and sh:path as long as the effective path is a single predicate (so no nested sh:property).
>> >
>> > Håvard
>> >
>> >> On 5 Jun 2020, at 04:30, Holger Knublauch <holger@topquadrant.com> wrote:
>> >>
>> >> Hi Vladimir,
>> >>
>> >> from a specification point of view I see no show stoppers to introducing such a mechanism.
>> >> I would however introduce a new property instead of sh:target, because the meaning of sh:target would otherwise be overloaded and it is possible for targets to also be sh:NodeShapes in which case the result will be very surprising. So, IMHO it should be something like sh:targetShape (or the earlier, verbose sh:targetNodesConforming).
>> >>
>> >> From a practical point of view, I remain very nervous about performance implications. It will be too easy for users to produce some really inefficient scenarios where any implementation almost certainly must iterate over all nodes in the whole graph. E.g. sh:targetShape [ sh:datatype xsd:string ] requires walking through all existing objects in the graph, likewise something with sh:languageIn or sh:pattern.
>> >>
>> >> If we offer such a feature then we may invite disappointment from users, and statements such as "SHACL is slow". Sometimes less is more. Note that any sh:targetShape statement means that even a simple check such as "is node N in the target of S" requires iterating over all sh:targetShapes each time. This can be very expensive.
>> >>
>> >> The implementation cost of this feature is significant, because it requires the implementation of an "inverse validation" algorithm. Validation starts with a focus node and returns a result. The inverse would start with the shape and has to discover the valid focus nodes. For example, in the case of sh:targetShape [ sh:class X ; sh:property [ sh:path p ; sh:hasValue Z ] ] an algorithm has the choice between first looping over all instances of X and then checking if they have Z or vice versa. Yes, it's an opportunity for developing interesting algorithms, and such an inverse validation algorithm would be beneficial and interesting for many use cases anyway.
>> >> I personally cannot commit time to such an algorithm at the moment, so I would, in order to fulfill such a spec, introduce a painfully slow brute-force algorithm. Other implementers may be in the same boat, raising the bar for implementers significantly.
>> >>
>> >> Meanwhile, SPARQL-based targets already exist, and give users control over how efficiently an implementation will be able to evaluate them. For example, such a target could just be "SELECT ?this WHERE { ?this ex:nationality ex:Norwegian }" and any off-the-shelf SPARQL engine can be used to evaluate that.
>> >>
>> >> So while I agree with the use case, and the fact that this might be more direct than sh:filterShape (which has its own problems), I am quite nervous that we are over-promising here.
>> >>
>> >> Do you guys already have implementations of such inverse validation algorithms?
>> >>
>> >> ---
>> >>
>> >> Here is another thought: looking through the Core constraint types, I guess most of them are hard to execute in the inverse order: sh:datatype, sh:nodeKind, sh:minExclusive etc, sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang, sh:lessThan etc, sh:closed, and also the XYcount ones all basically require walking through all subjects and objects in the graph. However, the following are quite easy to invert:
>> >>
>> >> - sh:class (= sh:targetClass)
>> >> - sh:hasValue
>> >> - sh:in
>> >>
>> >> So what if we simply introduce a new target type sh:targetHasValue V where the targets can be identified by a direct look-up. For example
>> >>
>> >> ex:KiwiShape
>> >>   sh:targetHasValue [
>> >>     sh:path ex:nationality ;
>> >>     sh:hasValue ex:NewZealand ;
>> >>   ] ; ...
>> >>
>> >> which amounts to asking ?this ex:nationality ex:NewZealand which is super fast and covers both sh:hasValue and (to lesser extent) sh:in use cases.
>> >> In fact, such a thing can be easily expressed as a SHACL-SPARQL target type already, and the syntax could be
>> >>
>> >> ex:KiwiShape
>> >>   sh:target [
>> >>     a dash:HasValueTarget ;
>> >>     dash:predicate ex:nationality ;
>> >>     dash:value ex:NewZealand ;
>> >>   ] ; ...
>> >>
>> >> and the underlying SPARQL query would be
>> >>
>> >> SELECT ?this
>> >> WHERE {
>> >>   ?this $predicate $value .
>> >> }
>> >>
>> >> This wouldn't cover all use cases mentioned here, but at least covers the hasValue scenario, and nothing new needs to be implemented or added to the spec.
>> >>
>> >> Holger
>> >>
>> >>
>> >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>> >>> Hi everyone! (This email is formatted as markdown)
>> >>>
>> >>> I have 2 objections to earlier proposals:
>> >>> - According to https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>> >>>   `sh:filterShape` is always used with `$this` as seed and `sh:nodes` as generator.
>> >>>   So I don't think it can be used for our case.
>> >>> - It seems wrong to me to use `sh:target` and `sh:filterShape` in a disconnected manner
>> >>>   (the former with just marker classes, the latter to carry the actual target shape)
>> >>>
>> >>> I thought more about what Holger called `sh:targetNodesConforming`, and I think what we need already exists: target by `NodeShape`.
>> >>> So I think we only need to add a new subsection of https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.
>> >>>
>> >>>> Separating sh:AllSubjects and sh:AllObjects separately would offer more flexibility too
>> >>> Both subjects and objects are Nodes in the graph.
>> >>> I think `NodeShape` already gives us enough flexibility to select one or the other
>> >>> (there are 2 related examples below: selecting by IRI pattern, and selecting langString literals).
>> >>> Just like we don't have distinct `SubjectNodeShape` vs `ObjectNodeShape`,
>> >>> I don't think we need such a distinction for targeting either.
>> >>>
>> >>> Below is a proposal for such a new subsection, please comment.
>> >>>
>> >>> # NodeShape Targets
>> >>>
>> >>> Sometimes it is useful to find nodes by shape, and then validate them using another shape.
>> >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>> >>>
>> >>> ```
>> >>> ex:MyNodeShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     <NodeShape constructs for target>
>> >>>   ];
>> >>>   <NodeShape constructs for validation>
>> >>> .
>> >>> ```
>> >>>
>> >>> In the following subsections we show several examples of this design.
>> >>>
>> >>> ## Target by Property and Object
>> >>>
>> >>> Norwegians must have exactly one norwegianID:
>> >>>
>> >>> ```
>> >>> ex:NorwegianShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
>> >>>   ];
>> >>>   sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target Namespace Instances
>> >>>
>> >>> All instances in a given namespace must have a certain shape:
>> >>>
>> >>> ```
>> >>> ex:CompanyShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:nodeKind sh:IRI;
>> >>>     sh:pattern "^https://company-graph.ontotext.com/resource/company/";
>> >>>   ];
>> >>>   sh:class ex:Company;
>> >>>   sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" "enterprise")];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target All langStrings
>> >>>
>> >>> All langStrings must have one of a predefined set of languages:
>> >>>
>> >>> ```
>> >>> ex:langStringShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:datatype rdf:langString;
>> >>>   ];
>> >>>   sh:languageIn ("en" "bg");
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target By Cardinality
>> >>>
>> >>> Let's say a person Steve is very popular, so everyone who knows at least three people must know Steve:
>> >>> ```
>> >>> ex:PersonShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:property [sh:path foaf:knows; sh:minCount 3];
>> >>>   ];
>> >>>   sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Semantic Type Discrimination
>> >>>
>> >>> In some datasets, instances are not discriminated by `rdf:type` alone, but also by other traits.
>> >>> Often more than one check needs to be performed.
>> >>>
>> >>> E.g. in Geonames, all instances have type `gn:Feature`, and are further discriminated by `gn:featureCode`.
>> >>> That's a 2-level classification of some 650 codes that includes everything from continents to mountains to pipelines to hotels.
>> >>>
>> >>> Imagine that you're interested only in countries and top-level administrative divisions (states, provinces and the like).
>> >>> - A bunch of codes correspond to the concept "country"
>> >>> - Countries have `gn:countryCode`
>> >>> - Only the code `gn:ADM1` corresponds to top-level administrative divisions
>> >>> - Administrative divisions have `gn:parentCountry`
>> >>> (This does not describe all Geonames fields, only the ones that we need.)
>> >>>
>> >>> ```
>> >>> gn:Feature a sh:NodeShape, rdfs:Class;
>> >>>   # implicit: sh:targetClass gn:Feature;
>> >>>   sh:property [sh:path gn:name; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>>   sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>>   sh:property [sh:path gn:featureCode; sh:nodeKind sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:class gn:Feature;
>> >>>     sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>>   ];
>> >>>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>   sh:target [a sh:NodeShape;
>> >>>     sh:class gn:Feature;
>> >>>     sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>>   ];
>> >>>   sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Targeting and Reference Shapes
>> >>>
>> >>> In the last example we stated that `gn:parentCountry` must point to something that satisfies `ex:CountryShape`.
>> >>> This means that every time we validate `ex:ADM1Shape`, we need to validate its country (together with the country-specific properties).
>> >>> So the validation of ADM1 must recurse into validation of Country.
>> >>>
>> >>> This is not always convenient since it's hard to control this recursive process.
>> >>> Furthermore, if Country referred back to `ex:ADM1Shape` of its regions, we'd have a recursive shape and the result would be undefined.
>> >>>
>> >>> It may therefore be more convenient to check only the **existence** of Country from ADM1,
>> >>> and rely on some other process to check the validity of Country.
>> >>> We could do it like this:
>> >>>
>> >>> ```
>> >>> ex:CountryReferenceShape a sh:NodeShape;
>> >>>   sh:class gn:Feature;
>> >>>   sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>>   sh:target ex:CountryReferenceShape;
>> >>>   sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1ReferenceShape a sh:NodeShape;
>> >>>   sh:class gn:Feature;
>> >>>   sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>>   sh:target ex:ADM1ReferenceShape;
>> >>>   sh:property [sh:path gn:parentCountry; sh:node ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> The significant change is in the last line: ADM1 checks `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>> >>> And we reuse `ex:CountryReferenceShape` as both:
>> >>> - Existence check in `ex:ADM1Shape`
>> >>> - Targeting shape in `ex:CountryShape`
>> >>>
>> >>> ## Politicians and Parties
>> >>>
>> >>> Let's say every Party has at least one Politician,
>> >>> every Politician belongs to exactly one Party (ok, that is unrealistic),
>> >>> politicians are defined by a combination of `rdf:type` and `dc:type`,
>> >>> and both Parties and Politicians adhere to one of two politics (Democrat vs Republican).
>> >>>
>> >>> If we model this with two shapes that refer to each other, we'd have recursive shapes.
>> >>> So again we use two shapes for every entity:
>> >>> - A "smaller" ReferenceShape that just checks existence in terms of "semantic type discrimination"
>> >>> - A "bigger" Shape that checks all other properties of the instance, and uses the ReferenceShape for targeting
>> >>>
>> >>> This eliminates the recursion.
>> >>>
>> >>> ```
>> >>> ex:PoliticianReferenceShape a sh:NodeShape;
>> >>>   sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
>> >>>   sh:property [sh:path dc:type; sh:hasValue "politician"];
>> >>> .
>> >>> ex:PoliticianShape a sh:NodeShape;
>> >>>   sh:target ex:PoliticianReferenceShape;
>> >>>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>   sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ex:PartyReferenceShape a sh:NodeShape;
>> >>>   sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
>> >>>   sh:property [sh:path dc:type; sh:hasValue "political party"];
>> >>> .
>> >>> ex:PartyShape a sh:NodeShape;
>> >>>   sh:target ex:PartyReferenceShape;
>> >>>   sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
>> >>>   sh:property [sh:path ex:politician; sh:node ex:PoliticianReferenceShape; sh:minCount 1];
>> >>> .
>> >>> ```
>>
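
The cost difference debated in the thread (a direct look-up for sh:hasValue versus a walk over every node for sh:pattern) can be illustrated with a minimal Python sketch. It rests on loudly stated assumptions: the graph is a toy in-memory set of string triples, IRIs and literals are not distinguished (so sh:nodeKind is not modelled), and the names `TRIPLES`, `build_po_index`, `targets_by_has_value` and `targets_by_pattern` are hypothetical. It is not how any particular SHACL engine works.

```python
import re

# Hypothetical toy graph: a set of (subject, predicate, object) string triples.
TRIPLES = {
    ("ex:alice", "ex:nationality", "ex:NewZealand"),
    ("ex:bob", "ex:nationality", "ex:Norway"),
    ("ex:alice", "rdfs:label", "Alice"),
    ("https://company-graph.ontotext.com/resource/company/1", "rdfs:label", "Acme"),
}


def build_po_index(triples):
    """Index subjects by (predicate, object), roughly as a triple store might."""
    index = {}
    for s, p, o in triples:
        index.setdefault((p, o), set()).add(s)
    return index


def targets_by_has_value(index, predicate, value):
    """Inverse of a hasValue constraint: one index look-up, no graph scan."""
    return index.get((predicate, value), set())


def targets_by_pattern(triples, pattern):
    """Inverse of a pattern constraint: every subject and object is inspected."""
    regex = re.compile(pattern)
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return {node for node in nodes if regex.search(node)}


index = build_po_index(TRIPLES)
kiwis = targets_by_has_value(index, "ex:nationality", "ex:NewZealand")
companies = targets_by_pattern(
    TRIPLES, "^https://company-graph.ontotext.com/resource/company/"
)
```

In this sketch `kiwis` is found with a single dictionary access, while `companies` requires touching every node in the graph, which is the scaling concern raised above for graphs with millions of labels.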
Received on Friday, 5 June 2020 08:32:48 UTC