- From: Holger Knublauch <holger@topquadrant.com>
- Date: Fri, 5 Jun 2020 18:32:27 +1000
- To: Håvard Ottestad <hmottestad@gmail.com>
- Cc: Public Shacl W3C <public-shacl@w3.org>
- Message-ID: <16bfce22-0106-1d92-1d6b-6cc2ec01faa0@topquadrant.com>
On 5/06/2020 18:00, Håvard Ottestad wrote:
> We are planning on generating a single SPARQL query for that case. We
> haven't started working on this yet. Our plan is to have two
> approaches: we analyze the transaction and estimate the cost of a
> "transactional" validation and a "full SPARQL" validation and run
> whichever is faster for that particular transaction.
>
> Wouldn't dash:AllSubjects and dash:AllObjects be just as slow, or
> sh:targetClass rdfs:Resource for that matter? How would these be
> optimized better than a target shape representing "all objects that
> match a regex pattern"?
Yes, that would be just as slow. I am not advocating that. I just wanted to
point out that there are risks in offering features that are too
powerful and thus very difficult for us implementers to cover properly.
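
As a rough illustration of the cost difference discussed in this thread, here
is a minimal sketch over a toy in-memory graph. It is not taken from any SHACL
engine: the triples, function names and the index are all hypothetical, and a
real store would use its own indexes. It contrasts a pattern-based target
(every node must be tested), a has-value target (one look-up), and the
transactional variant where only the nodes in a changeset are tested.

```python
import re
from collections import defaultdict

# Toy graph as (subject, predicate, object) tuples; all data is made up.
TRIPLES = [
    ("ex:alice", "ex:nationality", "ex:NewZealand"),
    ("ex:bob", "ex:nationality", "ex:Norway"),
    ("https://company-graph.ontotext.com/resource/company/acme", "foaf:age", "12"),
    ("ex:carol", "foaf:knows", "ex:alice"),
]


def targets_by_pattern(triples, pattern):
    """Brute force: test every subject and object against the regex."""
    regex = re.compile(pattern)
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return {n for n in nodes if regex.search(n)}


def build_po_index(triples):
    """Predicate/object index, standing in for what a triple store maintains."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index


def targets_by_has_value(index, predicate, value):
    """Direct look-up, like SELECT ?this WHERE { ?this $predicate $value }."""
    return set(index.get((predicate, value), set()))


def targets_in_changeset(added_triples, pattern):
    """Transactional variant: only nodes touched by the transaction are tested."""
    return targets_by_pattern(added_triples, pattern)


if __name__ == "__main__":
    company = "^https://company-graph.ontotext.com/resource/company/"
    print(targets_by_pattern(TRIPLES, company))
    print(targets_by_has_value(build_po_index(TRIPLES), "ex:nationality", "ex:NewZealand"))
    print(targets_in_changeset([("ex:dave", "foaf:age", "30")], company))
```

The pattern case grows with the number of nodes in the graph, the has-value
case with the size of the answer, and the changeset case with the size of the
transaction, which is why the full-graph run is the worrying one.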
Holger
>
> Håvard
>
> On Fri, Jun 5, 2020 at 9:31 AM Holger Knublauch
> <holger@topquadrant.com> wrote:
>
>
> On 5/06/2020 17:02, Håvard Ottestad wrote:
>> >Ok, does this apply to the case where you have a target shape
>> and want
>> to find all nodes in the graph that conform?
>>
>> Yes. All those are trivial as target shapes.
>>
>> For the example below the data added by the user becomes the
>> starting point
>
> Ok, that's of course easier because you already have a small
> subset of nodes. But then we are not talking about the same use
> case. What happens if you need to run the full validation of the
> full graph? E.g. someone puts a sh:pattern on rdfs:label and there
> are (which is realistic) millions of labels in the database?
>
> Holger
>
>
>> for the validation. A target is either added in this transaction,
>> in which case we retrieve all its foaf:age paths and validate
>> those. Or a path is added to an existing target, in which case we
>> have a node to start on (the subject of the path).
>>
>> ex:CompanyShape a sh:NodeShape;
>> sh:target [a sh:NodeShape;
>> sh:nodeKind sh:IRI;
>> sh:pattern
>> "^https://company-graph.ontotext.com/resource/company/";
>> ];
>> sh:property [sh:path foaf:age; sh:datatype xsd:integer ];
>> .
>>
>> Håvard
>>
>> On Fri, Jun 5, 2020 at 8:28 AM Holger Knublauch
>> <holger@topquadrant.com> wrote:
>>
>>
>> On 5/06/2020 15:49, Håvard Ottestad wrote:
>> > Hi,
>> >
>> > Just a quick response, performance-wise.
>> >
>> > SPARQL targets are very slow because the RDF4J ShaclSail
>> can’t analyze a transaction to decide what to validate. Shape
>> based targets on the other hand can be used to generate a
>> validation plan that utilizes the changeset of the
>> transaction to only validate a small subset of the data.
>> The design of named SPARQL targets means that if a name gets
>> established
>> (e.g. as a de-facto standard) then an engine may hard-code
>> it. However,
>> the SPARQL remains as a fallback.
>> >
>> > The more complex the target shape the larger this subset
>> becomes and the more data needs to be considered.
>> >
>> > For us sh:datatype, sh:nodeKind, sh:minExclusive etc,
>> sh:minLength etc, sh:pattern, sh:languageIn, sh:uniqueLang
>> are actually trivial to validate when used in single
>> predicate path shapes.
>>
>> Ok, does this apply to the case where you have a target shape
>> and want
>> to find all nodes in the graph that conform?
>>
>> Holger
>>
>>
>> >
>> > We are currently supporting sh:hasValue, sh:or, sh:and,
>> sh:property and sh:path as long as the effective path is a
>> single predicate (so no nested sh:property).
>> >
>> > Håvard
>> >
>> >> On 5 Jun 2020, at 04:30, Holger Knublauch
>> <holger@topquadrant.com> wrote:
>> >>
>> >> Hi Vladimir,
>> >>
>> >> from a specification point of view I see no show stoppers
>> to introducing such a mechanism. I would however introduce a
>> new property instead of sh:target, because the meaning of
>> sh:target would otherwise be overloaded and it is possible
>> for targets to also be sh:NodeShapes in which case the result
>> will be very surprising. So, IMHO it should be something like
>> sh:targetShape (or the earlier, verbose
>> sh:targetNodesConforming).
>> >>
>> >> From a practical point of view, I remain very nervous
>> about performance implications. It will be too easy for users
>> to produce some really inefficient scenarios where any
>> implementation almost certainly must iterate over all nodes
>> in the whole graph. E.g. sh:targetShape [ sh:datatype
>> xsd:string ] requires walking through all existing objects in
>> the graph, likewise something with sh:languageIn or sh:pattern.
>> >>
>> >> If we offer such a feature then we may invite
>> disappointment from users, and statements such as "SHACL is
>> slow". Sometimes less is more. Note that any sh:targetShape
>> statement means that even a simple check such as "is node N
>> in the target of S" requires iterating over all
>> sh:targetShapes each time. This can be very expensive.
>> >>
>> >> The implementation cost of this feature is significant,
>> because it requires the implementation of an "inverse
>> validation" algorithm. Validation starts with a focus node
>> and returns a result. The inverse would start with the shape
>> and has to discover the valid focus nodes. For example, in
>> the case of sh:targetShape [ sh:class X ; sh:property [
>> sh:path p ; sh:hasValue Z ] ] an algorithm has the choice
>> between first looping over all instances of X and then
>> checking if they have Z or vice versa. Yes, it's an
>> opportunity for developing interesting algorithms, and such
>> an inverse validation algorithm would be beneficial and
>> interesting for many use cases anyway. I personally can at
>> the moment not commit time for such an algorithm so I would,
>> in order to fulfill such a spec, introduce a painfully slow
>> brute-force algorithm. Other implementers may be in the same
>> boat, raising the bar for implementers significantly.
>> >>
>> >> Meanwhile, SPARQL-based targets already exist, and give
>> users control over how efficiently an implementation can
>> evaluate them. For example, such a target could
>> just be "SELECT ?this WHERE { ?this ex:nationality
>> ex:Norwegian }" and any off-the-shelf SPARQL engine can be
>> used to evaluate that.
>> >>
>> >> So while I agree with the use case, and the fact that this
>> might be more direct than sh:filterShape (which has its own
>> problems), I am quite nervous that we are over-promising here.
>> >>
>> >> Do you guys already have implementations of such inverse
>> validation algorithms?
>> >>
>> >> ---
>> >>
>> >> Here is another thought: looking through the Core
>> constraint types, I guess most of them are hard to execute in
>> the inverse order: sh:datatype, sh:nodeKind, sh:minExclusive
>> etc, sh:minLength etc, sh:pattern, sh:languageIn,
>> sh:uniqueLang, sh:lessThan etc, sh:closed, and also the
>> XYcount ones all basically require walking through all
>> subjects and objects in the graph. However, the following are
>> quite easy to invert:
>> >>
>> >> - sh:class (= sh:targetClass)
>> >> - sh:hasValue
>> >> - sh:in
>> >>
>> >> So what if we simply introduce a new target type
>> sh:targetHasValue V where the targets can be identified by a
>> direct look-up. For example
>> >>
>> >> ex:KiwiShape
>> >> sh:targetHasValue [
>> >> sh:path ex:nationality ;
>> >> sh:hasValue ex:NewZealand ;
>> >> ] ; ...
>> >>
>> >> which amounts to asking ?this ex:nationality ex:NewZealand
>> which is super fast and covers both sh:hasValue and (to
>> a lesser extent) sh:in use cases. In fact, such a thing can be
>> easily expressed as a SHACL-SPARQL target type already, and
>> the syntax could be
>> >>
>> >> ex:KiwiShape
>> >> sh:target [
>> >> a dash:HasValueTarget ;
>> >> dash:predicate ex:nationality ;
>> >> dash:value ex:NewZealand ;
>> >> ] ; ...
>> >>
>> >> and the underlying SPARQL query would be
>> >>
>> >> SELECT ?this
>> >> WHERE {
>> >> ?this $predicate $value .
>> >> }
>> >>
>> >> This wouldn't cover all use cases mentioned here, but at
>> least covers the hasValue scenario, and nothing new needs to
>> be implemented or added to the spec.
>> >>
>> >> Holger
>> >>
>> >>
>> >>> On 4/06/2020 19:31, Vladimir Alexiev wrote:
>> >>> Hi everyone! (This email is formatted as markdown)
>> >>>
>> >>> I have 2 objections to earlier proposals:
>> >>> - According to
>> https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
>> >>> `sh:filterShape` is always used with `$this` as seed
>> and `sh:nodes` as generator.
>> >>> So I don't think it can be used for our case.
>> >>> - It seems wrong to me to use `sh:target` and
>> `sh:filterShape` in a disconnected manner
>> >>> (the former with just marker classes, the latter to
>> carry the actual target shape)
>> >>>
>> >>> I thought more about what Holger called
>> `sh:targetNodesConforming`, and I think what we need already
>> exists: target by `NodeShape`.
>> >>> So I think we only need to add a new subsection of
>> https://www.w3.org/TR/shacl-af/#targets but no new classes or
>> properties.
>> >>>
>> >>>> Separating sh:AllSubjects and sh:AllObjects separately
>> would offer more flexibility too
>> >>> Both subjects and objects are Nodes in the graph.
>> >>> I think `NodeShape` already gives us enough flexibility
>> to select one or the other
>> >>> (there are 2 related examples below: selecting by IRI
>> pattern, and selecting langString literals).
>> >>> Just like we don't have distinct `SubjectNodeShape` vs
>> `ObjectNodeShape`,
>> >>> I don't think we need such distinction for targeting either.
>> >>>
>> >>> Below is a proposal for such new subsection, please comment.
>> >>>
>> >>> # NodeShape Targets
>> >>>
>> >>> Sometimes it is useful to find nodes by shape, and then
>> validate them using another shape.
>> >>> To do this, you can use `sh:target` that is a `sh:NodeShape`:
>> >>>
>> >>> ```
>> >>> ex:MyNodeShape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> <NodeShape constructs for target>
>> >>> ];
>> >>> <NodeShape constructs for validation>
>> >>> .
>> >>> ```
>> >>>
>> >>> In the following subsections we show several examples of
>> this design.
>> >>>
>> >>> ## Target by Property and Object
>> >>>
>> >>> Norwegians must have one norwegianID:
>> >>>
>> >>> ```
>> >>> ex:NorwegianShape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:property [sh:path ex:nationality; sh:hasValue
>> ex:Norway];
>> >>> ];
>> >>> sh:property [sh:path ex:norwegianID; sh:minCount 1;
>> sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target Namespace Instances
>> >>>
>> >>> All instances in a given namespace must have a certain shape:
>> >>>
>> >>> ```
>> >>> ex:CompanyShape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:nodeKind sh:IRI;
>> >>> sh:pattern
>> "^https://company-graph.ontotext.com/resource/company/";
>> >>> ];
>> >>> sh:class ex:Company;
>> >>> sh:property [sh:path dc:type; sh:in ("conglomerate"
>> "collective" "enterprise")];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target All langStrings
>> >>>
>> >>> All langStrings must have one of a predefined set of
>> languages:
>> >>>
>> >>> ```
>> >>> ex:langStringShape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:datatype rdf:langString;
>> >>> ];
>> >>> sh:languageIn ("en" "bg");
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Target By Cardinality
>> >>>
>> >>> Let's say a person Steve is very popular, so everyone who
>> knows at least three people must know Steve:
>> >>> ```
>> >>> ex:Personshape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:property [sh:path foaf:knows; sh:minCount 3];
>> >>> ];
>> >>> sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Semantic Type Discrimination
>> >>>
>> >>> In some datasets, instances are not discriminated by
>> `rdf:type` alone, but also by other traits.
>> >>> Often more than one check needs to be performed.
>> >>>
>> >>> E.g. in Geonames, all instances have type `gn:Feature`, and
>> are further discriminated by `gn:featureCode`.
>> >>> That's a 2-level classification of some 650 codes that
>> includes everything from continents to mountains to pipelines
>> to hotels.
>> >>>
>> >>> Imagine that you're interested only in countries and
>> top-level administrative divisions (states, provinces and the
>> like).
>> >>> - A bunch of codes correspond to the concept "country"
>> >>> - Countries have `gn:countryCode`
>> >>> - Only the code `gn:ADM1` corresponds to top-level
>> administrative divisions
>> >>> - Administrative divisions have `gn:parentCountry`
>> >>> (This does not describe all Geonames fields, only the
>> ones that we need.)
>> >>>
>> >>> ```
>> >>> gn:Feature a sh:NodeShape, rdfs:Class;
>> >>> # implicit: sh:targetClass gn:Feature;
>> >>> sh:property [sh:path gn:name; sh:datatype xsd:string;
>> sh:minCount 1; sh:maxCount 1];
>> >>> sh:property [sh:path gn:featureClass; sh:nodeKind
>> sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>> sh:property [sh:path gn:featureCode; sh:nodeKind
>> sh:IRI; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:class gn:Feature;
>> >>> sh:property [sh:path gn:featureCode; sh:in
>> (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR
>> gn:A.PCLF)];
>> >>> ];
>> >>> sh:property [sh:path gn:countryCode; sh:datatype
>> xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>> sh:target [a sh:NodeShape;
>> >>> sh:class gn:Feature;
>> >>> sh:property [sh:path gn:featureCode; sh:hasValue
>> gn:ADM1];
>> >>> ];
>> >>> sh:property [sh:path gn:parentCountry; sh:node
>> ex:CountryShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> ## Targeting and Reference Shapes
>> >>>
>> >>> In the last example we stated that `gn:parentCountry`
>> must point to something that satisfies `ex:CountryShape`.
>> >>> This means that every time we validate `ex:ADM1Shape`, we
>> need to validate its country (together with the
>> country-specific properties).
>> >>> So the validation of ADM1 must recurse into validation of
>> Country.
>> >>>
>> >>> This is not always convenient since it's hard to control
>> this recursive process.
>> >>> Furthermore, if Country referred back to `ex:ADM1Shape`
>> of its regions, we'd have a recursive shape and the result
>> would be undefined.
>> >>>
>> >>> It may therefore be more convenient to check only the
>> **existence** of Country from ADM1,
>> >>> and rely on some other process to check the
>> validity of Country.
>> >>> We could do it like this:
>> >>>
>> >>> ```
>> >>> ex:CountryReferenceShape a sh:NodeShape;
>> >>> sh:class gn:Feature;
>> >>> sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI
>> gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
>> >>> .
>> >>>
>> >>> ex:CountryShape a sh:NodeShape;
>> >>> sh:target ex:CountryReferenceShape;
>> >>> sh:property [sh:path gn:countryCode; sh:datatype
>> xsd:string; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>>
>> >>> ex:ADM1ReferenceShape a sh:NodeShape;
>> >>> sh:class gn:Feature;
>> >>> sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
>> >>> .
>> >>>
>> >>> ex:ADM1Shape a sh:NodeShape;
>> >>> sh:target ex:ADM1ReferenceShape;
>> >>> sh:property [sh:path gn:parentCountry; sh:node
>> ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ```
>> >>>
>> >>> The significant change is in the last line: ADM1 checks
>> `ex:CountryReferenceShape` rather than `ex:CountryShape`.
>> >>> And we reuse `ex:CountryReferenceShape` as both:
>> >>> - Existence check in `ex:ADM1Shape`
>> >>> - Targeting shape in `ex:CountryShape`
>> >>>
>> >>> ## Politicians and Parties
>> >>>
>> >>> Let's say every Party has at least one Politician,
>> >>> every Politician belongs to exactly one Party (ok, that
>> is unrealistic),
>> >>> politicians are defined by a combination of `rdf:type`
>> and `dc:type`,
>> >>> and both Parties and Politicians adhere to one of two
>> politics (Democrat vs Republican).
>> >>>
>> >>> If we model this with two shapes that refer to each
>> other, we'd have recursive shapes.
>> >>> So again we use two shapes for every entity:
>> >>> - A "smaller" ReferenceShape that just checks existence
>> in terms of "semantic type discrimination"
>> >>> - A "bigger" Shape that checks all other properties of
>> the instance, and uses the ReferenceShape for targeting
>> >>>
>> >>> This eliminates the recursion.
>> >>>
>> >>> ```
>> >>> ex:PoliticianReferenceShape a sh:NodeShape;
>> >>> sh:property [sh:path rdf:type; sh:in (foaf:Person
>> dbo:Person)];
>> >>> sh:property [sh:path dc:type; sh:hasValue "politician"];
>> >>> .
>> >>> ex:PoliticianShape a sh:NodeShape;
>> >>> sh:target ex:PoliticianReferenceShape;
>> >>> sh:property [sh:path ex:politics; sh:in ("Democrat"
>> "Republican")];
>> >>> sh:property [sh:path ex:party; sh:node
>> ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
>> >>> .
>> >>> ex:PartyReferenceShape a sh:NodeShape;
>> >>> sh:property [sh:path rdf:type; sh:hasValue
>> foaf:Organization];
>> >>> sh:property [sh:path dc:type; sh:hasValue "political
>> party"];
>> >>> .
>> >>> ex:PartyShape a sh:NodeShape;
>> >>> sh:target ex:PartyReferenceShape;
>> >>> sh:property [sh:path ex:politics; sh:in ("Democrat"
>> "Republican")];
>> >>> sh:property [sh:path ex:politician; sh:node
>> ex:PoliticianReferenceShape; sh:minCount 1];
>> >>> .
>> >>> ```
>>
Received on Friday, 5 June 2020 08:32:48 UTC