Re: SHACL target extension from Vladimir Alexiev on 2020-06-04 (public-shacl@w3.org from June 2020)

From: Vladimir Alexiev <vladimir.alexiev@ontotext.com>
Date: Thu, 4 Jun 2020 12:31:53 +0300
To: Public Shacl W3C <public-shacl@w3.org>
Message-ID: <CAMv+wg52_mHXkP6S8V6fQ0FMy0dQbc7k0hF9_+ZZRBwAWCHV7w@mail.gmail.com>
Hi everyone! (This email is formatted as markdown)

I have 2 objections to earlier proposals:
- According to https://www.w3.org/TR/shacl-af/#node-expressions-filter-shape,
  `sh:filterShape` is always used with `$this` as seed and `sh:nodes` as generator.
  So I don't think it can be used for our case.
- It seems wrong to me to use `sh:target` and `sh:filterShape` in a disconnected manner
  (the former with just marker classes, the latter to carry the actual target shape).

I thought more about what Holger called `sh:targetNodesConforming`, and I
think what we need already exists: target by `NodeShape`.
So I think we only need to add a new subsection of
https://www.w3.org/TR/shacl-af/#targets but no new classes or properties.

> Separating sh:AllSubjects and sh:AllObjects separately would offer more flexibility too

Both subjects and objects are Nodes in the graph.
I think `NodeShape` already gives us enough flexibility to select one or
the other
(there are 2 related examples below: selecting by IRI pattern, and
selecting langString literals).
Just like we don't have distinct `SubjectNodeShape` vs `ObjectNodeShape`,
I don't think we need such distinction for targeting either.

Below is a proposal for such new subsection, please comment.

# NodeShape Targets

Sometimes it is useful to find nodes by shape, and then validate them using
another shape.
To do this, you can use `sh:target` that is a `sh:NodeShape`:

```
ex:MyNodeShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    <NodeShape constructs for target>
  ];
  <NodeShape constructs for validation>
.
```
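
For comparison, SHACL-AF can already express such targets with SPARQL. The following is only a sketch of how a property-and-object target might be written today with `sh:SPARQLTarget`. The `ex:` namespace IRI, the shape name `ex:NorwegianSparqlShape`, and the properties `ex:nationality` and `ex:norwegianID` are assumptions for illustration. A NodeShape target would state the same condition declaratively, without a query string.

```
@prefix ex:  <http://example.org/> .                 # assumed namespace
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Prefix declaration used by the SPARQL query below
ex: a owl:Ontology;
  sh:declare [sh:prefix "ex"; sh:namespace "http://example.org/"^^xsd:anyURI];
.

ex:NorwegianSparqlShape a sh:NodeShape;
  sh:target [a sh:SPARQLTarget;
    sh:prefixes ex: ;
    sh:select """
      SELECT ?this
      WHERE { ?this ex:nationality ex:Norway }
      """;
  ];
  sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
.
```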

In the following subsections we show several examples of this design.

## Target by Property and Object

Norwegians must have one norwegianID:

```
ex:NorwegianShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:property [sh:path ex:nationality; sh:hasValue ex:Norway];
  ];
  sh:property [sh:path ex:norwegianID; sh:minCount 1; sh:maxCount 1];
.
```
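
A small data graph may help to see which nodes this shape would select under the proposed semantics. This is only a sketch: the `ex:` namespace IRI, the individuals, and the ID value are made up.

```
@prefix ex: <http://example.org/> .   # assumed namespace

# Targeted (has ex:nationality ex:Norway) and valid: exactly one ex:norwegianID.
ex:Kari ex:nationality ex:Norway;
  ex:norwegianID "ID-1".

# Targeted but invalid: no ex:norwegianID, so sh:minCount 1 is violated.
ex:Ola ex:nationality ex:Norway.

# Not targeted: the nationality is not ex:Norway, so the ID constraint is never checked.
ex:Sven ex:nationality ex:Sweden.
```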

## Target Namespace Instances

All instances in a given namespace must have a certain shape:

```
ex:CompanyShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:nodeKind sh:IRI;
    sh:pattern "^https://company-graph.ontotext.com/resource/company/";
  ];
  sh:class ex:Company;
  sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" "enterprise")];
.
```
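
The following data sketch illustrates the selection, including a node that occurs only as an object. The `ex:` and `dc:` namespace IRIs, the `cmp:` prefix, the property `ex:worksFor`, and all individuals are assumptions for illustration.

```
@prefix ex:  <http://example.org/> .                 # assumed namespace
@prefix dc:  <http://purl.org/dc/elements/1.1/> .    # assumed meaning of dc:
@prefix cmp: <https://company-graph.ontotext.com/resource/company/> .

# Targeted (the IRI matches the pattern) and valid.
cmp:acme a ex:Company;
  dc:type "enterprise".

# Targeted but invalid: the dc:type value is not in the sh:in list.
cmp:initech a ex:Company;
  dc:type "startup".

# cmp:ghost occurs only as an object, but is still a node in the graph:
# targeted, and invalid because it is not an instance of ex:Company.
# ex:Ann is an IRI outside the company namespace, so it is not targeted.
ex:Ann ex:worksFor cmp:ghost.
```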

## Target All langStrings

All langStrings must have one of a predefined set of languages:

```
ex:langStringShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:datatype rdf:langString;
  ];
  sh:languageIn ("en" "bg");
.
```
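
Here the target nodes are literals, which can only occur as objects. A minimal data sketch, with an assumed `ex:` namespace and made-up properties `ex:title` and `ex:pages`:

```
@prefix ex: <http://example.org/> .   # assumed namespace

ex:doc
  ex:title "Hello"@en;       # targeted, valid
  ex:title "Здравей"@bg;     # targeted, valid
  ex:title "Bonjour"@fr;     # targeted, invalid: "fr" is not in sh:languageIn
  ex:title "plain";          # xsd:string, not targeted
  ex:pages 12.               # xsd:integer, not targeted
```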

## Target By Cardinality

Let's say a person Steve is very popular, so everyone who knows at least
three people must know Steve:
```
ex:PersonShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:property [sh:path foaf:knows; sh:minCount 3];
  ];
  sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
.
```
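
A data sketch for this case, with an assumed `ex:` namespace and made-up individuals:

```
@prefix ex:   <http://example.org/> .         # assumed namespace
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Targeted (knows three people) and valid: one of them is ex:Steve.
ex:Ann foaf:knows ex:Bob, ex:Carl, ex:Steve.

# Targeted but invalid: knows three people, none of them is ex:Steve.
ex:Bob foaf:knows ex:Ann, ex:Carl, ex:Dan.

# Not targeted: knows only two people.
ex:Carl foaf:knows ex:Ann, ex:Bob.
```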

## Semantic Type Discrimination

In some datasets, instances are not discriminated by `rdf:type` alone, but
also by other traits.
Often more than one check needs to be performed.

E.g., in Geonames, all instances have type `gn:Feature`, and are further
discriminated by `gn:featureCode`.
That's a 2-level classification of some 650 codes that includes everything
from continents to mountains to pipelines to hotels.

Imagine that you're interested only in countries and top-level
administrative divisions (states, provinces and the like).
- A bunch of codes correspond to the concept "country"
- Countries have `gn:countryCode`
- Only the code `gn:ADM1` corresponds to top-level administrative divisions
- Administrative divisions have `gn:parentCountry`

(This does not describe all Geonames fields, only the ones that we need.)

```
gn:Feature a sh:NodeShape, rdfs:Class;
  # implicit: sh:targetClass gn:Feature;
  sh:property [sh:path gn:name;         sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
  sh:property [sh:path gn:featureClass; sh:nodeKind sh:IRI;     sh:minCount 1; sh:maxCount 1];
  sh:property [sh:path gn:featureCode;  sh:nodeKind sh:IRI;     sh:minCount 1; sh:maxCount 1];
.

ex:CountryShape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:class gn:Feature;
    sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
  ];
  sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
.

ex:ADM1Shape a sh:NodeShape;
  sh:target [a sh:NodeShape;
    sh:class gn:Feature;
    sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
  ];
  sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
.
```
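
The following data sketch shows how the three shapes would divide up the features. The `ex:` namespace and the individuals are made up (real Geonames IRIs look different); the feature codes are written as in the shapes above.

```
@prefix ex: <http://example.org/> .                  # assumed namespace
@prefix gn: <http://www.geonames.org/ontology#> .

# Targeted by ex:CountryShape (gn:A.PCLI is in the sh:in list); has one gn:countryCode.
ex:Norway a gn:Feature;
  gn:name "Norway";
  gn:featureClass gn:A;
  gn:featureCode gn:A.PCLI;
  gn:countryCode "NO".

# Targeted by ex:ADM1Shape; its gn:parentCountry is validated against ex:CountryShape.
ex:Oslo a gn:Feature;
  gn:name "Oslo";
  gn:featureClass gn:A;
  gn:featureCode gn:ADM1;
  gn:parentCountry ex:Norway.

# Targeted by neither ex:CountryShape nor ex:ADM1Shape,
# but still checked by the gn:Feature shape through its implicit class target.
ex:GrandHotel a gn:Feature;
  gn:name "Grand Hotel";
  gn:featureClass gn:S;
  gn:featureCode gn:S.HTL.
```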

## Targeting and Reference Shapes

In the last example we stated that `gn:parentCountry` must point to
something that satisfies `ex:CountryShape`.
This means that every time we validate `ex:ADM1Shape`, we need to validate
its country (together with the country-specific properties).
So the validation of ADM1 must recurse into validation of Country.

This is not always convenient since it's hard to control this recursive
process.
Furthermore, if Country referred back to `ex:ADM1Shape` of its regions,
we'd have a recursive shape and the result would be undefined.
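
To make the problem concrete, here is a sketch of such a back reference. How Country would refer to its regions is not specified above, so the inverse path over `gn:parentCountry` is an assumption, as is the `ex:` namespace. Targets are omitted for brevity.

```
@prefix ex:  <http://example.org/> .                 # assumed namespace
@prefix gn:  <http://www.geonames.org/ontology#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# ADM1 -> Country, as in the previous example
ex:ADM1Shape a sh:NodeShape;
  sh:property [sh:path gn:parentCountry; sh:node ex:CountryShape; sh:minCount 1; sh:maxCount 1];
.

# Country -> ADM1 (hypothetical): every feature that points at this country
# must itself conform to ex:ADM1Shape, which closes the cycle.
ex:CountryShape a sh:NodeShape;
  sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
  sh:property [sh:path [sh:inversePath gn:parentCountry]; sh:node ex:ADM1Shape];
.
```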

It may therefore be more convenient to check only the **existence** of
Country from ADM1,
and rely on some other process to check the validity of Country.
We could do it like this:

```
ex:CountryReferenceShape a sh:NodeShape;
  sh:class gn:Feature;
  sh:property [sh:path gn:featureCode; sh:in (gn:A.PCLI gn:A.PCLD gn:A.PCLIX gn:A.PCLS gn:A.PCL gn:A.TERR gn:A.PCLF)];
.

ex:CountryShape a sh:NodeShape;
  sh:target ex:CountryReferenceShape;
  sh:property [sh:path gn:countryCode; sh:datatype xsd:string; sh:minCount 1; sh:maxCount 1];
.

ex:ADM1ReferenceShape a sh:NodeShape;
  sh:class gn:Feature;
  sh:property [sh:path gn:featureCode; sh:hasValue gn:ADM1];
.

ex:ADM1Shape a sh:NodeShape;
  sh:target ex:ADM1ReferenceShape;
  sh:property [sh:path gn:parentCountry; sh:node ex:CountryReferenceShape; sh:minCount 1; sh:maxCount 1];
.
```

The significant change is in the last line: ADM1 checks
`ex:CountryReferenceShape` rather than `ex:CountryShape`.
And we reuse `ex:CountryReferenceShape` as both:
- Existence check in `ex:ADM1Shape`
- Targeting shape in `ex:CountryShape`
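
A data sketch of the practical effect, considering only the four shapes of this section. The `ex:` namespace and the individuals are made up.

```
@prefix ex: <http://example.org/> .                  # assumed namespace
@prefix gn: <http://www.geonames.org/ontology#> .

# Conforms to ex:CountryReferenceShape, so it is targeted by ex:CountryShape.
# Invalid there: gn:countryCode is missing. This is reported once, on the country.
ex:Norway a gn:Feature;
  gn:featureCode gn:A.PCLI.

# Targeted by ex:ADM1Shape and valid: its parent country satisfies
# ex:CountryReferenceShape, and the country's missing code is not re-checked here.
ex:Oslo a gn:Feature;
  gn:featureCode gn:ADM1;
  gn:parentCountry ex:Norway.
```

With `sh:node ex:CountryShape`, as in the previous section, `ex:Oslo` would fail as well, because validating it would recurse into the full country shape.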

## Politicians and Parties

Let's say every Party has at least one Politician,
every Politician belongs to exactly one Party (ok, that is unrealistic),
politicians are defined by a combination of `rdf:type` and `dc:type`,
and both Parties and Politicians adhere to one of two politics (Democrat vs
Republican).

If we model this with two shapes that refer to each other, we'd have
recursive shapes.
So again we use two shapes for every entity:
- A "smaller" ReferenceShape that just checks existence in terms of
"semantic type discrimination"
- A "bigger" Shape that checks all other properties of the instance, and
uses the ReferenceShape for targeting

This eliminates the recursion.

```
ex:PoliticianReferenceShape a sh:NodeShape;
  sh:property [sh:path rdf:type; sh:in (foaf:Person dbo:Person)];
  sh:property [sh:path dc:type; sh:hasValue "politician"];
.
ex:PoliticianShape a sh:NodeShape;
  sh:target ex:PoliticianReferenceShape;
  sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
  sh:property [sh:path ex:party; sh:node ex:PartyReferenceShape; sh:minCount 1; sh:maxCount 1];
.
ex:PartyReferenceShape a sh:NodeShape;
  sh:property [sh:path rdf:type; sh:hasValue foaf:Organization];
  sh:property [sh:path dc:type; sh:hasValue "political party"];
.
ex:PartyShape a sh:NodeShape;
  sh:target ex:PartyReferenceShape;
  sh:property [sh:path ex:politics; sh:in ("Democrat" "Republican")];
  sh:property [sh:path ex:politician; sh:node ex:PoliticianReferenceShape; sh:minCount 1];
.
```
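
A data sketch for these shapes, assuming that `ex:PartyReferenceShape` names the party reference shape. The namespace IRIs for `ex:` and `dc:` and all individuals are assumptions for illustration.

```
@prefix ex:   <http://example.org/> .                # assumed namespace
@prefix dc:   <http://purl.org/dc/elements/1.1/> .   # assumed meaning of dc:
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Targeted by ex:PartyShape and valid: its politician satisfies ex:PoliticianReferenceShape.
ex:BlueParty a foaf:Organization;
  dc:type "political party";
  ex:politics "Democrat";
  ex:politician ex:Jane.

# Targeted by ex:PoliticianShape and valid: exactly one party, which satisfies ex:PartyReferenceShape.
ex:Jane a foaf:Person;
  dc:type "politician";
  ex:politics "Democrat";
  ex:party ex:BlueParty.

# Targeted by ex:PoliticianShape but invalid: ex:politics is outside the list and ex:party is missing.
ex:John a dbo:Person;
  dc:type "politician";
  ex:politics "Independent".

# Targeted by neither shape: dc:type is neither "politician" nor "political party".
ex:ChessClub a foaf:Organization;
  dc:type "club".
```
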
Received on Thursday, 4 June 2020 09:32:20 UTC