- From: Kurt Cagle via GitHub <noreply@w3.org>
- Date: Thu, 12 Feb 2026 01:25:27 +0000
- To: public-shacl@w3.org
kurtcagle has just created a new issue for https://github.com/w3c/data-shapes:
== Unit categorisation. ==
It MAY also be worth thinking about a predicate such as sh:countShape, used to identify counts of things (number of people, number of aircraft, etc.). This could be a class, but I think a shape would be more flexible here.
Here's the use case I see: I want to represent different populations, say language speakers in a given country. The datatype is obviously a non-negative integer, but I may have 1.5M English speakers, 1.3M Spanish speakers, 0.8M French speakers, etc. (The differentiator could be anything: Republicans vs. Democrats, different age groups, and so forth.) sh:unit doesn't really handle that case, because the unit is a person (or product or place); what I think we need instead is a unitShape or something similar that allows counts to be assigned to specific taxonomy shapes (English speaker, Democrat).
This is more of a design pattern. One potential solution:
```
@prefix ex: <http://example.org/demographics/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
# ============================================
# DATASET DEFINITION
# ============================================
ex:usaLanguageDataset2023 a ex:PopulationDataset ;
    rdfs:label "USA Population by Primary Language, 2023" ;
    ex:observationYear "2023"^^xsd:gYear ;
    ex:geography ex:USA ;
    ex:hasObservation
        ex:obs_total,
        ex:obs_english,
        ex:obs_spanish,
        ex:obs_chinese,
        ex:obs_other .
# ============================================
# CONTROLLED VOCABULARY (Language Taxonomy)
# ============================================
ex:LanguageTaxonomy a skos:ConceptScheme ;
    rdfs:label "ISO 639 Language Codes"@en .

ex:English a ex:Language, skos:Concept ;
    skos:prefLabel "English"@en ;
    skos:prefLabel "Anglais"@fr ;
    skos:notation "eng" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:Spanish a ex:Language, skos:Concept ;
    skos:prefLabel "Spanish"@en ;
    skos:prefLabel "Español"@es ;
    skos:notation "spa" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:Chinese a ex:Language, skos:Concept ;
    skos:prefLabel "Chinese"@en ;
    skos:prefLabel "中文"@zh ;
    skos:notation "zho" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:OtherLanguages a ex:Language, skos:Concept ;
    skos:prefLabel "Other Languages"@en ;
    skos:notation "other" ;
    skos:inScheme ex:LanguageTaxonomy ;
    rdfs:comment "Aggregate category for all other languages" .
# ============================================
# OBSERVATIONS (The actual data)
# ============================================
# Total population (baseline for comparison)
ex:obs_total a qb:Observation ;
    rdfs:label "Total USA Population 2023" ;
    ex:category ex:TotalPopulation ;
    ex:count 331900000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# English speakers
ex:obs_english a qb:Observation ;
    rdfs:label "English speakers in USA, 2023" ;
    ex:language ex:English ;
    ex:count 258300000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Spanish speakers
ex:obs_spanish a qb:Observation ;
    rdfs:label "Spanish speakers in USA, 2023" ;
    ex:language ex:Spanish ;
    ex:count 41500000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Chinese speakers
ex:obs_chinese a qb:Observation ;
    rdfs:label "Chinese speakers in USA, 2023" ;
    ex:language ex:Chinese ;
    ex:count 3500000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Other languages (aggregate)
ex:obs_other a qb:Observation ;
    rdfs:label "Other language speakers in USA, 2023" ;
    ex:language ex:OtherLanguages ;
    ex:count 28600000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .
# ============================================
# SHACL VALIDATION SHAPES
# ============================================
# Shape 1: Validate individual observations
ex:PopulationObservationShape a sh:NodeShape ;
    sh:targetClass qb:Observation ;
    rdfs:label "Population Observation Validation" ;

    # Must have exactly one count
    sh:property [
        sh:path ex:count ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:integer ;
        sh:minInclusive 0 ;
        sh:message "Count must be a non-negative integer" ;
    ] ;

    # Must have exactly one unit (and it must be 'Person' for population data)
    sh:property [
        sh:path ex:unit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:hasValue unit:Person ;
        sh:message "Unit must be specified as unit:Person" ;
    ] ;

    # Must have exactly one year
    sh:property [
        sh:path ex:year ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:gYear ;
        sh:message "Must specify observation year" ;
    ] ;

    # Must be part of a dataset
    sh:property [
        sh:path ex:partOf ;
        sh:minCount 1 ;
        sh:class ex:PopulationDataset ;
        sh:message "Observation must be part of a dataset" ;
    ] ;

    # Must have EITHER a language OR a category (not both)
    sh:xone (
        [
            sh:property [
                sh:path ex:language ;
                sh:minCount 1 ;
                sh:maxCount 1 ;
                sh:class ex:Language ;
            ]
        ]
        [
            sh:property [
                sh:path ex:category ;
                sh:minCount 1 ;
                sh:maxCount 1 ;
            ]
        ]
    ) ;
    sh:message "Observation must have exactly one dimension: language OR category" .
# Shape 2: Validate language concepts
ex:LanguageConceptShape a sh:NodeShape ;
    sh:targetClass ex:Language ;
    rdfs:label "Language Concept Validation" ;

    # Must have at least one preferred label
    sh:property [
        sh:path skos:prefLabel ;
        sh:minCount 1 ;
        sh:uniqueLang true ;
        sh:message "Language must have at least one prefLabel" ;
    ] ;

    # Must be in the language taxonomy
    sh:property [
        sh:path skos:inScheme ;
        sh:minCount 1 ;
        sh:hasValue ex:LanguageTaxonomy ;
        sh:message "Language must be in ISO 639 taxonomy" ;
    ] ;

    # Must have notation
    sh:property [
        sh:path skos:notation ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message "Language must have ISO 639 notation" ;
    ] .
# Shape 3: Validate dataset consistency (sum check)
ex:PopulationDatasetShape a sh:NodeShape ;
    sh:targetClass ex:PopulationDataset ;
    rdfs:label "Dataset Consistency Validation" ;

    # SPARQL constraint: language populations must not exceed total
    sh:sparql [
        sh:message "Sum of language populations exceeds total population" ;
        sh:select """
            PREFIX ex: <http://example.org/demographics/>
            PREFIX qb: <http://purl.org/linked-data/cube#>
            SELECT $this ?total ?languageSum (?languageSum - ?total AS ?excess)
            WHERE {
                # Get total population
                $this ex:hasObservation ?totalObs .
                ?totalObs ex:category ex:TotalPopulation ;
                          ex:count ?total ;
                          ex:year ?year .
                # Sum all language-specific populations
                {
                    SELECT $this ?year (SUM(?langCount) AS ?languageSum)
                    WHERE {
                        $this ex:hasObservation ?langObs .
                        ?langObs ex:language ?lang ;
                                 ex:count ?langCount ;
                                 ex:year ?year .
                    }
                    GROUP BY $this ?year
                }
                # Flag if sum exceeds total
                FILTER(?languageSum > ?total)
            }
        """ ;
    ] .
# ============================================
# SUPPLEMENTARY DEFINITIONS
# ============================================
ex:TotalPopulation a ex:PopulationCategory ;
    rdfs:label "Total Population" ;
    rdfs:comment "Undifferentiated total population count" .

ex:USA a ex:GeographicArea ;
    rdfs:label "United States of America" .

# Class definitions
ex:PopulationDataset a rdfs:Class ;
    rdfs:label "Population Dataset" .

ex:Language a rdfs:Class ;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "Language" .

ex:PopulationCategory a rdfs:Class ;
    rdfs:label "Population Category" .

ex:GeographicArea a rdfs:Class ;
    rdfs:label "Geographic Area" .
```
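One possible refinement to Shape 3: the `SELECT` already projects `?total`, `?languageSum`, and `?excess`, but the message never reports them. SHACL-SPARQL allows `sh:message` to reference projected variables through `{?varName}` placeholders, so the same constraint could be written roughly as below. This is a sketch only: the message wording is an assumption, and the query is meant to match Shape 3 and to replace it rather than sit alongside it.

```
@prefix ex: <http://example.org/demographics/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Same sum check as Shape 3, with the projected values surfaced in the message
ex:PopulationDatasetShape a sh:NodeShape ;
    sh:targetClass ex:PopulationDataset ;
    sh:sparql [
        # {?languageSum}, {?total}, and {?excess} are filled from each solution row
        sh:message "Language populations sum to {?languageSum}, exceeding the total {?total} by {?excess}" ;
        sh:select """
            PREFIX ex: <http://example.org/demographics/>
            SELECT $this ?total ?languageSum (?languageSum - ?total AS ?excess)
            WHERE {
                $this ex:hasObservation ?totalObs .
                ?totalObs ex:category ex:TotalPopulation ;
                          ex:count ?total ;
                          ex:year ?year .
                {
                    SELECT $this ?year (SUM(?langCount) AS ?languageSum)
                    WHERE {
                        $this ex:hasObservation ?langObs .
                        ?langObs ex:language ?lang ;
                                 ex:count ?langCount ;
                                 ex:year ?year .
                    }
                    GROUP BY $this ?year
                }
                FILTER(?languageSum > ?total)
            }
        """ ;
    ] .
```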
## What This Dataset Demonstrates
**1. The core pattern:**
- One dataset with multiple observations
- Each observation has a count + unit + dimension (language)
- A total observation for comparison
- Controlled vocabulary (SKOS taxonomy) for dimensions
**2. The validation approach:**
- **Shape 1:** Validates individual observations (structure)
- **Shape 2:** Validates taxonomy concepts (vocabulary quality)
- **Shape 3:** Validates cross-observation consistency (the sum check)
**3. Key design decisions:**
- A custom `ex:unit` property (constrained in Shape 1) vs. SHACL 1.2's `sh:unit`
- Language as dimension (`ex:language`) vs. category (`ex:category`)
- Using `sh:xone` to enforce the "language XOR category" constraint
- SPARQL-based sum validation (expensive but necessary)
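The same pattern should carry over to the other differentiators mentioned at the top of this issue, such as party affiliation. A minimal sketch follows, assuming a hypothetical `ex:party` dimension property, a hypothetical `ex:PoliticalParty` class, and a hypothetical `ex:Democrat` concept, none of which appear in the dataset above. The count is a placeholder, not real data.

```
@prefix ex: <http://example.org/demographics/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Hypothetical concept for the new dimension
ex:Democrat a ex:PoliticalParty .

# Hypothetical observation: same count + unit structure, different dimension
ex:obs_democrat a qb:Observation ;
    ex:party ex:Democrat ;      # hypothetical dimension property
    ex:count 1000 ;             # placeholder value, not real data
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Hypothetical shape that could be appended as a third member of the sh:xone list
ex:PartyDimensionShape a sh:NodeShape ;
    sh:property [
        sh:path ex:party ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:class ex:PoliticalParty ;
    ] .
```

With this approach, each new dimension means editing the `sh:xone` list in Shape 1, which is part of what a dedicated predicate such as the proposed `sh:countShape` might avoid.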
**4. What validates successfully, and what triggers a violation:**
```
Total: 331,900,000
Sum: 258,300,000 (English)
+ 41,500,000 (Spanish)
+ 3,500,000 (Chinese)
+ 28,600,000 (Other)
= 331,900,000 ✓
# Change ex:obs_english count to trigger sum violation
ex:obs_english ex:count 300000000 . # Was 258,300,000
# Now sum = 373,600,000 > 331,900,000 total
# SPARQL constraint will fire
```
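With the modified count, a SHACL processor would be expected to return a validation report roughly like the one below. This is a sketch built from the standard SHACL results vocabulary, not output captured from a real validator. Actual reports typically carry further properties (for example `sh:sourceConstraint`, and possibly `sh:value`), which are omitted here.

```
@prefix ex: <http://example.org/demographics/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        # The focus node is the dataset, since Shape 3 targets ex:PopulationDataset
        sh:focusNode ex:usaLanguageDataset2023 ;
        sh:sourceShape ex:PopulationDatasetShape ;
        sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
        sh:resultMessage "Sum of language populations exceeds total population" ;
    ] .
```

Shapes 1 and 2 should still pass in this scenario, because only the magnitude of one count changes and not the structure of any observation or concept.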
Please view or discuss this issue at https://github.com/w3c/data-shapes/issues/782 using your GitHub account
Received on Thursday, 12 February 2026 01:25:28 UTC