[data-shapes] Unit categorisation. (#782)

kurtcagle has just created a new issue for https://github.com/w3c/data-shapes:

== Unit categorisation. ==
It MAY be worth also thinking about a predicate such as sh:countShape, which would identify counts of things (number of people, number of aircraft, etc.). This could be a class, but I think a shape would be more flexible here.

Here's the use case I see: I want to represent different populations, say language speakers in a given country. The datatype is obviously a non-negative integer, but I may have 1.5M English speakers, 1.3M Spanish speakers, 0.8M French speakers, etc. (The differentiator could be anything: Republicans vs. Democrats, different age groups, and so forth.) sh:unit really doesn't handle that case, because the unit is still a person (or product or place). What I think we need instead is a unitShape or something similar that allows counts to be assigned to specific taxonomy shapes (English speaker, Democrat).

This is more of a design pattern. One potential solution:

```
@prefix ex: <http://example.org/demographics/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# ============================================
# DATASET DEFINITION
# ============================================

ex:usaLanguageDataset2023 a ex:PopulationDataset ;
    rdfs:label "USA Population by Primary Language, 2023" ;
    ex:observationYear "2023"^^xsd:gYear ;
    ex:geography ex:USA ;
    ex:hasObservation 
        ex:obs_total,
        ex:obs_english,
        ex:obs_spanish,
        ex:obs_chinese,
        ex:obs_other .

# ============================================
# CONTROLLED VOCABULARY (Language Taxonomy)
# ============================================

ex:LanguageTaxonomy a skos:ConceptScheme ;
    rdfs:label "ISO 639 Language Codes"@en .

ex:English a ex:Language, skos:Concept ;
    skos:prefLabel "English"@en ;
    skos:prefLabel "Anglais"@fr ;
    skos:notation "eng" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:Spanish a ex:Language, skos:Concept ;
    skos:prefLabel "Spanish"@en ;
    skos:prefLabel "Español"@es ;
    skos:notation "spa" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:Chinese a ex:Language, skos:Concept ;
    skos:prefLabel "Chinese"@en ;
    skos:prefLabel "中文"@zh ;
    skos:notation "zho" ;
    skos:inScheme ex:LanguageTaxonomy .

ex:OtherLanguages a ex:Language, skos:Concept ;
    skos:prefLabel "Other Languages"@en ;
    skos:notation "other" ;
    skos:inScheme ex:LanguageTaxonomy ;
    rdfs:comment "Aggregate category for all other languages" .

# ============================================
# OBSERVATIONS (The actual data)
# ============================================

# Total population (baseline for comparison)
ex:obs_total a qb:Observation ;
    rdfs:label "Total USA Population 2023" ;
    ex:category ex:TotalPopulation ;
    ex:count 331900000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# English speakers
ex:obs_english a qb:Observation ;
    rdfs:label "English speakers in USA, 2023" ;
    ex:language ex:English ;
    ex:count 258300000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Spanish speakers
ex:obs_spanish a qb:Observation ;
    rdfs:label "Spanish speakers in USA, 2023" ;
    ex:language ex:Spanish ;
    ex:count 41500000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Chinese speakers
ex:obs_chinese a qb:Observation ;
    rdfs:label "Chinese speakers in USA, 2023" ;
    ex:language ex:Chinese ;
    ex:count 3500000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# Other languages (aggregate)
ex:obs_other a qb:Observation ;
    rdfs:label "Other language speakers in USA, 2023" ;
    ex:language ex:OtherLanguages ;
    ex:count 28600000 ;
    ex:unit unit:Person ;
    ex:geography ex:USA ;
    ex:year "2023"^^xsd:gYear ;
    ex:partOf ex:usaLanguageDataset2023 .

# ============================================
# SHACL VALIDATION SHAPES
# ============================================

# Shape 1: Validate individual observations
ex:PopulationObservationShape a sh:NodeShape ;
    sh:targetClass qb:Observation ;
    rdfs:label "Population Observation Validation" ;
    
    # Must have exactly one count
    sh:property [
        sh:path ex:count ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:integer ;
        sh:minInclusive 0 ;
        sh:message "Count must be a non-negative integer" ;
    ] ;
    
    # Must have exactly one unit (and it must be 'Person' for population data)
    sh:property [
        sh:path ex:unit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:hasValue unit:Person ;
        sh:message "Unit must be specified as unit:Person" ;
    ] ;
    
    # Must have exactly one year
    sh:property [
        sh:path ex:year ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:gYear ;
        sh:message "Must specify observation year" ;
    ] ;
    
    # Must be part of a dataset
    sh:property [
        sh:path ex:partOf ;
        sh:minCount 1 ;
        sh:class ex:PopulationDataset ;
        sh:message "Observation must be part of a dataset" ;
    ] ;
    
    # Must have EITHER a language OR a category (not both)
    sh:xone (
        [
            sh:property [
                sh:path ex:language ;
                sh:minCount 1 ;
                sh:maxCount 1 ;
                sh:class ex:Language ;
            ]
        ]
        [
            sh:property [
                sh:path ex:category ;
                sh:minCount 1 ;
                sh:maxCount 1 ;
            ]
        ]
    ) ;
    sh:message "Observation must have exactly one dimension: language OR category" .

# Shape 2: Validate language concepts
ex:LanguageConceptShape a sh:NodeShape ;
    sh:targetClass ex:Language ;
    rdfs:label "Language Concept Validation" ;
    
    # Must have at least one preferred label
    sh:property [
        sh:path skos:prefLabel ;
        sh:minCount 1 ;
        sh:uniqueLang true ;
        sh:message "Language must have at least one prefLabel" ;
    ] ;
    
    # Must be in the language taxonomy
    sh:property [
        sh:path skos:inScheme ;
        sh:minCount 1 ;
        sh:hasValue ex:LanguageTaxonomy ;
        sh:message "Language must be in ISO 639 taxonomy" ;
    ] ;
    
    # Must have notation
    sh:property [
        sh:path skos:notation ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message "Language must have ISO 639 notation" ;
    ] .

# Shape 3: Validate dataset consistency (sum check)
ex:PopulationDatasetShape a sh:NodeShape ;
    sh:targetClass ex:PopulationDataset ;
    rdfs:label "Dataset Consistency Validation" ;
    
    # SPARQL constraint: language populations must not exceed total
    sh:sparql [
        sh:message "Sum of language populations exceeds total population" ;
        sh:select """
            PREFIX ex: <http://example.org/demographics/>
            PREFIX qb: <http://purl.org/linked-data/cube#>
            
            SELECT $this ?total ?languageSum (?languageSum - ?total AS ?excess)
            WHERE {
                # Get total population
                $this ex:hasObservation ?totalObs .
                ?totalObs ex:category ex:TotalPopulation ;
                          ex:count ?total ;
                          ex:year ?year .
                
                # Sum all language-specific populations
                {
                    SELECT $this ?year (SUM(?langCount) AS ?languageSum)
                    WHERE {
                        $this ex:hasObservation ?langObs .
                        ?langObs ex:language ?lang ;
                                 ex:count ?langCount ;
                                 ex:year ?year .
                    }
                    GROUP BY $this ?year
                }
                
                # Flag if sum exceeds total
                FILTER(?languageSum > ?total)
            }
        """ ;
    ] .

# ============================================
# SUPPLEMENTARY DEFINITIONS
# ============================================

ex:TotalPopulation a ex:PopulationCategory ;
    rdfs:label "Total Population" ;
    rdfs:comment "Undifferentiated total population count" .

ex:USA a ex:GeographicArea ;
    rdfs:label "United States of America" .

# Class definitions
ex:PopulationDataset a rdfs:Class ;
    rdfs:label "Population Dataset" .

ex:Language a rdfs:Class ;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "Language" .

ex:PopulationCategory a rdfs:Class ;
    rdfs:label "Population Category" .

ex:GeographicArea a rdfs:Class ;
    rdfs:label "Geographic Area" .
```

## What This Dataset Demonstrates

**1. The core pattern:**
- One dataset with multiple observations
- Each observation has a count + unit + dimension (language)
- A total observation for comparison
- Controlled vocabulary (SKOS taxonomy) for dimensions

**2. The validation approach:**
- **Shape 1:** Validates individual observations (structure)
- **Shape 2:** Validates taxonomy concepts (vocabulary quality)
- **Shape 3:** Validates cross-observation consistency (the sum check)

**3. Key design decisions:**
- `ex:unit` (the property constrained in Shape 1) vs. SHACL 1.2's `sh:unit`
- Language as dimension (ex:language) vs. category (ex:category)
- Using `sh:xone` to enforce "dimension XOR category" constraint
- SPARQL-based sum validation (expensive but necessary)
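
The `sh:xone` decision in the list above can be read as an exclusive-or over the two branches of Shape 1. As a rough illustration only (plain Python over dictionaries, not a SHACL engine), the sketch below mirrors that logic. The helper names and the dictionary layout are assumptions made for this example, and the `sh:class ex:Language` check on the language branch is left out.

```python
def branch_conforms(observation, prop):
    """A branch conforms when the property has exactly one value,
    mirroring sh:minCount 1 together with sh:maxCount 1."""
    return len(observation.get(prop, [])) == 1


def has_exactly_one_dimension(observation):
    """Mirror sh:xone: exactly one of the two branches may conform."""
    branches = ("ex:language", "ex:category")
    return sum(branch_conforms(observation, p) for p in branches) == 1


# Hypothetical observations, keyed by property name.
language_only = {"ex:language": ["ex:English"]}
category_only = {"ex:category": ["ex:TotalPopulation"]}
both = {"ex:language": ["ex:English"], "ex:category": ["ex:TotalPopulation"]}
neither = {}

for obs in (language_only, category_only, both, neither):
    print(has_exactly_one_dimension(obs))
```

Because each branch requires exactly one value, an observation with two languages and one category would still pass this reading of `sh:xone`: only the category branch conforms. That may or may not be the intent of the shape.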

**4. What validates successfully, and what triggers a violation:**
```
Total: 331,900,000
Sum:   258,300,000 (English)
     +  41,500,000 (Spanish)
     +   3,500,000 (Chinese)
     +  28,600,000 (Other)
     = 331,900,000 ✓

# Change ex:obs_english count to trigger sum violation
ex:obs_english ex:count 300000000 .  # Was 258,300,000

# Now sum = 373,600,000 > 331,900,000 total
# SPARQL constraint will fire
```
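
The arithmetic above can be checked outside of SPARQL. This is a minimal sketch in plain Python, using the counts from the observations; the `excess_over_total` helper is a name invented for this example, and it only mirrors the `FILTER(?languageSum > ?total)` condition rather than running the actual constraint.

```python
# Counts copied from the observations in the example dataset.
total = 331_900_000
by_language = {
    "English": 258_300_000,
    "Spanish": 41_500_000,
    "Chinese": 3_500_000,
    "Other": 28_600_000,
}


def excess_over_total(total, counts):
    """Return how far the summed counts exceed the total (0 if they do not)."""
    return max(0, sum(counts.values()) - total)


# Original data: the sum equals the total, so nothing is flagged.
print(excess_over_total(total, by_language))  # 0

# Same data with the English count raised to 300,000,000.
tampered = dict(by_language, English=300_000_000)
print(sum(tampered.values()))                 # 373600000
print(excess_over_total(total, tampered))     # 41700000
```

Note that the check only flags a sum that exceeds the total. A sum that falls short of the total would pass, which matches the constraint as written.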




Please view or discuss this issue at https://github.com/w3c/data-shapes/issues/782 using your GitHub account



Received on Thursday, 12 February 2026 01:25:28 UTC