RE: Proposal for representing Aggregate Statistical Data

Definitely RDF Data Cube. I’d already passed on this thread to some colleagues in the SDMX/DDI world.

Also look at SSN (https://www.w3.org/TR/vocab-ssn/), in particular the small vocabulary in the core SOSA namespace.
This addresses an even more fine-grained viewpoint, metadata at the per-value level, but the vocabulary is clearly relevant at the higher aggregation levels too: collections of observations which are about the same thing, which I’m working out here: https://w3c.github.io/sdw/proposals/ssn-extensions/

I suggest that it would be relatively straightforward, and helpful, to align these terminologies or to develop mappings between them.


In more detail, SOSA formalizes the following terminology:

- Observation – for an act of observing, the result of which is an information item
- Sampling – for an act of sampling, the result of which is a sample of something bigger, which might be material or a population
- feature-of-interest – the target of the act of observation or sampling
- procedure – the re-usable recipe or protocol used for the act ~ schema:measurementTechnique
- observed-property – ~ schema:variableMeasured
- phenomenon-time – the world time for the resulting information
- result-time – the time the information item was generated (not necessarily the same as the phenomenon-time)

etc. This language is in turn derived from ISO/OGC O&M, which is widely used in the geospatial, earth and environmental sciences.

(A mapping to schema.org is here: https://github.com/w3c/sdw/blob/gh-pages/ssn/rdf/sosa-sdo-mapping.ttl )
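
To make the alignment idea concrete, here is a rough sketch in Python of how one observation from the quoted proposal (Obs2) might be re-keyed with SOSA terms. The SOSA property names are taken from the SOSA namespace, but the pairing with the proposal's properties is only an assumption made for illustration; it is not an agreed mapping, and `to_sosa` is a hypothetical helper.

```python
# Illustrative only: the pairing below is an assumption, not an agreed mapping.
SOSA = "http://www.w3.org/ns/sosa/"

PROPOSAL_TO_SOSA = {
    "observedNode": SOSA + "hasFeatureOfInterest",
    "measuredProperty": SOSA + "observedProperty",
    "measuredValue": SOSA + "hasSimpleResult",
    "observationDate": SOSA + "phenomenonTime",
    "measurementMethod": SOSA + "usedProcedure",
}


def to_sosa(observation):
    """Rename keys that have an assumed SOSA counterpart; keep the rest as-is."""
    return {PROPOSAL_TO_SOSA.get(key, key): value for key, value in observation.items()}


# Obs2 as given in the quoted proposal.
obs2 = {
    "observedNode": "SP1",
    "measuredProperty": "count",
    "measuredValue": 1213,
    "observationDate": "2016",
    "marginOfError": 22,
    "measurementMethod": "CensusACS5yrSurvey",
}

sosa_obs2 = to_sosa(obs2)
```

Properties with no assumed counterpart (here marginOfError) are passed through unchanged, which is where a real mapping exercise would need further discussion.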

Simon

From: Makx Dekkers [mailto:mail@makxdekkers.com]
Sent: Tuesday, 25 June, 2019 23:49
To: Dan Brickley <danbri@google.com>
Cc: Dataset Exchange Working Group <public-dxwg-wg@w3.org>
Subject: Re: Proposal for representing Aggregate Statistical Data

Dan,

Thanks for this. It seems to me that this is not directly relevant for DCAT, as DCAT does not look very deep into the data itself. It is more akin to Data Cube (https://www.w3.org/TR/vocab-data-cube/) which was specifically designed for "multi-dimensional data, such as statistics" and is compatible with the main standard used for statistical data, SDMX.

Do you know whether the proposed schema.org approach is based on Data Cube?

Maybe you could also try to get feedback from Richard Cyganiak and Dave Reynolds, the editors of Data Cube.

Makx.

On Tue, 25 Jun 2019 at 14:32, Dan Brickley <danbri@google.com> wrote:

This proposal might be of interest here. It should be consistent with DCAT in its various flavours, as it is more concerned with the content communicated by a statistical dataset. If you have comments, please pass them along via Guha or myself, or on public-schemaorg@w3.org

cheers,

Dan

---------- Forwarded message ---------
From: Guha <guha@google.com>
Date: Mon, 24 Jun 2019 at 20:12
Subject: Proposal for representing Aggregate Statistical Data
To: schema.org Mailing List <public-schemaorg@w3.org>

This document can be accessed here: <https://docs.google.com/document/d/139jXakeQk4ChwCkGjqq5wJfCPMDnwIV94oCH-JzJrhM/edit?usp=sharing>

Look forward to feedback.

Guha

Representing aggregate statistics


Examples of aggregate statistical reports include those from Census Organizations (e.g., American Community Survey), Health Organizations (e.g., CDC Wonder) and many others. This is a schema, currently in use on DataCommons.org, for representing facts stated in these reports. This document describes certain general mechanisms for representing statistical populations and associated observations. It will be followed later by a companion proposal suggesting some basic common vocabulary useful for representing the kind of data released by the US Census, CDC, etc.

Our interest is not in describing a data set or mapping columns in csv files, but in representing the actual data itself. Other efforts have focused on characterizing data cubes in terms of dimensions, etc. While we draw upon their work, our goals are different.

Examples of the kind of statistics we would like to represent include:

1. In 2016, there were 1213 people in East Podunk, California, who were male, married, with a median age of 22.
2. In 2017, there were 20 deaths in Falooda County where the cause of death was XYZ

We will refer to ‘number of people who are male, hispanic’, ‘number of deaths where cause of death was XYZ’, etc. as variables. Since the number of possible variables increases combinatorially, we clearly can’t have a property for each variable (or worse, a property for each variable × year). We need a compositional way of constructing variable references. We use the concept of a StatisticalPopulation to do this construction.

A StatisticalPopulation is a set of instances of a certain given type that satisfy some set of constraints. The property populationType is used to specify the type. Any property that can be used on instances of that type can appear on the statistical population. An instance of StatisticalPopulation whose populationType is C1, which has the properties p1, p2, … with values v1, v2, … corresponds to the set of objects of type C1 that have the property p1 with value v1, property p2 with value v2, etc. The properties numConstraints and constrainingProperties are used to specify which of the population’s properties are used to define the population. In the two examples above:


Node: SP1
type: StatisticalPopulation
populationType: Person
location: EastPodunkCalifornia
gender: Male
maritalStatus: Married
numConstraints: 3
constrainingProperties: location, gender, maritalStatus


Node: SP2
type: StatisticalPopulation
populationType: MortalityEvent
location: FaloodaCounty
causeOfDeath: XYZ
numConstraints: 2
constrainingProperties: location, causeOfDeath


SP1 is an abstract set in the sense that it does not correspond to a particular set of people who satisfy that constraint at a certain point in time, but rather, to an abstract specification, about which we can make observations that are grounded at a particular point in time. We now turn our attention to the representation of these observations.

Instances of the class Observation are used to specify observations about an entity (which may or may not be an instance of a StatisticalPopulation) at a particular time. The principal properties of an Observation are observedNode, measuredProperty, measuredValue (or median, etc.) and observationDate. (measuredProperty can, but need not always, be a W3C RDF Data Cube "measure property", as in the lifeExpectancy example here: https://www.w3.org/TR/vocab-data-cube/#dsd-example.) In the two examples above:


Node: Obs1
type: Observation
observedNode: SP1
measuredProperty: age
median: “23 years”
observationDate: “2016”

Node: Obs2
type: Observation
observedNode: SP1
measuredProperty: count
measuredValue: 1213
observationDate: “2016”

Node: Obs3
type: Observation
observedNode: SP2
measuredProperty: count
measuredValue: 20
observationDate: “2017”


Observations can also have properties related to the measurement technique, margin of error, etc. To elaborate on Obs2 above, we can have:

Node: Obs2
type: Observation
observedNode: SP1
measuredProperty: count
measuredValue: 1213
observationDate: “2016”
marginOfError: 22
measurementMethod: CensusACS5yrSurvey


Notes:
1. Care needs to be exercised when querying StatisticalPopulations, to make sure that the query specifies all the constraining properties.
2. We do not yet have a way of using properties that are named in the opposite direction, e.g. we can handle "alumniOf" (relating a person to an org), but not the case where the only existing property is "alumni" (relating an org to a person).
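
One reading of note 1 can be sketched in Python as follows: a query that only pattern-matches on some properties will also pick up more specific populations, so the number of constraints has to be checked as well. SP3 and both helper functions are hypothetical, introduced only for this illustration.

```python
populations = {
    # SP1 from the examples above.
    "SP1": {"populationType": "Person", "location": "EastPodunkCalifornia",
            "gender": "Male", "maritalStatus": "Married", "numConstraints": 3},
    # SP3 is hypothetical: all males in East Podunk, married or not.
    "SP3": {"populationType": "Person", "location": "EastPodunkCalifornia",
            "gender": "Male", "numConstraints": 2},
}


def naive_match(populations, population_type, constraints):
    """Match any population that carries the given property values."""
    return sorted(
        name for name, node in populations.items()
        if node["populationType"] == population_type
        and all(node.get(prop) == value for prop, value in constraints.items())
    )


def exact_match(populations, population_type, constraints):
    """Additionally require that the population has no other constraints."""
    return sorted(
        name for name in naive_match(populations, population_type, constraints)
        if populations[name]["numConstraints"] == len(constraints)
    )


query = {"location": "EastPodunkCalifornia", "gender": "Male"}
naive = naive_match(populations, "Person", query)  # also returns the married-only SP1
exact = exact_match(populations, "Person", query)  # only the population actually asked for
```

Without the numConstraints check, a count attached to SP1 (married males) could be mistaken for the count of all males in the same place.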


--
--------------------------------------------------------------------------------
Makx Dekkers
mail@makxdekkers.com
--------------------------------------------------------------------------------

Received on Tuesday, 25 June 2019 20:41:30 UTC