Proposal for representing Aggregate Statistical Data from Raphaël Troncy on 2019-06-24 (public-semstats@w3.org from June 2019)

From: Raphaël Troncy <raphael.troncy@eurecom.fr>
Date: Mon, 24 Jun 2019 22:06:25 +0200
To: public-semstats@w3.org
Cc: guha@google.com, "franck.cotton@insee.fr" <franck.cotton@insee.fr>
Message-ID: <75b18b11-5586-d819-ffcc-5f3c82684b14@eurecom.fr>

Dear SemStats community group,

I'm relaying this message from Guha sent today on the schema.org mailing
list.

The proposal can also be discussed at
https://docs.google.com/document/d/139jXakeQk4ChwCkGjqq5wJfCPMDnwIV94oCH-JzJrhM/edit?usp=sharing

Raphaël

-------- Forwarded Message --------
Subject: Proposal for representing Aggregate Statistical Data
Resent-Date: Mon, 24 Jun 2019 19:10:01 +0000
Resent-From: public-schemaorg@w3.org
Date: Mon, 24 Jun 2019 12:09:23 -0700
From: Guha <guha@google.com>
To: schema.org Mailing List <public-schemaorg@w3.org>

This document can be accessed here.
<https://docs.google.com/document/d/139jXakeQk4ChwCkGjqq5wJfCPMDnwIV94oCH-JzJrhM/edit?usp=sharing>

Look forward to feedback.

Guha

Representing aggregate statistics

Examples of aggregate statistical reports include those from Census
Organizations (e.g., American Community Survey), Health Organizations
(e.g., CDC Wonder) and many others. This is a schema, currently in use
on DataCommons.org, for representing facts stated in these reports. This
document describes certain general mechanisms for representing
statistical populations and associated observations. This document will
be followed later by a companion proposal suggesting some basic common
vocabulary useful for representing the kind of data released by the US
Census, CDC, etc.

Our interest is not in describing a data set or mapping columns in csv
files, but in representing the actual data itself. Other efforts have
focused on characterizing data cubes in terms of dimensions, etc. While
we draw upon their work, our goals are different.

Examples of the kind of statistics we would like to represent include:

1. In 2016, there were 1213 people in East Podunk, California, who were
male, married, with a median age of 22.
2. In 2017, there were 20 deaths in Falooda County where the cause of
death was XYZ.

We will refer to ‘number of people who are male, Hispanic’, ‘number of
deaths where cause of death was XYZ’, etc. as variables. Since the
number of possible variables grows combinatorially, we clearly cannot
have a property for each variable (or worse, a property for each
variable x year). We need a compositional way of constructing
variable references. We use the concept of a StatisticalPopulation to do
this construction.

A StatisticalPopulation is a set of instances of a given type
that satisfy some set of constraints. The property populationType is
used to specify the type. Any property that can be used on instances of
that type can appear on the statistical population. An instance of
StatisticalPopulation whose populationType is C1, and which has the
properties p1, p2, … with values v1, v2, …, corresponds to the set of
objects of type C1 that have the property p1 with value v1, property p2
with value v2, etc. The properties numConstraints and
constrainingProperties are used to specify which of the population's
properties define the population. For the two examples above:

Node: SP1
type: StatisticalPopulation
populationType: Person
location: EastPodunkCalifornia
gender: Male
maritalStatus: Married
numConstraints: 3
constrainingProperties: location, gender, maritalStatus

Node: SP2
type: StatisticalPopulation
populationType: MortalityEvent
location: FaloodaCounty
causeOfDeath: XYZ
numConstraints: 2
constrainingProperties: location, causeOfDeath
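
As a rough illustration (not part of the proposal), the two nodes above
could be modeled in Python along these lines. The class, the
`constraints` dict, and the `contains` helper are assumptions made for
this sketch; only the property names come from the text, and the
constraints are taken from the properties actually present on each node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StatisticalPopulation:
    """Hypothetical in-memory model of a StatisticalPopulation node."""

    population_type: str
    # Maps each constraining property to the value it must have.
    constraints: Dict[str, str] = field(default_factory=dict)

    @property
    def num_constraints(self) -> int:
        # Derived here rather than stored, so it cannot drift out of sync.
        return len(self.constraints)

    @property
    def constraining_properties(self) -> List[str]:
        return list(self.constraints)

    def contains(self, instance: Dict[str, str]) -> bool:
        """True if a plain-dict instance has the right type and
        satisfies every constraint."""
        if instance.get("type") != self.population_type:
            return False
        return all(instance.get(p) == v for p, v in self.constraints.items())


sp1 = StatisticalPopulation(
    "Person",
    {
        "location": "EastPodunkCalifornia",
        "gender": "Male",
        "maritalStatus": "Married",
    },
)
sp2 = StatisticalPopulation(
    "MortalityEvent",
    {"location": "FaloodaCounty", "causeOfDeath": "XYZ"},
)
```

In this sketch numConstraints and constrainingProperties are computed
from the constraint map; the proposal itself stores them explicitly on
the node.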

SP1 is an abstract set in the sense that it does not correspond to a
particular set of people who satisfy those constraints at a certain point
in time, but rather to an abstract specification, about which we can
make observations that are grounded at a particular point in time. We
now turn our attention to the representation of these observations.

Instances of the class Observation are used to specify observations
about an entity (which may or may not be an instance of
StatisticalPopulation) at a particular time. The principal properties
of an Observation are observedNode, measuredProperty, measuredValue (or
median, etc.) and observationDate. (measuredProperty can, but need not
always, be a W3C RDF Data Cube "measure property", as in the
lifeExpectancy example here:
https://www.w3.org/TR/vocab-data-cube/#dsd-example.) For the two
examples above:

Node: Obs1
type: Observation
observedNode: SP1
measuredProperty: age
median: “23 years”
observationDate: “2016”

Node: Obs2
type: Observation
observedNode: SP1
measuredProperty: count
measuredValue: 1213
observationDate: “2016”

Node: Obs3
type: Observation
observedNode: SP2
measuredProperty: count
measuredValue: 20
observationDate: “2017”
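
The observation nodes above could be sketched in Python as follows. This
is only an illustration: the `Observation` class, the use of node names
as plain strings, and the `lookup` helper are assumptions of the sketch,
not something the proposal defines.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Observation:
    """Hypothetical in-memory model of an Observation node."""

    observed_node: str          # name of the observed node, e.g. "SP1"
    measured_property: str      # e.g. "count" or "age"
    observation_date: str       # kept as a string, as in the examples
    measured_value: Optional[Union[int, float, str]] = None
    median: Optional[str] = None


obs1 = Observation("SP1", "age", "2016", median="23 years")
obs2 = Observation("SP1", "count", "2016", measured_value=1213)
obs3 = Observation("SP2", "count", "2017", measured_value=20)


def lookup(
    observations: List[Observation], node: str, prop: str, date: str
) -> List[Observation]:
    """Return the observations about `node` for `prop` on `date`."""
    return [
        o
        for o in observations
        if o.observed_node == node
        and o.measured_property == prop
        and o.observation_date == date
    ]


counts_2016 = lookup([obs1, obs2, obs3], "SP1", "count", "2016")
```

Keeping the statistic (measuredValue, median, ...) on the observation
rather than on the population is what lets one abstract population carry
many dated measurements.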

Observations can also have properties related to the measurement
technique, margin of error, etc. To elaborate on Obs2 above, we can have:

Node: Obs2
type: Observation
observedNode: SP1
measuredProperty: count
measuredValue: 1213
observationDate: “2016”
marginOfError: 22
measurementMethod: CensusACS5yrSurvey

Notes:
1. Care needs to be exercised when querying StatisticalPopulations, to
make sure that the query specifies all the constraining properties.
2. We do not yet have a way of using properties which are named in the
opposite direction, e.g. we handle "alumniOf" (relating a person to an
org), but could not if the only existing property were "alumni"
(relating an org to a person).
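
To illustrate note 1, here is a minimal Python sketch of why a query
should specify all the constraining properties. The dict layout, the
node name "SP1a" (a hypothetical population of all males in East Podunk,
regardless of marital status), and both helper functions are assumptions
made for this example.

```python
from typing import Dict, List

# Each population is a plain dict; "SP1a" is a hypothetical sibling of SP1
# that constrains only location and gender.
POPULATIONS: List[Dict[str, object]] = [
    {
        "node": "SP1",
        "populationType": "Person",
        "constraints": {
            "location": "EastPodunkCalifornia",
            "gender": "Male",
            "maritalStatus": "Married",
        },
    },
    {
        "node": "SP1a",
        "populationType": "Person",
        "constraints": {
            "location": "EastPodunkCalifornia",
            "gender": "Male",
        },
    },
]


def find_loose(population_type: str, wanted: Dict[str, str]) -> List[str]:
    """Subset match: also returns populations with extra constraints."""
    return [
        p["node"]
        for p in POPULATIONS
        if p["populationType"] == population_type
        and all(p["constraints"].get(k) == v for k, v in wanted.items())
    ]


def find_exact(population_type: str, wanted: Dict[str, str]) -> List[str]:
    """Exact match: the constraint set must equal the query."""
    return [
        p["node"]
        for p in POPULATIONS
        if p["populationType"] == population_type
        and p["constraints"] == wanted
    ]


query = {"location": "EastPodunkCalifornia", "gender": "Male"}
loose = find_loose("Person", query)   # also picks up the married-only SP1
exact = find_exact("Person", query)   # only the population actually asked for
```

The loose match would silently mix counts for "married males" into a
query about "males", which is the kind of mistake that checking
numConstraints and constrainingProperties is meant to prevent.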

Received on Monday, 24 June 2019 20:06:50 UTC