
Re: Note on caveats in statistical data

From: Phil Archer <phila@w3.org>
Date: Fri, 14 Jul 2017 09:18:54 +0100
To: Dan Brickley <danbri@google.com>, Makx Dekkers <mail@makxdekkers.com>
Cc: public-dxwg-wg@w3.org, Bill Roberts <bill@swirrl.com>
Message-ID: <a68462b4-6529-7c92-efdc-9455a0165bbe@w3.org>
+ Bill Roberts

With my faded W3C hat:

Adding Bill to this thread. All being well, he'll be working on a 
Statistical Data on the Web BP doc later this year, I believe working 
with ONS through an EU project. See the proposed charter for the 
continuation of the W3C/OGC collaboration [1].

Without any hat:

The idea of crowd-sourced data scares people who sit on authoritative 
data since:
- it's a threat to their business of selling authoritative data;
- it might actually be better than theirs, which makes them look bad;
and - the one relevant here:
- they want a way to say "hang on, that's not right, this is, and here's 
how we know."

So this discussion is relevant to that. And yes, a way to point to data 
points at the item level and offer a correction or at least an 
annotation would be important.
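
As a rough illustration only, item-level annotation with a classified caveat 
list might be modelled along the lines below. This is a sketch under stated 
assumptions, not anything agreed in this thread: the `CaveatType` values, the 
`Caveat` class, the `caveats_for` helper and the observation identifiers are 
all hypothetical, with the categories simply echoing the ones mentioned in the 
quoted discussion.

```python
from dataclasses import dataclass
from enum import Enum


class CaveatType(Enum):
    # Hypothetical code list; the categories echo those named in this thread.
    ANOMALOUS_DATA_POINT = "anomalous-data-point"
    METHODOLOGY_CHANGE = "methodology-change"
    REAL_WORLD_EVENT = "real-world-event"


@dataclass(frozen=True)
class Caveat:
    caveat_type: CaveatType
    note: str
    # Identifiers of the observations the caveat applies to. One caveat can
    # cover a single data point or a whole run of a time series.
    observations: frozenset


def caveats_for(observation_id, caveats):
    """Return the caveats that apply to one observation, in list order."""
    return [c for c in caveats if observation_id in c.observations]


# Made-up identifiers and notes, for illustration only.
caveats = [
    Caveat(
        CaveatType.METHODOLOGY_CHANGE,
        "Counting rules changed from this period on.",
        frozenset({"obs/2003", "obs/2004"}),
    ),
    Caveat(
        CaveatType.ANOMALOUS_DATA_POINT,
        "Figure affected by a one-off event.",
        frozenset({"obs/2003"}),
    ),
]

for caveat in caveats_for("obs/2003", caveats):
    print(caveat.caveat_type.value, "-", caveat.note)
```

Keeping the caveat type as a value from a fixed list, separate from the 
free-text note, is what would let software filter or group caveats rather 
than just display footnotes.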

Phil


[1] http://w3c.github.io/sdw/jwoc/

On 14/07/2017 08:31, Dan Brickley wrote:
> +cc Will fyi (who may not be able to post to this list but saves me
> relaying in one direction)
> 
> On 14 July 2017 at 08:13, Makx Dekkers <mail@makxdekkers.com> wrote:
> 
>> It seems to me that the mention of “an anomalous data point” in the
>> transcript implies that they are interested in annotating down to the
>> level of individual observations, for example, qb:Observation.
>>
> 
> Yes
> 
> 
>> So, they may need to look at a vocabulary like Data Cube to see how such
>> annotations could be included. Maybe dqv:QualityAnnotation
>> https://www.w3.org/TR/vocab-dqv/#dqv:QualityAnnotation could help, but
>> that is defined on the level of dataset, not for individual observations,
>> if I read it right.
>>
> 
> Yes - fine grained but also I believe sometimes applicable across an entire
> time series especially when measuring methodologies, rules or associated
> technology/instrumentation shift with time. The example from Will that
> stuck with me was that data affected by real world events such as
> https://en.wikipedia.org/wiki/The_Shipman_Inquiry can, especially when
> considered in aggregate (e.g. as a mean), look beautiful in an interactive
> data visualization but tell a fundamentally misleading story unless there
> is a caveat/footnote. Previously I had tended to think of caveats in terms
> of the more intrinsic properties of the dataset and workflow and had missed
> the (rather open-ended) importance of also noting relevant real world
> aspects. Specialist journalists and researchers may be aware of these
> "blips" and historical events but the desire is for that knowledge to be
> surfaced and travel along with the raw data, building confidence in its
> reusability and in people's ability to draw and defend actionable
> conclusions from it. (Will, I hope, will correct me if I'm putting words
> into his mouth.)
> 
> 
>> The statistical people themselves are doing stuff around XKOS with
>> Explanatory notes, see
>> http://www.ddialliance.org/Specification/XKOS/1.0/OWL/xkos.html#note-ext.
>>
> 
> Yes, that looks like a possible carrier for this sort of information. I
> don't see there a specific code list for the distinctions Will alludes to
> (real world anomalies versus data recording anomalies etc etc.) but it
> provides a sensible SKOS-based representation that we could use to capture
> the list in a way that would be re-usable in Data Cube, CSVW et al. Would
> this WG or a Community Group be a good place to turn such a list into this
> kind of representation?
> 
> Dan
> 
> 
> 
>>
>>
>> Makx.
>>
>> *From:* Dan Brickley [mailto:danbri@google.com]
>> *Sent:* 14 July 2017 01:05
>> *To:* public-dxwg-wg@w3.org
>> *Subject:* Note on caveats in statistical data
>>
>>
>>
>> Hi. I thought https://www.youtube.com/watch?v=cLMbrzI5p6s might be of
>> interest to the WG. It's a 30 second video from a chat today at Full Fact
>> (UK fact checking charity), with Andy Dudfield from the UK's Office for
>> National Statistics. Andy, Will Moy, Mevan Babakar and I discussed the
>> importance of making sure that caveats of various kinds travel along with
>> the different data format representations of statistical data. Full Fact
>> have done some work in this direction and would be interested in
>> conversations on how it might plug into standards (e.g. CSVW, DCAT,
>> Schema.org etc).
>>
>>
>>
>> I've also just transcribed the video, so here's the text version:
>>
>>
>>
>> (Will Moy) "[re statistical data]... full of numbers, ... what I want to
>> go along with that is a list of things I need to know about those numbers
>> in order to be able to re-use them. And I want those to be organized so
>> instead of just getting a long list of footnotes, those footnotes are
>> classified into the type of caveat it is. So we did a piece of work which
>> is what kind of caveats exist. So - is it an anomalous data point or is
>> it that we changed the methodology or whatever, ... classify it that way,
>> in a machine readable way using a standardized code list so a computer has
>> a reasonable chance of being able to reason about what those numbers can
>> do."
>>
>>
>>
>> I'll share more details of this work as I find out more but it seemed
>> worth making a quick note first.
>>
>>
>>
>> cheers,
>>
>>
>>
>> Dan
>>
> 
Received on Friday, 14 July 2017 08:19:08 UTC
