Re: Note on caveats in statistical data from Dan Brickley on 2017-07-14 (public-dxwg-wg@w3.org from July 2017)

From: Dan Brickley <danbri@google.com>
Date: Fri, 14 Jul 2017 08:31:14 +0100
To: Makx Dekkers <mail@makxdekkers.com>
Cc: public-dxwg-wg@w3.org, Will Moy <william.moy@fullfact.org>
Message-ID: <CAK-qy=50qnSoUa_P_28pqJ4on8kJboziiGCZht_uo6UmJE-eiw@mail.gmail.com>
+cc Will, fyi (he may not be able to post to this list, but this saves me
relaying in one direction)

On 14 July 2017 at 08:13, Makx Dekkers <mail@makxdekkers.com> wrote:

> It seems to me that the mention of “an anomalous data point” in the
> transcript implies that they are interested to annotate down to the level
> of individual observations, for example, qb:Observation.
>

Yes


> So, they may need to look at a vocabulary like Data Cube to see how such
> annotations could be included. Maybe dqv:QualityAnnotation
> https://www.w3.org/TR/vocab-dqv/#dqv:QualityAnnotation could help, but
> that is defined on the level of dataset, not for individual observations,
> if I read it right.
>

Yes - fine grained, but I believe caveats are also sometimes applicable
across an entire time series, especially when measuring methodologies, rules
or associated technology/instrumentation shift over time. The example from
Will that stuck with me was that data affected by real world events such as
https://en.wikipedia.org/wiki/The_Shipman_Inquiry can, especially when
considered in aggregate (e.g. as a mean), look beautiful in an interactive
data visualization but tell a fundamentally misleading story unless there is
a caveat/footnote. Previously I had tended to think of caveats in terms of
the more intrinsic properties of the dataset and workflow, and had missed
the (rather open-ended) importance of also noting relevant real world
aspects. Specialist journalists and researchers may be aware of these
"blips" and historical events, but the desire is for that knowledge to be
surfaced and to travel along with the raw data, building confidence in its
reusability and in people's ability to draw and defend actionable
conclusions from it. (Will, I hope, will correct me if I'm putting words
into his mouth.)
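
To make the observation-level versus series-level point concrete, here is a
rough Python sketch of caveats that carry a scope (a single period, an
open-ended range, or the whole series) and that are returned alongside any
aggregate computed over the periods they touch. Everything in it is an
assumption for illustration only: the Caveat class, the mean_with_caveats
helper, the caveat codes and the numbers are made up, and none of it is Full
Fact's actual classification or any existing vocabulary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Caveat:
    kind: str                    # hypothetical code, e.g. "real-world-event"
    note: str                    # human-readable footnote text
    start: Optional[int] = None  # first affected period; None = open-ended
    end: Optional[int] = None    # last affected period; None = open-ended

    def applies_to(self, period: int) -> bool:
        # An open end of the range never excludes a period.
        lo = self.start if self.start is not None else period
        hi = self.end if self.end is not None else period
        return lo <= period <= hi


def mean_with_caveats(series, caveats, first, last):
    """Return the mean over [first, last] plus every caveat touching it."""
    periods = [p for p in sorted(series) if first <= p <= last]
    value = mean(series[p] for p in periods)
    relevant = [c for c in caveats if any(c.applies_to(p) for p in periods)]
    return value, relevant


# Made-up yearly figures: one anomalous year, then a methodology change.
series = {2000: 10.0, 2001: 10.0, 2002: 40.0, 2003: 10.0}
caveats = [
    Caveat("real-world-event", "Figures inflated by a one-off event.",
           start=2002, end=2002),
    Caveat("methodology-change", "Counting rules changed from this year on.",
           start=2003),
]

value, notes = mean_with_caveats(series, caveats, 2000, 2003)
print(value, [c.kind for c in notes])
```

With these invented numbers the mean over 2000-2003 is 17.5 rather than the
10.0 seen in the unaffected years, and both caveats come back with it, which
is the "travel along with the data" behaviour described above.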


> The statistical people themselves are doing stuff around XKOS with
> Explanatory notes, see
> http://www.ddialliance.org/Specification/XKOS/1.0/OWL/xkos.html#note-ext.
>

Yes, that looks like a possible carrier for this sort of information. I
don't see a specific code list there for the distinctions Will alludes to
(real world anomalies versus data recording anomalies, etc.), but it
provides a sensible SKOS-based representation that we could use to capture
the list in a way that would be re-usable in Data Cube, CSVW et al. Would
this WG or a Community Group be a good place to turn such a list into this
kind of representation?
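
As a very rough idea of what "turning such a list into this kind of
representation" might look like, the Python sketch below writes a small
caveat-type list out as a SKOS concept scheme in Turtle. The four codes are
only my paraphrase of the distinctions mentioned in this thread, the
http://example.org/caveat/ namespace is a placeholder, and to_turtle is a
hypothetical helper; the real list would come from Full Fact's work. It uses
plain SKOS terms (skos:ConceptScheme, skos:Concept, skos:inScheme,
skos:notation, skos:prefLabel) and does not attempt the XKOS note extensions.

```python
# Illustrative caveat types: code -> English label (both are assumptions).
CAVEAT_TYPES = {
    "anomalous-data-point": "Anomalous data point",
    "methodology-change": "Change of methodology",
    "real-world-event": "Real world event affecting the figures",
    "data-recording-anomaly": "Anomaly in how the data was recorded",
}

BASE = "http://example.org/caveat/"  # placeholder namespace, not a real one


def to_turtle(codes, base=BASE):
    """Serialize a code -> label mapping as a SKOS concept scheme in Turtle.

    Labels are assumed to contain no double quotes or backslashes, so no
    string escaping is done here.
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{base}scheme> a skos:ConceptScheme ;",
        '    skos:prefLabel "Caveat types (illustrative)"@en .',
    ]
    for code, label in codes.items():
        lines += [
            "",
            f"<{base}{code}> a skos:Concept ;",
            f"    skos:inScheme <{base}scheme> ;",
            f'    skos:notation "{code}" ;',
            f'    skos:prefLabel "{label}"@en .',
        ]
    return "\n".join(lines) + "\n"


print(to_turtle(CAVEAT_TYPES))
```

Once each caveat type has a URI like this, a Data Cube observation or a CSVW
column note could in principle point at it, though which property would carry
that link is exactly the open question here.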

Dan



>
>
> Makx.
>
>
>
>
>
>
>
> *From:* Dan Brickley [mailto:danbri@google.com]
> *Sent:* 14 July 2017 01:05
> *To:* public-dxwg-wg@w3.org
> *Subject:* Note on caveats in statistical data
>
>
>
> Hi. I thought https://www.youtube.com/watch?v=cLMbrzI5p6s might be of
> interest to the WG. It's a 30 second video from a chat today at Full Fact
> (UK fact checking charity), with Andy Dudfield from the UK's Office for
> National Statistics. Andy, Will Moy, Mevan Babakar and I discussed the
> importance of making sure that caveats of various kinds travel along with
> the different data format representations of statistical data. Full Fact
> have done some work in this direction and would be interested in
> conversations on how it might plug into standards (e.g. CSVW, DCAT,
> Schema.org etc).
>
>
>
> I've also just transcribed the video, so here's the text version:
>
>
>
> (Will Moy) "[re statistical data]... full of numbers, ... what I want to
> go along with that is a list of things I need to know about those numbers
> in order to be able to re-use them. And I want those to be organized so
> instead of just getting a long list of footnotes, those footnotes are
> classified into the type of caveat it is. So we did a piece of work which
> is what kind of caveats exist. So - is it an anomalous data point or is
> it that we changed the methodology or whatever, ... classify it that way,
> in a machine readable way using a standardized code list so a computer has
> a reasonable chance of being able to reason about what those numbers can
> do."
>
>
>
> I'll share more details of this work as I find out more but it seemed
> worth making a quick note first.
>
>
>
> cheers,
>
>
>
> Dan
>
Received on Friday, 14 July 2017 07:31:42 UTC