
Re: what's a dataset? -- Re: Public Data Catalog Priorities and Demand

From: David Pullinger <David.Pullinger@coi.gsi.gov.uk>
Date: Tue, 22 Dec 2009 15:31:13 +0000
Message-Id: <4B30E63C.9179.0047.0@coi.gsi.gov.uk>
To: <Niemann.Brand@epamail.epa.gov>, "Jose Manuel Alonso" <josema.alonso@fundacionctic.org>
Cc: "Joe Carmel" <joe.carmel@comcast.net>, "'Steven Clift'" <clift@e-democracy.org>, "Antti Poikola" <antti.poikola@gmail.com>, <chris-beer@grapevine.net.au>, <sunlightlabs@groups.google.com>, "Suzanne' 'Acar" <suzanne.acar@ic.fbi.gov>, "'Jonathan Gray'" <jonathan.gray@okfn.org>, <public-egov-ig@w3.org>
I found that helpful.  From my experience of seeing statistical data
gathered together into coherent datasets, I have been musing on a
description along these lines, building up from a single datapoint
through larger units:
- data  (e.g. 24)
- metadata about that data (e.g. number of additional deaths in England
due to sub-zero temperatures in December 2009)
- bibliographic data (e.g. published January 2010, Office for Health
Statistics, subject)
- contextual metadata (author, contact, etc.)
- dataset (set of items like those above, e.g. deaths due to
environmental causes)
- bibliographic data on dataset (dataset published, organisation,
- contextual metadata dataset (author, contact, for dataset)
...with that data and metadata, of course, being structured in RDF(a)
or some equivalent.
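
As a rough illustration of that layering, here is a minimal Python sketch. The class and field names are hypothetical (they come from no agreed vocabulary), and the example values are the ones given in the list above; RDF(a) would express the same relationships as triples rather than nested objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bibliographic:
    """Bibliographic data: when and by whom something was published."""
    published: str          # e.g. "January 2010"
    organisation: str       # e.g. "Office for Health Statistics"
    subject: str = ""       # left empty when not known


@dataclass
class Context:
    """Contextual metadata: who to ask about the data."""
    author: str = ""
    contact: str = ""


@dataclass
class DataPoint:
    value: float            # the data itself, e.g. 24
    description: str        # metadata saying what the value measures
    bibliographic: Bibliographic
    context: Context = field(default_factory=Context)


@dataclass
class Dataset:
    """A set of data points, carrying its own bibliographic and contextual data."""
    title: str
    bibliographic: Bibliographic
    context: Context = field(default_factory=Context)
    items: List[DataPoint] = field(default_factory=list)


biblio = Bibliographic("January 2010", "Office for Health Statistics")
point = DataPoint(
    value=24,
    description=("number of additional deaths in England due to "
                 "sub-zero temperatures in December 2009"),
    bibliographic=biblio,
)
dataset = Dataset("Deaths due to environmental causes", biblio, items=[point])
```

The point of the sketch is only that the same two kinds of description (bibliographic and contextual) recur at both the datapoint level and the dataset level.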

Building on this, the following is a prompt list I drew up to help
people across government identify data that could be usefully put into
re-usable form for third parties:

Types of data that might be helpfully sought out (bearing in mind
information and data might fit into a number of these types):
A  Lists – especially where these are reference lists (i.e. that are
used by others as source lists)
Examples:  Ministers, Government Departments, Public Bodies, Regions,
Local government bodies, hospitals, schools, courts,  dogs classed as
B  Point data regularly issued (time series)
Examples:  Average class size, hospital waiting lists, Gross Domestic
Product, violent crime, public service performance, environment quality,

C Policy information 
Examples:  tax bands, benefit determination criteria
D  Datasets collected at one point in time
Examples:  Population census, surveys, research
E Information containing data of interest that is regularly published 
Examples:  Statutory notices, job vacancies, consultations,
legislation, contractual opportunities, press releases
F Data associated with location (i.e. any information with a
geographical location, whether or not it also has other dimensions
such as time)
Examples:  traffic information, roadworks, Ministerial visits, planning
applications, locations of public transport (trains, buses, trams,
ferries, etc), address files (non-personal).
Some of these would have a 'dataset' that is a time series, others a
list etc. I agree the key is having an ontology that relates the
different parts of a dataset in the way that Brand describes.
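
Since a given source may fit several of the types A to F, a catalogue entry could carry a combination of type tags rather than a single one. A minimal Python sketch follows; the names are hypothetical, and the tagging of planning applications as both location data and regularly published information is an assumption made purely for illustration.

```python
from enum import Flag, auto


class DataType(Flag):
    REFERENCE_LIST = auto()        # A: lists used by others as source lists
    TIME_SERIES = auto()           # B: point data regularly issued
    POLICY = auto()                # C: policy information
    SNAPSHOT = auto()              # D: collected at one point in time
    REGULAR_PUBLICATION = auto()   # E: regularly published information
    LOCATION = auto()              # F: data associated with location


# Assumed tagging, for illustration only: one source, two types.
planning_applications = DataType.LOCATION | DataType.REGULAR_PUBLICATION

# Membership tests show which types the source has been tagged with.
print(DataType.LOCATION in planning_applications)
print(DataType.POLICY in planning_applications)
```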
Best seasonal greetings,
David Pullinger
Head of Digital Policy
Central Office of Information
Hercules House
7 Hercules Road
London SE1 7DU
020 7261 8513
07788 872321
Twitter #digigov and blogs:  www.coi.gov.uk/blogs/digigov

>>> <Niemann.Brand@epamail.epa.gov> 22/12/2009 13:56 >>>

Jose, This is the way I look at this - well-constructed data tables
consist of a combination of data elements that subject matter experts /
statisticians agree make sense together (not apples and oranges as we
say) and databases consist of multiple data tables that make sense
together, even better have an ontology that relates them and all their
data elements.

This is what I have been recommending for Data.gov for some time now.

Best wishes for the holiday season. Brand

-----public-egov-ig-request@w3.org wrote: -----

To: chris-beer@grapevine.net.au 
From: Jose Manuel Alonso <josema.alonso@fundacionctic.org>
Sent by: public-egov-ig-request@w3.org 
Date: 12/21/2009 01:33PM
cc: "Antti Poikola" <antti.poikola@gmail.com>, "Joe Carmel"
<joe.carmel@comcast.net>, "'Jonathan Gray'" <jonathan.gray@okfn.org>,
"'Steven Clift'" <clift@e-democracy.org>, public-egov-ig@w3.org,
sunlightlabs@groups.google.com, "'Acar, Suzanne'"
Subject: what's a dataset? -- Re: Public Data Catalog Priorities and
Demand

>> * If there is, let's say some thousand, datasets in data.gov, is
>> there any analysis or wild guesses of how many is missing 10 000,
>> 50 000, 100 000, 500 000?
> I'd say most, but that there is probably not as many as you'd think.
> And I'm including data.gov.* in that. (We have to as a group remain
> international in focus :) ) Many seem to include "views" of data in
> the term "missing datasets" - I think that if one could identify what
> datasets are primary (something I'll expand on once I find what you
> meant above), then we could generate a lot of other datasets from
> these. I guess what I'm saying is it's probably just as important to
> ask how many datasets out there are dependent on other datasets for
> their data.

Ok, so I told you this one deserved its own separate message. This is
something we've been discussing at CTIC for quite a while: what is a
dataset, how would you define it? How would you count how many you've


Is the "2005 Toxics Release Inventory data for the state of Alaska"  
one dataset?
Is "Toxics Release Inventory data for the state of Alaska" one dataset?
Is "Toxics Release Inventory data for all the states" one dataset?

If all of the above are datasets (even if not), how many is data.gov  

In one of the projects I'm currently involved in, the government is  
about to publish information about all the public buildings. Is this  
one dataset?

What if the government publishes just the information of the public  
schools? One dataset?
Then, the one about hospitals... one dataset?
But these two types of buildings (and several other types) are part of
the big dataset, so is this really a dataset or a subset of the big
one? How many should I count? One? Three?

Unfortunately, I believe I don't have a good answer. I tried for a
long while, telling myself a dataset should be anything that is
meaningful as a separate entity and that datasets can be combined into
super-datasets. Example: public schools is one dataset, hospitals is
another one, public buildings is another one, but are those three
datasets? hmm... maybe we should only count the smaller ones?
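
To make the two counting conventions concrete, here is a minimal Python sketch. The CatalogEntry class and both counting functions are hypothetical, written only to illustrate the question; the example tree is the public buildings case above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """A dataset that may be composed of smaller datasets."""
    name: str
    parts: List["CatalogEntry"] = field(default_factory=list)


def count_all(entry: CatalogEntry) -> int:
    """Count every entry, super-datasets included."""
    return 1 + sum(count_all(part) for part in entry.parts)


def count_smallest(entry: CatalogEntry) -> int:
    """Count only entries that have no parts (the 'smaller ones')."""
    if not entry.parts:
        return 1
    return sum(count_smallest(part) for part in entry.parts)


buildings = CatalogEntry(
    "public buildings",
    [CatalogEntry("public schools"), CatalogEntry("hospitals")],
)

print(count_all(buildings))       # 3
print(count_smallest(buildings))  # 2
```

The same catalogue gives three or two depending on the convention, which is why a headline number means little unless the convention is stated.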

What if instead of hospitals, we talk about "healthcare related
centers" such as: hospitals, ER, GPs, Dentists, Pharmacies, Opticians
(taken from NHS.UK). Hey, do we now have six datasets? Or just a big
one, and the six smaller ones are just "a class of" the big one...

Btw, does the number really matter? Or would it be better just to
catalog in terms of knowledge areas?

Unless we (at large) can agree on what a dataset is and how datasets
should be counted, I believe talking about numbers makes little sense.

Let the discussion (go on) begin... :)

-- Jose

Received on Tuesday, 22 December 2009 15:32:49 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:00:42 UTC