- From: David Pullinger <David.Pullinger@coi.gsi.gov.uk>
- Date: Tue, 22 Dec 2009 15:31:13 +0000
- To: <Niemann.Brand@epamail.epa.gov>, "Jose Manuel Alonso" <josema.alonso@fundacionctic.org>
- Cc: "Joe Carmel" <joe.carmel@comcast.net>, "'Steven Clift'" <clift@e-democracy.org>, "Antti Poikola" <antti.poikola@gmail.com>, <chris-beer@grapevine.net.au>, <sunlightlabs@groups.google.com>, "'Acar, Suzanne'" <suzanne.acar@ic.fbi.gov>, "'Jonathan Gray'" <jonathan.gray@okfn.org>, <public-egov-ig@w3.org>
- Message-Id: <4B30E63C.9179.0047.0@coi.gsi.gov.uk>
Brand,

I found that helpful. From my experience of seeing statistical data gathered together into coherent datasets, I was musing on a description along these lines, expanding up from a single datapoint through larger units:

- data (e.g. 24)
- metadata about that data (e.g. number of additional deaths in England due to sub-zero temperatures in December 2009)
- bibliographic data (e.g. published January 2010, Office for Health Statistics, subject)
- contextual metadata (author, contact, etc.)
- dataset (set of items like those above, e.g. deaths due to environmental causes)
- bibliographic data on dataset (dataset published, organisation, etc.)
- contextual metadata on dataset (author, contact, for dataset)

...with that data and metadata, of course, being structured in RDF(a) or some equivalent.

Building on this, the following is a prompt list I drew up to help people across government identify data that could usefully be put into re-usable form for third parties:

Types of data that might be helpfully sought out (bearing in mind that information and data might fit into a number of these types):

A Lists – especially where these are reference lists (i.e. that are used by others as source lists)
Examples: Ministers, Government Departments, Public Bodies, Regions, Local government bodies, hospitals, schools, courts, dogs classed as dangerous

B Point data regularly issued (time series)
Examples: Average class size, hospital waiting lists, Gross Domestic Product, violent crime, public service performance, environment quality

C Policy information
Examples: tax bands, benefit determination criteria

D Datasets collected at one point in time
Examples: Population census, surveys, research

E Information containing data of interest that is regularly published
Examples: Statutory notices, job vacancies, consultations, legislation, contractual opportunities, press releases

F Data associated with location (i.e.
any information with a geographical location, whether or not it also has other dimensions such as time)
Examples: traffic information, roadworks, Ministerial visits, planning applications, locations of public transport (trains, buses, trams, ferries, etc.), address files (non-personal)

Some of these would have a 'dataset' that is a time series, others a list, etc.

I agree the key is having an ontology that relates the different parts of a dataset in the way that Brand describes.

Best seasonal greetings,
David

David Pullinger
david.pullinger@coi.gsi.gov.uk
Head of Digital Policy
Central Office of Information
Hercules House
7 Hercules Road
London SE1 7DU
020 7261 8513
07788 872321
Twitter #digigov and blogs: www.coi.gov.uk/blogs/digigov

>>> <Niemann.Brand@epamail.epa.gov> 22/12/2009 13:56 >>>
Jose,

This is the way I look at this: well-constructed data tables consist of a combination of data elements that subject matter experts / statisticians agree make sense together (not apples and oranges, as we say), and databases consist of multiple data tables that make sense together; even better, have an ontology that relates them and all their data elements. This is what I have been recommending for Data.gov for some time now.

Best wishes for the holiday season.

Brand

-----public-egov-ig-request@w3.org wrote: -----
To: chris-beer@grapevine.net.au
From: Jose Manuel Alonso <josema.alonso@fundacionctic.org>
Sent by: public-egov-ig-request@w3.org
Date: 12/21/2009 01:33PM
cc: "Antti Poikola" <antti.poikola@gmail.com>, "Joe Carmel" <joe.carmel@comcast.net>, "'Jonathan Gray'" <jonathan.gray@okfn.org>, "'Steven Clift'" <clift@e-democracy.org>, public-egov-ig@w3.org, sunlightlabs@groups.google.com, "'Acar, Suzanne'" <suzanne.acar@ic.fbi.gov>
Subject: what's a dataset?
-- Re: Public Data Catalog Priorities and Demand

>> * If there is, let's say some thousand, datasets in data.gov, is there
>> any analysis or wild guesses of how many is missing 10 000, 50 000,
>> 100 000, 500 000?
>
> I'd say most, but that there is probably not as many as you'd think. And
> I'm including data.gov.* in that. (We have to as a group remain
> international in focus :) ) Many seem to include "views" of data in the
> term "missing datasets" - I think that if one could identify what datasets
> are primary (something I'll expand on once I find what you meant above),
> then we could generate a lot of other datasets from these. I guess what
> I'm saying is it's probably just as important to ask how many datasets out
> there are dependent on other datasets for their data.

Ok, so I told you this one deserved its own separate message.

This is something we've been discussing at CTIC for quite a while: what is a dataset, how would you define it? How would you count how many you've published?

Is the "2005 Toxics Release Inventory data for the state of Alaska" one dataset? Is "Toxics Release Inventory data for the state of Alaska" one dataset? Is "Toxics Release Inventory data for all the states" one dataset? If all of the above are datasets (even if not), how many is data.gov publishing?

In one of the projects I'm currently involved in, the government is about to publish information about all the public buildings. Is this one dataset? What if the government publishes just the information on the public schools? One dataset? Then, the one about hospitals... one dataset? But these two types of buildings (and several other types) are part of the big dataset, so is this really a dataset or a subset of the big one? How many should I count? One? Three?

Unfortunately, I believe I don't have a good answer.
I tried for a long while, telling myself a dataset should be anything that is meaningful as a separate entity and that datasets can be combined into super-datasets. Example: public schools is one dataset, hospitals is another one, public buildings is another one, but are those three datasets? Hmm... maybe we should only count the smaller ones? What if instead of hospitals, we talk about "healthcare related centers" such as: hospitals, ER, GPs, Dentists, Pharmacies, Opticians (taken from NHS.UK). Hey, do we now have six datasets? Or just a big one, and the six smaller ones are just "a class of" the big one...

Btw, does the number really matter? Or should we rather catalog in terms of knowledge areas?

Unless we (at large) can agree on what a dataset is and how datasets should be counted, I believe talking about numbers does not make much sense.

Let the discussion (go on) begin... :)

-- Jose
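
To make the counting question in this thread concrete, here is a minimal sketch in Python. It assumes the layering David describes (a value, its metadata, bibliographic data, contextual metadata, then a dataset carrying the same layers) and Jose's public-buildings example. The class and function names (`DataPoint`, `Dataset`, `count_all`, `count_leaves`) are hypothetical and come from no message in the thread. In practice the structure would be expressed in RDF(a) or an equivalent, as David notes; plain classes are used only to keep the illustration short.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    """A single value plus the metadata layers that describe it."""
    value: float
    metadata: str                                       # what the value measures
    bibliographic: dict = field(default_factory=dict)   # e.g. publication date, publisher
    contextual: dict = field(default_factory=dict)      # e.g. author, contact


@dataclass
class Dataset:
    """A titled set of data points that may also contain sub-datasets."""
    title: str
    items: list = field(default_factory=list)           # DataPoint objects
    subsets: list = field(default_factory=list)         # nested Dataset objects
    bibliographic: dict = field(default_factory=dict)
    contextual: dict = field(default_factory=dict)


def count_all(ds):
    """Count this dataset and every dataset nested inside it."""
    return 1 + sum(count_all(s) for s in ds.subsets)


def count_leaves(ds):
    """Count only datasets that have no sub-datasets (the 'smaller ones')."""
    if not ds.subsets:
        return 1
    return sum(count_leaves(s) for s in ds.subsets)


# David's example datapoint, with two of its metadata layers filled in.
point = DataPoint(
    value=24,
    metadata="additional deaths in England due to sub-zero temperatures in December 2009",
    bibliographic={"published": "January 2010", "publisher": "Office for Health Statistics"},
)

# Jose's example: schools and hospitals are both parts of public buildings.
schools = Dataset("public schools")
hospitals = Dataset("hospitals")
buildings = Dataset("public buildings", subsets=[schools, hospitals])

print(count_all(buildings))     # 3: the big dataset plus both subsets
print(count_leaves(buildings))  # 2: only the smaller ones
```

Under these assumptions the same catalogue yields one, two or three datasets depending on whether only the top level, only the leaves, or every node is counted. That is the ambiguity Jose raises, and it suggests any published count should state which rule it uses.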
Received on Tuesday, 22 December 2009 15:32:49 UTC