Re: what's a dataset? -- Re: Public Data Catalog Priorities and Demand from Chris Beer on 2010-01-04 (public-egov-ig@w3.org from January 2010)

From: Chris Beer <chris-beer@grapevine.net.au>
Date: Mon, 04 Jan 2010 23:22:36 +1100
To: David Pullinger <David.Pullinger@coi.gsi.gov.uk>
CC: Niemann.Brand@epamail.epa.gov, Jose Manuel Alonso <josema.alonso@fundacionctic.org>, Joe Carmel <joe.carmel@comcast.net>, 'Steven Clift' <clift@e-democracy.org>, Antti Poikola <antti.poikola@gmail.com>, sunlightlabs@groups.google.com, 'Suzanne Acar' <suzanne.acar@ic.fbi.gov>, 'Jonathan Gray' <jonathan.gray@okfn.org>, public-egov-ig@w3.org
Message-ID: <4B41DD8C.9020805@grapevine.net.au>
Hey all

Hope everyone had a great holiday season! Well, time has passed, and my 
original reply to this message that is still in draft and was never sent 
off would make little sense now the discussion has progressed - I was 
threatened with various painful events if I even made a passing glance 
at work stuff (W3C especially included) during the Christmas break  ;-) 
Most of my reply was using Antti's original questions as a reference 
point - relevant, but we've moved to more abstract examples. But Antti - 
I must thank you - I now know that there are something like 187,888 
lakes in Finland, and I love little random facts like that :-)

So anyway - on with the discussion. I feel that Brand makes a very good 
point, and a very succinct one at that. And when I take Jose's original 
conundrum, combine it with Brand's reply, and then David's follow-on, 
I can see a nice definition starting to form - which leads me in a full 
circle back to Jose - we've talked datapoints, datasets, databases - all 
the things you'd expect we'd cover. And as I read Jose's message again, 
something strikes me that we are missing from the equation.

Jose said " I tried for a long while, telling myself a dataset should be 
anything that is meaningful as a separate entity and that datasets can 
be combined into super-datasets."

Well why the hell not?! :) What we're talking about here, at the core, 
is normalisation. We do it for datapoints, datasets and databases - and 
now that we are linking up .gov.* data and trying to determine issues 
such as provenance, trust etc - perhaps we, or rather, the .gov.* data 
crowd, should take to data normalisation on a larger scale - viewing the 
GD as a whole and removing replication etc to try and obtain a minimum 
number of databases needed to provide all the data. Perhaps also we need 
to step back from the concept of the dataset and define the database a 
little - if a Government links all of its data together perfectly, does 
that in itself create a database? Or from another view - is data.gov.* 
in and of itself, a database?

Defining a dataset should on the face of it, be easy - a dataset is a... 
wait for it.... a Set. Of Data. :-P  Okay - brevity aside though - it 
could be as simple as that. As Paul Beynon-Davies basically puts it - 
Data, and by extension, a dataset, cannot convey meaning. It can't tell 
you anything in and of itself. To turn a dataset (raw symbols) into 
*Information* (ie: symbols that refer to something), you need to analyse 
it somehow - that is, assign it some meaning. It's the *Information* 
that the public commonly see and define as "data" or a "dataset" - a 
particular view of data within a database that conveys meaning.

So really we only need to define a) Data and b) a Database and we thus 
automatically define c) a Dataset - that is - any amount of data (any 
more than 1?) extracted from a database in any structural form (list, 
point data, policy info etc) that is then used to present a particular 
view (query) of the data by assigning meaning to it (analysis) - exactly 
as David describes it - it's just a particular set of data items. There 
seems to be no reason that a dataset can't be created dynamically as 
needed, providing that the database, or rather, the actual data, is 
persistent.
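
To make that last point concrete, here is a minimal Python sketch. It is 
not anything Data.gov or anyone on this thread actually runs. It assumes 
a single persistent store (an in-memory SQLite table standing in for the 
database) and treats each dataset as a query over it. The table name 
public_buildings, the sample rows and the dataset() helper are all made 
up for illustration, borrowing Jose's schools/hospitals example.

```python
import sqlite3


def build_database():
    """Create an in-memory stand-in for the persistent database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE public_buildings (name TEXT, kind TEXT, region TEXT)"
    )
    # Invented sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO public_buildings VALUES (?, ?, ?)",
        [
            ("Northside Primary", "school", "North"),
            ("Riverside High", "school", "South"),
            ("General Hospital", "hospital", "North"),
            ("Central Court", "court", "South"),
        ],
    )
    return conn


def dataset(conn, kind=None):
    """Return a dataset: one particular view (query) of the stored data."""
    if kind is None:
        # No filter: the "super-dataset" of all public buildings.
        rows = conn.execute(
            "SELECT name, kind, region FROM public_buildings ORDER BY name"
        )
    else:
        # Filtered view: a smaller dataset built on demand from the same data.
        rows = conn.execute(
            "SELECT name, kind, region FROM public_buildings "
            "WHERE kind = ? ORDER BY name",
            (kind,),
        )
    return rows.fetchall()


conn = build_database()
schools = dataset(conn, "school")
hospitals = dataset(conn, "hospital")
all_buildings = dataset(conn)
```

Here schools and hospitals are each a dataset, and so is all_buildings. 
None of them is stored separately, so counting them says little about 
how much data is actually held.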

Thoughts? Are we oversimplifying things? Personally I think it should be 
as easy as Brand and Paul describe.

Cheers

Chris
Canberra, Australia

(btw - Beynon-Davies bio:

"Paul Beynon-Davies is currently Professor of organisational informatics 
in the Cardiff Business School at Cardiff University. He received his 
BSc in Economics and Social Science and PhD in Computing from University 
of Wales College, Cardiff. He is currently a member of the British 
Computer Society, the Association of Information Systems and the UK 
Academy of Information Systems. Before taking up an academic post he 
worked for several years in the IT industry in the UK both in the public 
and private sectors. Prof. Beynon-Davies has published widely in the 
field of information systems, information management and information 
technology. He has currently published nine books, numerous academic 
papers (Journal papers, Conference Papers) and professional articles on 
topics ranging from the nature of informatics, electronic business, 
electronic government, information systems planning, information systems 
development and database systems. Paul Beynon-Davies still regularly 
acts as a consultant to the public and private sector particularly in 
the area of information and communications technology (ICT) and its 
impact on organisational performance. Over the last decade he has 
engaged in a number of government-funded projects related to the impact 
of ICT on the economic, social and political spheres including an 
evaluation of electronic local government in Wales, the evaluation of the 
National Assembly for Wales' Cymru-ar-Lein/Information Age strategy for 
Wales. Between 2006 and 2008 he was director of the eCommerce Innovation 
Centre at Cardiff University which included the Broadband Observatory 
for Wales."

So apparently a bit of a whiz when it comes to this e-government stuff  
;)  Recruit him to the IG. Now. Seriously.

)




David Pullinger wrote:
> Brand,
>  
> I found that helpful.  From my experience at seeing statistical data 
> being gathered together into coherent datasets, I was musing on a 
> description that goes along these lines, expanding up from a single 
> datapoint through larger units:
>  
> - data  (e.g. 24)
> - metadata about that data (e.g. number of additional deaths in 
> England due to sub-zero temperatures in December 2009)
> - bibliographic data (e.g. published January 2010, Office for Health 
> Statistics, subject)
> - contextual metadata (author, contact, etc.)
> - dataset (set of items like those above, e.g. deaths due to 
> environmental causes)
> - bibliographic data on dataset (dataset published, organisation, etc.)
> - contextual metadata dataset (author, contact, for dataset)
>  
> ...with that data and metadata, of course, being structured in RDF(a) 
> or some equivalent.
>
> Building on this, the following is a prompt list I drew up to help 
> people across government identify data that could be usefully put into 
> re-usable form for third parties:
>  
>
> Types of data that might be helpfully sought out (bearing in mind 
> information and data might fit into a number of these types):
>
>  
>
> A  *Lists* – especially where these are reference lists (i.e. that are 
> used by others as source lists)
>
> Examples:  Ministers, Government Departments, Public Bodies, Regions, 
> Local government bodies, hospitals, schools, courts,  dogs classed as 
> dangerous
>
>  
>
> B  *Point data regularly issued* (time series)
>
> Examples:  Average class size, hospital waiting lists, Gross Domestic 
> Product, violent crime, public service performance, environment quality,
>
>  
>
> C *Policy information*
>
> Examples:  tax bands, benefit determination criteria
>
>  
>
> D  *Datasets collected at one point in time*
>
> Examples:  Population census, surveys, research
>
>  
>
> E *Information containing data* of interest that is regularly published
>
> Examples:  Statutory notices, job vacancies, consultations, 
> legislation, contractual opportunities, press releases
>
>  
>
> F Data associated with *location* (i.e. any information with a 
> geographical location, whether or not they also have other dimensions 
> such as time)
>
> Examples:  traffic information, roadworks, Ministerial visits, 
> planning applications, locations of public transport (trains, buses, 
> trams, ferries, etc), address files (non-personal).
>
>  
> Some of these would have a 'dataset' that is a time series, others a 
> list etc. I agree the key is having an ontology that relates the 
> different parts of a dataset in the way that Brand describes.
>  
> Best seasonal greetings,
>  
> David
>  
> David Pullinger
> david.pullinger@coi.gsi.gov.uk <mailto:david.pullinger@coi.gsi.gov.uk>
> Head of Digital Policy
> Central Office of Information
> Hercules House
> 7 Hercules Road
> London SE1 7DU
> 020 7261 8513
> 07788 872321
>  
> Twitter #digigov and blogs:  www.coi.gov.uk/blogs/digigov 
> <http://www.coi.gov.uk/blogs/digigov>
>  
>
> >>> <Niemann.Brand@epamail.epa.gov> 22/12/2009 13:56 >>>
> Jose, This is the way I look at this - well-constructed data tables 
> consist of a combination of data elements that subject matter experts 
> / statisticians agree make sense together (not apples and oranges as 
> we say) and databases consist of multiple data tables that make sense 
> together, even better have an ontology that relates them and all their 
> data elements.
>
> This is what I have been recommending for Data.gov for some time now.
>
> Best wishes for the holiday season. Brand
>
> -----public-egov-ig-request@w3.org wrote: -----
>
>     To: chris-beer@grapevine.net.au
>     From: Jose Manuel Alonso <josema.alonso@fundacionctic.org>
>     Sent by: public-egov-ig-request@w3.org
>     Date: 12/21/2009 01:33PM
>     cc: "Antti Poikola" <antti.poikola@gmail.com>, "Joe Carmel"
>     <joe.carmel@comcast.net>, "'Jonathan Gray'"
>     <jonathan.gray@okfn.org>, "'Steven Clift'"
>     <clift@e-democracy.org>, public-egov-ig@w3.org,
>     sunlightlabs@groups.google.com, "'Acar, Suzanne'"
>     <suzanne.acar@ic.fbi.gov>
>     Subject: what's a dataset? -- Re: Public Data Catalog Priorities
>     and Demand
>
>     >> * If there is, let's say some thousand, datasets in data.gov, is
>     >> there any analysis or wild guesses of how many is missing 10 000,
>     >> 50 000, 100 000, 500 000?
>     >
>     > I'd say most, but that there is probably not as many as you'd
>     > think. And I'm including data.gov.* in that. (We have to as a
>     > group remain international in focus :) ) Many seem to include
>     > "views" of data in the term "missing datasets" - I think that if
>     > one could identify what datasets are primary (something I'll
>     > expand on once I find what you meant above), then we could
>     > generate a lot of other datasets from these. I guess what I'm
>     > saying is it's probably just as important to ask how many
>     > datasets out there are dependent on other datasets for their data.
>
>     Ok, so I told you this one deserved its own separate message.
>     This is something we've been discussing at CTIC for quite a while:
>     what is a dataset, how would you define it? How would you count
>     how many you've published?
>
>     Is the "2005 Toxics Release Inventory data for the state of Alaska"  
>     one dataset?
>     Is "Toxics Release Inventory data for the state of Alaska" one dataset?
>     Is "Toxics Release Inventory data for all the states" one dataset?
>
>     If all of the above are datasets (even if not), how many is data.gov  
>     publishing?
>
>     In one of the projects I'm currently involved in, the government is  
>     about to publish information about all the public buildings. Is this  
>     one dataset?
>
>     What if the government publishes just the information of the public  
>     schools? One dataset?
>     Then, the one about hospitals... one dataset?
>     But these two types of buildings (and several other types) are
>     part of the big dataset, so is this really a dataset or a subset
>     of the big one? How many should I count? One? Three?
>
>     Unfortunately, I believe I don't have a good answer. I tried for a
>     long while, telling myself a dataset should be anything that is
>     meaningful as a separate entity and that datasets can be combined
>     into super-datasets. Example: public schools is one dataset,
>     hospitals is another one, public buildings is another one, but are
>     those three datasets? hmm... maybe we should only count the
>     smaller ones?
>
>     What if instead of hospitals, we talk about "healthcare related
>     centers" such as: hospitals, ER, GPs, Dentists, Pharmacies,
>     Opticians (taken from NHS.UK). Hey, we have now six datasets? Or
>     just a big one and the six smaller ones are just "a class of" the
>     big one...
>
>     Btw, does the number really matter? Or should we just better catalog  
>     in terms of knowledge areas?
>
>     Unless we (at large) can agree on what is a dataset and how they
>     should be counted, I believe talking about numbers does not make
>     much sense.
>
>     Let the discussion (go on) begin... :)
>
>     -- Jose
>
>
>
>
Received on Monday, 4 January 2010 12:23:43 UTC