what's a dataset? -- Re: Public Data Catalog Priorities and Demand

>> * If there is, let's say some thousand, datasets in data.gov, is there
>> any analysis or wild guesses of how many is missing 10 000, 50 000,
>> 100 000, 500 000?
>
> I'd say most, but that there is probably not as many as you'd think. And
> I'm including data.gov.* in that. (We have to as a group remain
> international in focus :) ) Many seem to include "views" of data in the
> term "missing datasets" - I think that if one could identify what datasets
> are primary (something I'll expand on once I find what you meant above),
> then we could generate a lot of other datasets from these. I guess what
> I'm saying is its probably just as important to ask how many datasets out
> there are dependent on other datasets for their data.

Ok, so I told you this one deserved its own separate message. This is
something we've been discussing at CTIC for quite a while: what is a
dataset, and how would you define it? How would you count how many
you've published?

Is the "2005 Toxics Release Inventory data for the state of Alaska"  
one dataset?
Is "Toxics Release Inventory data for the state of Alaska" one dataset?
Is "Toxics Release Inventory data for all the states" one dataset?

If all of the above are datasets (and even if they are not), how many
datasets is data.gov publishing?

In one of the projects I'm currently involved in, the government is  
about to publish information about all the public buildings. Is this  
one dataset?

What if the government publishes just the information of the public  
schools? One dataset?
Then, the one about hospitals... one dataset?
But these two types of buildings (and several other types) are part of
the big dataset, so is each of them really a dataset or a subset of the
big one? How many should I count? One? Three?

Unfortunately, I believe I don't have a good answer. I tried for a  
long while, telling myself a dataset should be anything that is  
meaningful as a separate entity and that datasets can be combined into  
super-datasets. Example: public schools is one dataset, hospitals is  
another one, public buildings is another one, but are those three  
datasets? hmm... maybe we should only count the smaller ones?
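
To make the counting problem concrete, here is a rough sketch in Python. The `Dataset` class and both counting functions are hypothetical, made up only for illustration; they are not taken from data.gov or any real catalog. It assumes a dataset can simply list the smaller datasets it is combined from:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """Hypothetical catalog entry: a name plus the datasets it combines."""
    name: str
    parts: List["Dataset"] = field(default_factory=list)


def count_all(ds: Dataset) -> int:
    """Count the dataset itself and every dataset nested inside it."""
    return 1 + sum(count_all(p) for p in ds.parts)


def count_smallest(ds: Dataset) -> int:
    """Count only the datasets that are not combined from smaller ones."""
    if not ds.parts:
        return 1
    return sum(count_smallest(p) for p in ds.parts)


public_buildings = Dataset(
    "public buildings",
    [Dataset("public schools"), Dataset("hospitals")],
)

print(count_all(public_buildings))       # 3
print(count_smallest(public_buildings))  # 2
```

The same published data gives 3 under one convention, 2 under the other, and 1 if only the top-level entry is counted, which is exactly why the number alone says little.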

What if instead of hospitals, we talk about "healthcare related  
centers" such as: hospitals, ER, GPs, Dentists, Pharmacies, Opticians  
(taken from NHS.UK). Hey, we have now six datasets? Or just a big one  
and the six smaller ones are just "a class of" the big one...
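
The "a class of" reading can be sketched the same way. In the Python below, the records, the `kind` field and the `split_by_kind` helper are all invented for illustration (this is not NHS data or an NHS schema). The assumption is that there is one primary dataset and that each smaller "dataset" is just a view derived from it:

```python
from collections import defaultdict

# Hypothetical records of one "healthcare related centers" dataset.
centers = [
    {"name": "Center A", "kind": "hospital"},
    {"name": "Center B", "kind": "hospital"},
    {"name": "Center C", "kind": "ER"},
    {"name": "Center D", "kind": "GP"},
    {"name": "Center E", "kind": "dentist"},
    {"name": "Center F", "kind": "pharmacy"},
    {"name": "Center G", "kind": "optician"},
]


def split_by_kind(records):
    """Derive one view per class of center from the primary dataset."""
    views = defaultdict(list)
    for record in records:
        views[record["kind"]].append(record)
    return dict(views)


views = split_by_kind(centers)
print(len(views))  # 6
```

Under this reading only one dataset is published and the six are generated from it on demand, so counting them separately would count the same records twice.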

Btw, does the number really matter? Or would we be better off
cataloging in terms of knowledge areas?

Unless we (at large) can agree on what a dataset is and how datasets
should be counted, I believe talking about numbers doesn't make much
sense.

Let the discussion (go on) begin... :)

-- Jose

Received on Monday, 21 December 2009 18:34:07 UTC