Re: Public Data Catalog Priorities and Demand from chris-beer@grapevine.net.au on 2009-12-20 (public-egov-ig@w3.org from December 2009)

From: <chris-beer@grapevine.net.au>
Date: Mon, 21 Dec 2009 10:41:19 +1100 (EST)
To: "Antti Poikola" <antti.poikola@gmail.com>
Cc: "Jose Manuel Alonso" <josema.alonso@fundacionctic.org>, "Joe Carmel" <joe.carmel@comcast.net>, "'Jonathan Gray'" <jonathan.gray@okfn.org>, "'Steven Clift'" <clift@e-democracy.org>, public-egov-ig@w3.org, sunlightlabs@groups.google.com, "'Acar, Suzanne'" <suzanne.acar@ic.fbi.gov>
Message-ID: <51176.165.12.252.113.1261352479.squirrel@webmail.grapevine.com.au>
> Thanks Jose for your "two cents" and to the others who reacted to my
> question.
>
> To make it clear, I'm not quite looking for a standard cataloging format,
> but the human understandable big picture, a visualization or easy to
> grasp categorization/list of typical PSI datasets, maybe a "map of PSI".
> This discussion develops the questions and indicates that there is no
> clear answer to the questions yet.

Yep :)

> Some more questions...

>
> * How would the "national registry of lakes", "geodata of high voltage
> electric network", "public job vacancies" and "directory of restaurants
> holding licence to serve alcohol" for example relate to the universe of
> PSI?

Interesting questions - and I started writing a long reply addressing them,
but I'm stopping to ask: what do you mean by "relate to" - as in where it
would sit? How it would be classified?

>
> * If there are, let's say, some thousand datasets in data.gov, are there
> any analyses or wild guesses of how many are missing: 10 000, 50 000,
> 100 000, 500 000?

I'd say most, but that there are probably not as many as you'd think. And
I'm including data.gov.* in that. (We have to, as a group, remain
international in focus :) ) Many seem to include "views" of data in the
term "missing datasets" - I think that if one could identify which datasets
are primary (something I'll expand on once I find out what you meant above),
then we could generate a lot of other datasets from these. I guess what
I'm saying is it's probably just as important to ask how many datasets out
there are dependent on other datasets for their data.
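
To make the primary-versus-dependent idea concrete, here is a minimal
sketch. The dataset names and the DEPENDS_ON mapping are made up for
illustration and do not describe any real catalog; it also assumes no
dataset depends on itself, directly or indirectly.

```python
# Hypothetical catalog: each dataset lists the datasets it draws its data from.
# An empty list marks a "primary" dataset.
DEPENDS_ON = {
    "population_census": [],
    "weather_observations": [],
    "job_vacancies": [],
    "unemployment_rate_by_region": ["population_census", "job_vacancies"],
    "rainfall_by_region": ["weather_observations"],
    "regional_summary": ["unemployment_rate_by_region", "rainfall_by_region"],
}


def primary_datasets(depends_on):
    """Return the names of datasets that depend on no other dataset."""
    return sorted(name for name, sources in depends_on.items() if not sources)


def dependent_count(depends_on):
    """Count the datasets that get their data from at least one other dataset."""
    return sum(1 for sources in depends_on.values() if sources)


def primary_sources(name, depends_on):
    """Return the set of primary datasets that `name` ultimately relies on."""
    sources = depends_on[name]
    if not sources:
        return {name}
    found = set()
    for source in sources:
        found |= primary_sources(source, depends_on)
    return found


if __name__ == "__main__":
    print("primary:", primary_datasets(DEPENDS_ON))
    print("dependent datasets:", dependent_count(DEPENDS_ON))
    print("regional_summary relies on:",
          sorted(primary_sources("regional_summary", DEPENDS_ON)))
```

In this toy catalog half of the entries are views over the other half,
which is the kind of ratio it would be useful to know for a real one.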

>
> * Is there any analysis of what is popularly used and what is pure noise
> and not interesting to any developers, democracy advocates or anybody?

I know in Australia there is - it's something our Bureau of Statistics
looks at. From memory, GIS, Labour Market and Population data are popular,
as is weather information. I'd say it's all interesting to someone. "If
you build it they will come" - i.e. what would be popular if people were
able to use any dataset, and knew how, or knew what to use it for?
Perhaps it's just that developers and others haven't come up with
good applications for the data yet.

Looking at Jose's reply below, btw - in Australia the "most important" PSI,
based on citizen usage of services using the PSI, is Weather information,
followed by (in no particular order) Tax, Employment and Social Security.
Combined with Spain it's not much, but I'm betting that if you got similar
information from other places you'd see a pattern emerge: people are
interested in information that affects them.

It's as if there is a lack of interest in data as information for the
greater good, such as that of the country (people care about it no more
than they care about the sources and background of the headline story in
the news. They read the story, they get passionate, they buy a latte and
talk about it for 15 minutes with friends. They go back to work - they
don't generally start their own in-depth exposé on the story.). There is
an assumption that there is someone out there who knows what to do with
that data and is paid well to do so, and that eventually the important
bits will filter down into something that the individual can use.

Generally they (for instance) won't be interested in things such as
migration rates, annual rainfall, or leading economic indicators, so much
as "Where do *I* get a job", "Will it rain on *my* party this weekend" or
"Can I get a rebate for this on *my* tax" etc.

However when drilling down to this personal application of data, I think
there will always be a need to leverage off that top level primary data
that comes across as "pure noise" in some cases because the uses for it
aren't immediately obvious.

Is it possible or feasible to "weight" datasets as to what is important
and what is not? Does popular = important? Personally I think it's all
vital, and the concept of openness and transparency means ultimately, where
possible, it all needs to go up, regardless of whether it gets used
regularly or not.
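
One possible way to weight, sketched below, is to credit a primary dataset
with the usage of whatever is built directly on top of it, so that "noise"
data inherits weight from the popular services it feeds. All names and
usage figures are invented for illustration, and only one level of
derivation is handled.

```python
# Hypothetical direct usage counts (e.g. requests per day); not real statistics.
DIRECT_USE = {
    "weather_observations": 5,
    "weekend_forecast": 900,
    "labour_market_survey": 10,
    "job_outlook_by_region": 600,
}

# Hypothetical: which source datasets each derived dataset is built from.
BUILT_FROM = {
    "weekend_forecast": ["weather_observations"],
    "job_outlook_by_region": ["labour_market_survey"],
}


def weights(direct_use, built_from):
    """Return direct usage plus the usage of datasets built directly on each one."""
    total = dict(direct_use)
    for derived, sources in built_from.items():
        for source in sources:
            total[source] = total.get(source, 0) + direct_use.get(derived, 0)
    return total


if __name__ == "__main__":
    for name, weight in sorted(weights(DIRECT_USE, BUILT_FROM).items()):
        print(f"{name}: direct={DIRECT_USE[name]}, weighted={weight}")
```

By direct usage alone the raw observations look unimportant; once the
forecast's usage is credited back to them, they come out on top.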

(Thought for the day? : Only one person may ever look at a dataset
relating to GIS data and observatories for instance, but that one person
might be Stephen Hawking, and that one use of that one dataset by that
one person might change the world.)

Cheers

Chris

>
> I found these two analyses of data.gov:
> http://blog.programmableweb.com/2009/07/20/whats-in-datagov/
> http://data-gov.tw.rpi.edu/wiki/File:Data-gov-cloud-200910.png
>
>
>
> Jose Manuel Alonso wrote:
>> My guess based on current experience is that this is not easy to
>> compile. A national (Spain) report on eGov recently released states
>> that the two most important information sets at regional (state) level
>> for citizens are: organization chart and public job vacancies.
> Any link to that?
>
>> That said, there are many more variables that have an impact on an
>> open data project. We have identified 20+ important ones; some are
>> technical, some are organizational, some are policy-related... it's a
>> tough and complicated issue.
>
> Mind sharing those 20+ on some wiki page where we could discuss them?
>
>> Just my 2 euro cents :-)
>>
>> -- Jose
>>
>>
>> On 18/12/2009, at 16:10, Joe Carmel wrote:
>>> I totally agree with you Antti.  I think data.gov and other government
>>> websites should be looking to use a standards-based data cataloging
>>> format (e.g., extending AtomXML or OPDS) that allows entry links to
>>> point to data files or other catalogs.  Similar to sitemaps and HTML,
>>> governments would publish a file at the root of their websites that
>>> provides a catalog of the data files on their site.  By enabling the
>>> catalog format to point to other catalogs, a root catalog could point
>>> to sub-department level catalogs, allowing data catalog management
>>> responsibilities to be distributed within an organization.
>>>
>>> At present, governments use HTML in a variety of ways for data
>>> cataloging.  This looser approach has made it difficult to get one's
>>> arms around all of the data being published at a given site (e.g.,
>>> http://www.atlantis-press.com/php/download_paper.php?id=1763).  IMO,
>>> if a standard data catalog format were used, it would presumably be
>>> XML-based, which would enable individual catalogs to "look" different
>>> from one site to another (using CSS or XSL), while the underlying data
>>> structures would be the same, allowing for machine readability.
>>>
>>> By providing access to remote data storage, the Internet has been
>>> used to publish data and documents.  Standard file names (index.htm,
>>> main.htm) are used as HTML entry points for websites.  The default
>>> HTML file then uses hypertext links to provide access to subsequent
>>> files.  In the same way that HTML provides links to any file, I
>>> believe that standardized catalog files pointing to sub-catalogs and
>>> data files could enable a more searchable and usable web of data.
>>>
>>> Joe
>>>
>>> -----Original Message-----
>>> From: public-egov-ig-request@w3.org
>>> [mailto:public-egov-ig-request@w3.org]
>>> On Behalf Of Antti Poikola
>>> Sent: Friday, December 18, 2009 1:10 AM
>>> To: Jonathan Gray
>>> Cc: Steven Clift; public-egov-ig@w3.org; sunlightlabs@groups.google.com
>>> Subject: Re: Public Data Catalog Priorities and Demand
>>>
>>> Hi,
>>>
>>> Please Jonathan, Steven and others, let us know if you find some
>>> visualization, categorization or prioritization that would clarify the
>>> "swamp" of public sector information sources.
>>>
>>> I'm looking for two things:
>>>
>>> 1. An easy way to get the BIG PICTURE of what kind of public sector
>>> information most probably exists (even if it is not open yet)
>>> in a typical country or city.
>>>
>>> 2. Some priorities from the information re-users' point of view
>>>
>>> So far I have found only listings and catalogues that can be re-ordered
>>> according to some topics (for example CKAN and data.gov), but these are
>>> not really helping to give the big picture. From this kind of
>>> catalogue it is easy to find a specific data source if you know what
>>> you are looking for, but if you just want to see what is out there and
>>> build an overview, the catalogues are not so helpful.
>>>
>>> Best regards
>>>
>>> -Antti "Jogi" Poikola
>>>
>>>
>>> Jonathan Gray wrote:
>>>> Just to let you know, we're currently working on this with CKAN.net.
>>>> Also very interested in thinking about how we can track how different
>>>> datasets are reused.
>>>>
>>>> Jonathan
>>>>
>>>> On Mon, Nov 23, 2009 at 4:20 PM, Steven Clift <clift@e-democracy.org> wrote:
>>>>
>>>>> Has anyone explored what government data is in highest "demand" on
>>>>> the emerging public data reuse sites? How does interest from different
>>>>> re-user audiences vary (e.g. business, media, open gov advocates,
>>>>> independent coders, etc.)?
>>>>>
>>>>> Also, has anyone started a comparison chart of what different
>>>>> governments are providing? It would be interesting to quickly see
>>>>> what different national or local governments are providing now and
>>>>> over time. This gets to "what's important" to release for easy reuse
>>>>> versus what is the easiest or least politically sensitive.
>>>>>
>>>>> Steven Clift
>>>>> E-Democracy.org
>>>>>
>>>>> --
>>>>> Steven Clift - http://stevenclift.com
>>>>> Executive Director - http://E-Democracy.Org
>>>>> Follow me - http://twitter.com/democracy
>>>>>
Received on Sunday, 20 December 2009 23:41:51 UTC