[Fwd: Re: Public Data Catalog Priorities and Demand]

From: Chris Beer <chris-beer@grapevine.net.au>
Date: Tue, 22 Dec 2009 21:29:57 +1100
Message-ID: <4B309FA5.8070501@grapevine.net.au>
To: W3C e-Gov IG <public-egov-ig@w3.org>
All

Forwarding this on (mostly for Jose's benefit). Originally I forgot to 
Reply-To-All when replying to Antti, so I copied and pasted my reply 
to the list. Antti replied directly back to me, so I'm forwarding that 
on for the benefit of the ongoing discussion. Plus it will make my 
replies make sense! :)

Cheers

Chris

====================================

"Hello Chris,

I'm glad to read your comments, I'm learning a lot.

Just to give you background: I'm currently writing a guide book about 
opening up PSI in Finland (for the Ministry of communications and 
transportation). I also help the Finnish government in their early ideas 
about the national data catalog.

>> * How would the "national registry of lakes", "geodata of high voltage
>> electric network", "public job vacancies" and "directory of restaurants
>> holding licence to serve alcohol" for example relate to the universe of
>> PSI?
>>     
>
> Interesting questions - and I started writing a long reply addressing
> them, but I'm stopping to ask: how do you mean "relates to" - as in where
> it would sit? How it would be classified?
>   

By "relates to" I mean how those examples are parts of the 
whole.

>> * If there are, let's say, some thousand datasets in data.gov, are there
>> any analyses or wild guesses of how many are missing: 10 000, 50 000,
>> 100 000, 500 000?
>>     
>
> I'd say most, but that there are probably not as many as you'd think. And
> I'm including data.gov.* in that. (We have to as a group remain
> international in focus :) )

Definitely international focus! I just used data.gov as a well-known 
example, but I'm also familiar with http://data.australia.gov.au and 
many others. It would be nice to see some comparison between the content 
of different national catalogs: what is similar, what is different, etc.

> Many seem to include "views" of data in the term "missing datasets" - I
> think that if one could identify what datasets are primary (something
> I'll expand on once I find what you meant above), then we could generate
> a lot of other datasets from these. I guess what I'm saying is it's
> probably just as important to ask how many datasets out there are
> dependent on other datasets for their data.
>   

That is a very good point and leads to the observation that most 
organizations that are often seen as data providers actually consume 
data from other sources, add something of their own, aggregate, mix, 
analyse etc. and then produce new datasets.

The rule of thumb in opening the data seems to be that it should be 
opened as close to the raw/primary data source as possible.

>> * Is there any analysis of what is popularly used and what is pure noise
>> and not interesting to any developers, democracy advocates or anybody?
>>     
>
> I know in Australia there is - it's something our Bureau of Statistics
> looks at. From memory, GIS, Labour Market and Population data are
> popular, as is weather information. I'd say it's all interesting to
> someone. "If you build it they will come" - i.e.: what would be popular
> if people were able to use any dataset, if they could, or knew how, or
> knew what to use it for? Perhaps it's just that developers and others
> haven't come up with good applications for the data yet.
>   

Yes, I agree with the notion that "it is all useful to someone", but 
for practical reasons some sort of importance priorities would be 
useful. It's true that these priorities are easily biased by the 
existing open datasets that are popular, because they have been open for 
some time and developers have had time to develop their ideas.

> Looking at Jose's reply below btw - in Australia the "most important"
> PSI based on citizen usage of services using the PSI is Weather
> information, followed by (in no order) Tax, Employment and Social
> Security. Combined with Spain it's not much, but I'm betting that if you
> got similar information from other places you'd see a pattern emerge:
> people are interested in information that affects them.
>   

From Finland I would add public transportation data to the list, and in 
the States crime data seems to be popular for some (maybe cultural) 
reason.

> It's like there is a lack of interest in data as information for the greater
> good, such as the country (any more than they care about the sources and
> background of the headline story in the news. They read the story, they
> get passionate, they buy a latte and talk about it for 15 minutes with
> friends. They go back to work - they don't generally start their own in
> depth expose on the story.). There is an assumption that there is someone
> out there who knows what to do with that data and is paid well to do so.
> And that eventually the important bits will filter down into something
> that the individual can use.
>
> Generally they (for instance) won't be interested in things such as
> migration rates, annual rainfall, or leading economic indicators, so much
> as "Where do *I* get a job", "Will it rain on *my* party this weekend" or
> "Can I get a rebate for this on *my* tax" etc.
>
> However when drilling down to this personal application of data, I think
> there will always be a need to leverage off that top level primary data
> that comes across as "pure noise" in some cases because the uses for it
> aren't immediately obvious.
>
> Is it possible or feasible to "weight" datasets as to what is important
> and what is not? Does popular = important? Personally I think it's all
> vital, and the concept of openness and transparency means ultimately,
> where possible, it all needs to go up, regardless of whether it gets
> used regularly or not.
>   

I would say that popular is not equal to important. The ecosystem of 
open data (the network of people and organizations using and producing 
data) is now evolving fast. First come the showcases from popular data, 
and a bit later the really important cases hopefully follow.

> (Thought for the day? : Only one person may ever look at a dataset
> relating to GIS data and observatories for instance, but that one person
> might be Stephen Hawking, and that one use of that one dataset by that
> one person might change the world.)
>   

Thanks for your thoughts; I hope to continue the discussion with you. 
As I said before, I very much like the ideology behind your Stephen 
Hawking story. Nevertheless, my inner pragmatist keeps asking for 
some priorities because "seeing is believing" and the open data 
movement gets faster when there are some visible showcases.

-Antti Poikola"

> Cheers
>
> Chris
>
>  
>> I found these two analyses of data.gov:
>> http://blog.programmableweb.com/2009/07/20/whats-in-datagov/
>> http://data-gov.tw.rpi.edu/wiki/File:Data-gov-cloud-200910.png
>>
>>
>>
>> Jose Manuel Alonso wrote:
>>    
>>> My guess based on current experience is that this is not easy to
>>> compile. A national (Spain) report on eGov recently released states
>>> that the two most important information sets at regional (state) level
>>> for citizens are: organization chart and public job vacancies.
>>>       
>> Any link to that?
>>
>>    
>>> That said, there are many more variables that have an impact on an
>>> open data project. We have identified 20+ important ones: some are
>>> technical, some are organizational, some are policy-related... it's a
>>> tough and complicated issue.
>>>       
>> Would you mind sharing those 20+ on some wiki page where we could
>> discuss them?
>>
>>    
>>> Just my 2 euro cents :-)
>>>
>>> -- Jose
>>>
>>>
>>> On 18/12/2009, at 16:10, Joe Carmel wrote:
>>>      
>>>> I totally agree with you, Antti.  I think data.gov and other
>>>> government websites should be looking to use a standards-based data
>>>> cataloging format (e.g., extending AtomXML or OPDS) that allows entry
>>>> links to point to data files or other catalogs.  Similar to sitemaps
>>>> and HTML, governments would publish a file at the root of their
>>>> websites that provides a catalog of the data files on their site.  By
>>>> enabling the catalog format to point to other catalogs, a root catalog
>>>> could point to sub-department level catalogs, allowing data catalog
>>>> management responsibilities to be distributed within an organization.
>>>>
>>>> At present, governments use HTML in a variety of ways for data
>>>> cataloging.  This looser approach has made it difficult to get one's
>>>> arms around all of the data being published at a given site (e.g.,
>>>> http://www.atlantis-press.com/php/download_paper.php?id=1763).  IMO,
>>>> if a standard data catalog format were used, it would presumably be
>>>> based on XML, which would enable individual catalogs to "look"
>>>> different from one site to another (using CSS or XSL), but the
>>>> underlying data structures would be the same--allowing for machine
>>>> readability.
>>>>
>>>> By providing access to remote data storage, the Internet has been
>>>> used to publish data and documents.  Standard file names (index.htm,
>>>> main.htm) are used as HTML entry points for websites.  The default
>>>> HTML file then uses hypertext links to provide access to subsequent
>>>> files.  In the same way HTML provides links to any file, I believe
>>>> that standardized catalog files pointing to sub-catalogs and data
>>>> files could enable a more searchable and usable web of data.
>>>>
>>>> Joe
>>>>
>>>> -----Original Message-----
>>>> From: public-egov-ig-request@w3.org
>>>> [mailto:public-egov-ig-request@w3.org]
>>>> On Behalf Of Antti Poikola
>>>> Sent: Friday, December 18, 2009 1:10 AM
>>>> To: Jonathan Gray
>>>> Cc: Steven Clift; public-egov-ig@w3.org; 
>>>> sunlightlabs@groups.google.com
>>>> Subject: Re: Public Data Catalog Priorities and Demand
>>>>
>>>> Hi,
>>>>
>>>> Please Jonathan, Steven and others, let us know if you find some
>>>> visualization, categorization or prioritization that would clarify the
>>>> "swamp" of public sector information sources.
>>>>
>>>> I'm looking for two things:
>>>>
>>>> 1. An easy way to get the BIG PICTURE of what kind of public sector
>>>> information most probably exists (even if it is not open yet)
>>>> in a typical country or city.
>>>>
>>>> 2. Some priorities from the information re-users' point of view
>>>>
>>>> So far I have found only listings and catalogues that can be
>>>> re-ordered according to some topics (for example CKAN and data.gov),
>>>> but these do not really help to give the big picture. From catalogues
>>>> of this kind it is easy to find a specific data source if you know
>>>> what you are looking for, but if you just want to see what is out
>>>> there and build an overview, the catalogues are not so helpful.
>>>>
>>>> Best regards
>>>>
>>>> -Antti "Jogi" Poikola
>>>>
>>>>
>>>> Jonathan Gray wrote:
>>>>        
>>>>> Just to let you know, we're currently working on this with CKAN.net.
>>>>> Also very interested in thinking about how we can track how different
>>>>> datasets are reused.
>>>>>
>>>>> Jonathan
>>>>>
>>>>> On Mon, Nov 23, 2009 at 4:20 PM, Steven Clift <clift@e-democracy.org> wrote:
>>>>>
>>>>>> Has anyone explored what government data is in highest "demand" on
>>>>>> the emerging public data reuse sites? How does interest from
>>>>>> different re-user audiences vary (e.g.  business, media, open gov
>>>>>> advocates, independent coders, etc.)?
>>>>>>
>>>>>> Also, has anyone started a comparison chart of what different
>>>>>> governments are providing? It would be interesting to quickly see
>>>>>> what different national or local governments are providing now and
>>>>>> over time. This gets to the "what's important" to release for easy
>>>>>> reuse versus what is the easiest or least politically sensitive.
>>>>>>
>>>>>> Steven Clift
>>>>>> E-Democracy.org
>>>>>>
>>>>>> -- 
>>>>>> Steven Clift - http://stevenclift.com
>>>>>> Executive Director - http://E-Democracy.Org
>>>>>> Follow me - http://twitter.com/democracy
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>
>>>>>           
>>>>
>>>>         
>>>       
>>
>>     
>
>
>
>   
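
Joe's suggestion quoted above, a root catalog file whose entries link either
to data files or to further catalogs, can be sketched roughly as follows.
This is only an illustration under stated assumptions: the `kind` and `href`
fields, the example.gov URLs and the `walk_catalog` helper are all
hypothetical and do not come from Atom, OPDS or any proposal made in this
thread. The catalogs are held in memory rather than fetched and parsed.

```python
from typing import Dict, Iterator, List, Optional, Set

# Hypothetical in-memory stand-in for catalog files published at site roots.
# Keys are catalog URLs; each entry links either to a data file or to
# another catalog. A real system would fetch and parse these documents.
CATALOGS: Dict[str, List[dict]] = {
    "http://example.gov/catalog.xml": [
        {"kind": "dataset", "href": "http://example.gov/data/lakes.csv"},
        {"kind": "catalog", "href": "http://example.gov/transport/catalog.xml"},
        {"kind": "catalog", "href": "http://example.gov/labour/catalog.xml"},
    ],
    "http://example.gov/transport/catalog.xml": [
        {"kind": "dataset", "href": "http://example.gov/transport/timetables.csv"},
    ],
    "http://example.gov/labour/catalog.xml": [
        {"kind": "dataset", "href": "http://example.gov/labour/vacancies.csv"},
        # A link back to the root, to show that cycles are tolerated.
        {"kind": "catalog", "href": "http://example.gov/catalog.xml"},
    ],
}


def walk_catalog(
    url: str,
    catalogs: Dict[str, List[dict]],
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield every dataset link reachable from the catalog at `url`."""
    if seen is None:
        seen = set()
    if url in seen:
        return  # already visited: avoid looping on circular links
    seen.add(url)
    for entry in catalogs.get(url, []):
        if entry["kind"] == "dataset":
            yield entry["href"]
        elif entry["kind"] == "catalog":
            # Descend into the sub-catalog maintained by another unit.
            yield from walk_catalog(entry["href"], catalogs, seen)


datasets = list(walk_catalog("http://example.gov/catalog.xml", CATALOGS))
for href in datasets:
    print(href)
```

Because each sub-catalog is a separate file, each department could maintain
its own list while a crawler starting at the root still sees everything.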




Received on Tuesday, 22 December 2009 10:30:21 GMT
