- From: Ed Summers <ehs@pobox.com>
- Date: Tue, 24 Aug 2010 09:21:58 -0400
- To: public-egov-ig <public-egov-ig@w3.org>
This post by Leigh Dodds to the uk-government-data-developers list about source.data.gov.uk should be of potential interest. It's great to see a methodical approach to making data dumps for data.gov.uk available.

//Ed

---------- Forwarded message ----------
From: Leigh Dodds <leigh.dodds@talis.com>
Date: Tue, Aug 24, 2010 at 8:52 AM
Subject: [uk-government-data-developers] Data Dumps at source.data.gov.uk
To: uk-government-data-developers <uk-government-data-developers@googlegroups.com>

Hi,

I've just put together an initial set of data dumps for the majority of the Linked Data currently being published by data.gov.uk. More information on what's not included, and why, in a moment.

(Disclaimer: what follows is my understanding of the current state of play, so blame me for any errors/omissions :)

THE REPOSITORY

There is a server at http://source.data.gov.uk which has been set up to provide access to both data dumps and (eventually) the code used to generate/convert the data.

The data dumps can be found at:

http://source.data.gov.uk/data/

The intention is to create a repository of versioned datasets that will allow anyone to mirror the data for their own purposes, e.g. to perform local analysis or to host it in their own triple store.

Over time this repository should become a complete archival copy of all of the Linked Data that is published through data.gov.uk, complete with information on the provenance of individual datasets.

The team behind data.gov.uk are still working through a number of the best practices, so right now I've simply put up copies of all the currently live datasets.

HOW THE DATA IS ORGANISED

The web archive is organised into a series of sub-directories:

* Sector - top-level sector, e.g. as used in *.data.gov.uk
* Dataset - dataset directory, a short identifier for the dataset. I've made some of these up at present
* Date-stamped directory - in the format yyyy-mm-dd
* Data files - there may be a number of data files in different formats. E.g. the data may span a number of small files; some files may be N-Triples for loading into the default graph and some may be N-Quads.

For example, the RDF version of Edubase currently available from http://education.data.gov.uk can be found here:

http://source.data.gov.uk/data/education/edubase/2009-08-14/

with the general pattern being:

http://source.data.gov.uk/data/[sector]/[dataset]/[timestamp]/

Currently only the latest version of each dataset is being loaded into the live SPARQL endpoints, but over time there will be a move towards using named graphs for versioning (as described at [1]).

LINKED DATA, DATA DUMPS & SERVICES

The sector identifier ties together the Linked Data, the data dumps, and the SPARQL endpoints and other services.

For example, if you're looking at some Linked Data, e.g.:

http://education.data.gov.uk/id/school/100866

then this data will be included in the SPARQL endpoint at:

http://services.data.gov.uk/education/sparql

and in the search interface at:

http://services.data.gov.uk/education/search

and the raw data can be found in one (or more) of the datasets accessible from:

http://source.data.gov.uk/data/education/

WHAT IS NOT INCLUDED?

As I explained at the start of this email, not all of the Linked Data being published from data.gov.uk, or by the UK government, is currently represented in these data dumps.

The RDF from legislation.gov.uk is currently only available as Linked Data because it's surfaced directly from the website. Ditto for the data published from the London Gazette website as RDFa. It would be possible to regularly crawl and dump those sources, but I'm not sure if there are plans to do that yet.

Other departments and projects may also surface their own data and data dumps.

The other dataset not represented in the dumps is the set of date-time URIs available from reference.data.gov.uk, e.g. [2], as these are all algorithmically generated. I don't recommend anyone crawls those :)

If you have any questions, please ask.

Cheers,

L.

[1]. http://www.jenitennison.com/blog/node/141
[2]. http://reference.data.gov.uk/id/day/2010-09-24

--
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com
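
The way the sector identifier ties the Linked Data, the dumps and the services together, as described in the message above, can be sketched in a few lines of Python. This is only an illustration built from the URL patterns quoted in the message: the function names (`sector_of`, `sector_urls`, `dump_url`) are made up here, and the assumption that the sector is always the leading part of a `*.data.gov.uk` hostname is inferred from the single education example, not confirmed by the data.gov.uk team.

```python
from datetime import date
from urllib.parse import urlparse

# Base URLs taken from the message; everything else is hypothetical.
SOURCE_BASE = "http://source.data.gov.uk/data"
SERVICES_BASE = "http://services.data.gov.uk"
DOMAIN_SUFFIX = ".data.gov.uk"


def sector_of(linked_data_uri):
    """Guess the sector from a Linked Data URI such as
    http://education.data.gov.uk/id/school/100866 (assumed convention)."""
    host = urlparse(linked_data_uri).hostname or ""
    if not host.endswith(DOMAIN_SUFFIX):
        raise ValueError(f"not a *.data.gov.uk URI: {linked_data_uri}")
    return host[: -len(DOMAIN_SUFFIX)]


def sector_urls(sector):
    """Dump listing, SPARQL endpoint and search interface for a sector."""
    return {
        "dumps": f"{SOURCE_BASE}/{sector}/",
        "sparql": f"{SERVICES_BASE}/{sector}/sparql",
        "search": f"{SERVICES_BASE}/{sector}/search",
    }


def dump_url(sector, dataset, stamp):
    """Directory for one dated dump: [sector]/[dataset]/[yyyy-mm-dd]/."""
    return f"{SOURCE_BASE}/{sector}/{dataset}/{stamp.isoformat()}/"


if __name__ == "__main__":
    sector = sector_of("http://education.data.gov.uk/id/school/100866")
    print(sector_urls(sector))
    print(dump_url(sector, "edubase", date(2009, 8, 14)))
```

Note that hosts such as source.data.gov.uk, services.data.gov.uk and reference.data.gov.uk also match the `*.data.gov.uk` pattern without being sectors, so a real tool would need to exclude them explicitly.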
Received on Tuesday, 24 August 2010 13:22:26 UTC