Fwd: [uk-government-data-developers] Data Dumps at source.data.gov.uk

This post by Leigh Dodds to the uk-government-data-developers list
about source.data.gov.uk may be of interest. It's great to see a
methodical approach to making data dumps for data.gov.uk available.

//Ed

---------- Forwarded message ----------
From: Leigh Dodds <leigh.dodds@talis.com>
Date: Tue, Aug 24, 2010 at 8:52 AM
Subject: [uk-government-data-developers] Data Dumps at source.data.gov.uk
To: uk-government-data-developers
<uk-government-data-developers@googlegroups.com>

Hi,

I've just put together an initial set of data dumps for the majority
of the Linked Data currently being published by data.gov.uk. More
information on what's not included and why in a moment.

(Disclaimer: what follows is my understanding of the current state of
play, so blame me for any errors/omissions :)


THE REPOSITORY

There is a server at http://source.data.gov.uk which has been set up
to provide access to both data dumps and (eventually) the code used to
generate/convert the data. The data dumps can be found at:

http://source.data.gov.uk/data/

The intention is to create a repository of versioned datasets that
will allow anyone to mirror the data for their own use/purposes, e.g.
to perform local analysis or to host in your own triple store. Over
time this repository should become a complete archival copy of all of
the Linked Data that is published through data.gov.uk, complete with
information on the provenance of individual datasets.

The team behind data.gov.uk are still working through a number of the
best practices, so right now I've simply put up copies of all the
currently live datasets.


HOW THE DATA IS ORGANISED

The web archive is organised into a series of sub-directories:

* Sector - top-level sector, e.g. as used in *.data.gov.uk
* Dataset - dataset directory, a short identifier for the dataset.
I've made some of these up at present
* Date-stamped directory - in the format yyyy-mm-dd.
* Data files - There may be any number of data files in different
formats, e.g. the data may span a number of small files, some files may
be ntriples for loading into the default graph and some files may be
nquads.

For example, the RDF version of Edubase currently available from
http://education.data.gov.uk can be found here:

http://source.data.gov.uk/data/education/edubase/2009-08-14/

with the general pattern being:

http://source.data.gov.uk/data/[sector]/[dataset]/[timestamp]/
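As a rough illustration of that pattern (a sketch only, not part of any
data.gov.uk tooling), a dump URL can be assembled from its three parts.
The function name dump_url and the constant BASE are made up for this
example; the sector, dataset and date values are the Edubase ones given
above.

```python
from datetime import date

# Root of the dump repository, as given in the post.
BASE = "http://source.data.gov.uk/data/"


def dump_url(sector: str, dataset: str, stamp: date) -> str:
    """Build a dump directory URL following [sector]/[dataset]/[timestamp]/.

    Hypothetical helper: the directory stamp is assumed to be yyyy-mm-dd,
    which is what date.isoformat() produces.
    """
    return f"{BASE}{sector}/{dataset}/{stamp.isoformat()}/"


print(dump_url("education", "edubase", date(2009, 8, 14)))
# prints http://source.data.gov.uk/data/education/edubase/2009-08-14/
```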

Currently only the latest versions of each dataset are being loaded
into the live SPARQL endpoints, but over time there will be a move
towards using named graphs for versioning (as described at [1]).


LINKED DATA, DATA DUMPS & SERVICES

The sector identifier ties together the Linked Data, the data dumps,
and the SPARQL endpoints and other services. For example if you're
looking at some Linked Data, e.g.:

http://education.data.gov.uk/id/school/100866

Then this data will be included in the SPARQL endpoint at:

http://services.data.gov.uk/education/sparql

and in the search interface at:

http://services.data.gov.uk/education/search

And the raw data can be found in one (or more) of the datasets accessible from:

http://source.data.gov.uk/data/education/
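The mapping above can be sketched in a few lines, assuming the sector is
always the first label of a *.data.gov.uk hostname. The function name
related_urls is invented for this example, and the sketch does not know
which subdomains are real sectors, so treat the output as a guess to be
checked rather than a guarantee that the services exist.

```python
from urllib.parse import urlsplit

SUFFIX = ".data.gov.uk"


def related_urls(linked_data_uri: str) -> dict:
    """Derive the SPARQL, search and dump URLs for a Linked Data URI.

    Hypothetical helper: assumes the hostname is [sector].data.gov.uk.
    """
    host = urlsplit(linked_data_uri).hostname or ""
    if not host.endswith(SUFFIX):
        raise ValueError(f"not a *.data.gov.uk URI: {linked_data_uri}")
    sector = host[: -len(SUFFIX)]
    if not sector or "." in sector:
        raise ValueError(f"cannot determine sector from host: {host}")
    return {
        "sparql": f"http://services.data.gov.uk/{sector}/sparql",
        "search": f"http://services.data.gov.uk/{sector}/search",
        "dumps": f"http://source.data.gov.uk/data/{sector}/",
    }


for name, url in related_urls(
    "http://education.data.gov.uk/id/school/100866"
).items():
    print(name, url)
```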


WHAT IS NOT INCLUDED?

As I explained at the start of this email, not all of the Linked Data
being published by data.gov.uk, or by the UK government, is currently
represented in these data dumps.

The RDF available from legislation.gov.uk is currently only
available as Linked Data because it's surfaced directly from the
website. Ditto the RDFa published on the London Gazette website.
It would be possible to regularly crawl and dump those sources,
but I'm not sure if there are plans to do that yet. Other departments
and projects may also surface their own data and data dumps.

The other data not represented in the dumps are the
date-time URIs available from reference.data.gov.uk, e.g. [2], as
these are all algorithmically generated. I don't recommend anyone
crawls those :)
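To show why crawling is unnecessary, here is a small sketch that
generates day URIs locally. It covers only the /id/day/ pattern seen in
[2]; the function names day_uri and day_uris are made up for this
example, and other date-time URI patterns on reference.data.gov.uk are
not assumed here.

```python
from datetime import date, timedelta
from typing import Iterator


def day_uri(day: date) -> str:
    """Build a day URI following the pattern of reference [2]."""
    return f"http://reference.data.gov.uk/id/day/{day.isoformat()}"


def day_uris(start: date, count: int) -> Iterator[str]:
    """Yield URIs for `count` consecutive days beginning at `start`."""
    for offset in range(count):
        yield day_uri(start + timedelta(days=offset))


for uri in day_uris(date(2010, 9, 24), 3):
    print(uri)
```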

If you have any questions, please ask.

Cheers,

L.

[1]. http://www.jenitennison.com/blog/node/141
[2]. http://reference.data.gov.uk/id/day/2010-09-24

--
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com

Received on Tuesday, 24 August 2010 13:22:26 UTC