Re: last call for comments from Leigh Dodds on 2009-08-26 (public-egov-ig@w3.org from August 2009)

From: Leigh Dodds <leigh.dodds@talis.com>
Date: Wed, 26 Aug 2009 10:05:55 +0100
To: Daniel Bennett <daniel@citizencontact.com>
Cc: eGovIG <public-egov-ig@w3.org>
Message-ID: <f323a4470908260205t66f32c1fwde340f053a27fc42@mail.gmail.com>
Hi Daniel,

I've gone through and reviewed the document and have included some
comments below.

* Initial Paragraph:

This needs to be slightly longer to more clearly set out what the memo
is actually about.

* Section: "Some easy steps, but only starting points"

I think this section needs to be heavily reworked. I don't think it
works well as the opening section of the memo as the core message:
make sure you publish some raw data, in a machine-processable format,
is obscured by references to a lot of different technologies a number
of which are either obsolete or rarely used (Gopher, XPointer).
There's also specific technical advice that either lacks references to
further reading or is inaccurate, e.g. which accessibility requirements
should be abided by?

I wonder whether this section might be better placed towards the end
of the document after the "We're all learning" section. That would make
a nice, readable progression, e.g. "we're all learning, but here are
some simple steps that have proved successful so far".

However, I do think there needs to be a simple, clear message right at
the start of the document, that publishing raw data, ideally in CSV,
XML, RDF, or XLS, is the single most important step.

* Section: "Identify"

A suggested revision:

It should be a matter of best practice for publishing open government
data on the web to apply the technical principles described in
Architecture of the World Wide Web, Volume 1. The critical
foundational principle is to identify things using a URI/URL. This
applies not just to the documents and files that carry the data, but
also to the resources which are referenced or described in that data:
i.e. the people, places, events, legislation, etc. Permanent, easily
discoverable URIs form the basis for creating unique identifiers that
scale to the web. These stable identifiers can also be used to tie
together data from different sources, greatly simplifying data
integration.

Defining simple patterns for creating new URIs makes it easy for
different groups and departments to create unique, global identifiers.
For example, a URI can be created by appending an existing unique,
non-web identifier, e.g. derived from a database key, to a common base
URL. E.g. data about organization 12345 could be published at
http://www.example.gov/organizations/12345, whilst data about area
code A4567 could be published at
http://www.example.gov/organizations/A4567. Agreeing on simple
patterns for creating new, unique, and importantly, stable identifiers
is an important first step in putting data onto the web.
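
As a rough illustration of such a pattern (a sketch only, not part of
the suggested text; the function name and the BASE_URL constant are
made up for the example), minting a URI from an existing database key
could look like this in Python:

```python
from urllib.parse import quote

# Assumed common base URL, taken from the example.gov placeholder above.
BASE_URL = "http://www.example.gov"


def mint_uri(collection: str, local_id: str, base_url: str = BASE_URL) -> str:
    """Build a URI by appending an existing non-web identifier to a base URL."""
    # Percent-encode each path segment so that unusual keys (spaces,
    # slashes) still produce a valid, unambiguous URI.
    segments = [quote(collection, safe=""), quote(str(local_id), safe="")]
    return base_url.rstrip("/") + "/" + "/".join(segments)


print(mint_uri("organizations", "12345"))
```

The important property is that the same key always yields the same
URI, so different departments can generate identifiers independently
without coordinating each one.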

* Section "Document"

Suggested revision:

Without supporting documentation, e.g. to describe the contents of a
dataset, the published data may be hard to reuse. Publishing some
minimal documentation with a dataset, e.g. at an associated web page,
will ensure that re-users can clearly understand what the dataset
contains. Minimal documentation would include a title, description, a
publication date, and perhaps some notes on the origins of the data. As
noted later in this memo, the license for the data should also be
clearly documented.
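
To make "minimal documentation" concrete, here is a small Python
sketch of such a record (not part of the suggested text). The class
name, field names, and sample values are all hypothetical; the memo
doesn't prescribe a schema, so treat this as one possible shape.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class DatasetDescription:
    """Hypothetical minimal record published alongside a dataset."""
    title: str
    description: str
    published: date
    origin_notes: str
    license_url: str

    def to_json(self) -> str:
        record = asdict(self)
        # Dates are not JSON-serialisable, so store an ISO 8601 string.
        record["published"] = self.published.isoformat()
        return json.dumps(record, indent=2)


# Sample values are invented purely for illustration.
example = DatasetDescription(
    title="Registered schools",
    description="One row per registered school.",
    published=date(2009, 8, 26),
    origin_notes="Exported from an internal register.",
    license_url="http://creativecommons.org/publicdomain/zero/1.0/",
)
print(example.to_json())
```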

If data is published according to either custom or industry standard
schemas, then also include links and references to the relevant
standards so that developers can find additional supporting
documentation and tools.

Building a browsable and/or searchable directory of data is also a
useful way of allowing people to find the range of datasets that are
available.

* Section "Link"

This is the first section that refers to "linked data", listing the
four main principles. I think a bit more context is required here:
initially the memo talks about at least publishing raw data using CSV,
XML, etc. It seems a leap to then jump to Linked Data. Perhaps in this
section of the memo, which is mainly about basic best practices and
principles, it would be enough to say that it is important to include
links both in the supporting data and, where supported (e.g. if using
RDF), within the data itself. If the "easy steps" section is moved to
the end, then this could introduce Linked Data as a natural step
beyond publishing raw data, with perhaps a recap of the principles
pointing out how Linked Data fulfills all of them?

* Section "Preserve"

There's a hanging sentence in this section.

Issues to consider should be:

* preservation of URIs/URLs to ensure stability of linking to datasets
and data items
* versioning of datasets, so that people can cite and link to both new
and past versions. Logical links, e.g. "/latest" are also worth
considering for downloadable datasets
* formats: XML, RDF, etc. are arguably better for preservation than e.g. Excel
* supporting documentation that describes how a dataset may have
evolved, e.g. have terms or method of collection changed?
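
To sketch the versioning point in Python: each release keeps its own
permanent path, and the logical "/latest" link is just resolved to the
newest one. The "/data/<dataset>/<date>" layout, the function names,
and the sample dates are assumptions for the example, not something
the memo specifies.

```python
from datetime import date
from typing import Sequence


def versioned_path(dataset: str, released: date) -> str:
    """Permanent path for one dated release of a dataset."""
    return f"/data/{dataset}/{released.isoformat()}"


def latest_path(dataset: str, releases: Sequence[date]) -> str:
    """Path that the logical '/latest' link should currently point at."""
    if not releases:
        raise ValueError("no releases published yet")
    # The newest release date wins; older paths stay valid and citable.
    return versioned_path(dataset, max(releases))


# Invented release dates, deliberately out of order.
releases = [date(2008, 6, 1), date(2009, 8, 26), date(2009, 1, 15)]
print(latest_path("schools", releases))
```

A server would typically answer "/latest" with a redirect to the dated
path, so anything people cite still refers to a fixed version.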

* Section "Expose Interfaces"

I think this section should stress that:

 * as an initial step, raw machine-readable data and interfaces are
the crucial first goal; the community can create new and interesting
interfaces
 * offering both human and machine-readable interfaces should be a
best practice, enabling browsing and discovery for all audiences, but
don't focus on creating flashy visualisations if they detract from
delivering on the first goal
 * Using the principles of Linked Data and RDF, there's no need for a
separate API as the website is the API.
 * A SPARQL endpoint adds greater utility to RDF datasets
 * Where an API is going to be created, e.g. to publish data as XML,
then avoid using standards like SOAP and concentrate on using simple
RESTful patterns -- with references to relevant resources
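
One way to read "the website is the API" is that a single URI serves
either a human-readable page or machine-readable data, depending on
the HTTP Accept header. Below is a deliberately simplified Python
sketch of that selection step. The SUPPORTED table and the function
name are assumptions for the example, and it ignores partial wildcards
such as "text/*".

```python
# Assumed set of representations the site can serve for each resource.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "text/csv": "csv",
}


def choose_format(accept_header: str, default: str = "html") -> str:
    """Pick a representation for the client's Accept header."""
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # HTTP default when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
        if media_type == "*/*":
            return default
    # Simplification: fall back to the default rather than refusing.
    return default


print(choose_format("text/html;q=0.5, application/rdf+xml"))
```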

* Section "Choosing what to publish as data on the Web"

I think this section should come at the head of the document, then the
progression is: here's what you should be thinking about publishing;
here are some principles and issues to consider; and some steps to
achieve it.

The guidance ought to be to open up any non-personal data that the
government currently collects and maintains on behalf of its citizens,
with an emphasis on data for legislation, national statistics, and key
entities like registered companies, locations, administrative
boundaries, etc. I think part of the goal is not only to unlock data
that governments should be making easily available, it's also to create
an infrastructure that lets *others* begin to tie their data into an
authoritative URI space managed by the government and/or its
departments. So, e.g. having a unique identifier for every registered
company or school is as important as having information about those
resources.

The references to schemas and documentation could probably be included
in the "Document" section.

* Section "Social Issues"

Suggest this is renamed to "Licensing" and is expanded to stress the
importance of clear licensing of data, using an open,
non-transactional model. Ideally public domain dedications or open
licenses like CC0, PDDL, or ODbL should be used or customized to
achieve this.

--

I hope those comments are useful. I'm more than happy to help
contribute further to editing of the document.

Cheers,

L.

-- 
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com
Received on Wednesday, 26 August 2009 09:06:37 UTC