APIs and Lists from Jeni Tennison on 2009-12-13 (public-lod@w3.org from December 2009)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Sun, 13 Dec 2009 20:20:39 +0000
To: Linked Data community <public-lod@w3.org>
Cc: John Sheridan <John.Sheridan@nationalarchives.gsi.gov.uk>
Message-Id: <F177BB06-26B2-4154-A6D3-1FAC5C1BF8B4@jenitennison.com>
Hi,

Dave (Reynolds) raised the point that lists are an integral part of  
most APIs. This is another thing that we know we need to address in  
the UK linked government data project, but are unsure as yet how best  
to do so.

This is a bit of a brain dump of my current thinking, which is mostly  
packed with uncertainty! I'd be very grateful for any thoughts,  
guidance, pointers that you have.

The situation is that we have our nice linked data all up and  
available at suitable URIs, for example:

   http://education.data.gov.uk/id/school/520965

(somewhat password protected; if you ignore all the prompts you'll be  
able to see the HTML page just without any styling) but we need to  
somehow provide better mechanisms for people to navigate around it.

The kind of thing I'd like to see is support for URLs like:

   http://education.data.gov.uk/doc/school
     - list of all schools
   http://education.data.gov.uk/doc/school/phase/nursery
     - list of nursery schools
   http://education.data.gov.uk/doc/school/administrativeDistrict/47UE
     - list of schools whose administrative district is Worcester

and so on.

# Defining List Membership #

The first question is: How do we define which resources are members of  
a list?

I've discussed this quite briefly with Leigh and Ian (Davis). They  
seem to favour explicitly incorporating information about the  
membership of such lists within the triplestore itself. For example,  
perhaps something like:

   <http://education.data.gov.uk/id/school>
     a rdf:Bag ;
     rdfs:label "All Schools"@en ;
     rdfs:member
       <http://education.data.gov.uk/id/school/100000> ,
       <http://education.data.gov.uk/id/school/100001> ,
       <http://education.data.gov.uk/id/school/100002> ,
       ...

   <http://education.data.gov.uk/id/school/phase/nursery>
     a rdf:Bag ;
     rdfs:label "Nursery Schools"@en ;
     rdfs:member
       <http://education.data.gov.uk/id/school/500003> ,
       <http://education.data.gov.uk/id/school/500004> ,
       <http://education.data.gov.uk/id/school/500005> ,
       ...

   <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
     a rdf:Bag ;
     rdfs:label "Schools in Worcester"@en ;
     rdfs:member
       <http://education.data.gov.uk/id/school/116749> ,
       <http://education.data.gov.uk/id/school/116750> ,
       <http://education.data.gov.uk/id/school/116751> ,
       ...

This is reasonably nice in that you can make up whatever lists you  
like without being tied to particular conventions in the list URI. It  
also means that the information about what things are in what lists is  
right there, and queryable, within the triplestore. So you could find  
nursery schools in Worcester with:

   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   SELECT ?school
   WHERE {
     <http://education.data.gov.uk/id/school/phase/nursery>
       rdfs:member ?school .
     <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
       rdfs:member ?school .
   }

The difficulty lies in ensuring that the lists are correct to start  
with, especially when the data might come from multiple sources with  
their own generation routines, and that it remains up to date as the  
data changes over time. In particular, I think this approach really  
prevents you from layering an API over an existing set of data: the  
publishers of the linked data have to also determine how the API  
works, when otherwise those roles could be reasonably cleanly separated.

An alternative is to somehow define the lists in terms of a SPARQL  
query or a higher-level declarative mechanism. For example, we might do:

   <http://education.data.gov.uk/id/school>
     a api:List ;
     rdfs:label "Schools"@en ;
     api:itemType <http://education.data.gov.uk/def/school/School> .

or:

   <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
     a api:List ;
     rdfs:label "Schools in Worcester"@en ;
     api:where "?item <http://education.data.gov.uk/def/school/districtAdministrative> <http://statistics.data.gov.uk/id/local-authority-district/47UE>" .

or use something like SPIN [1] to express the query as RDF.
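
A minimal sketch of how a definition like the two above might be expanded into a SPARQL query. The dict keys `item_type` and `where` are stand-ins I've made up for the hypothetical api:itemType and api:where properties; nothing here is an existing vocabulary or library.

```python
def list_query(config):
    """Build a SPARQL SELECT for a list definition.

    `config` is a plain dict standing in for the RDF description of the
    list. The keys `item_type` and `where` are assumptions that mirror
    the hypothetical api:itemType and api:where properties.
    """
    patterns = []
    if "item_type" in config:
        patterns.append("?item a <%s> ." % config["item_type"])
    if "where" in config:
        # Normalise the supplied pattern so it ends with exactly one " ."
        patterns.append(config["where"].rstrip(" .") + " .")
    if not patterns:
        raise ValueError("list definition selects nothing")
    return "SELECT ?item WHERE { " + " ".join(patterns) + " }"


schools = list_query({
    "item_type": "http://education.data.gov.uk/def/school/School",
})
print(schools)
```

The point of keeping the definition declarative is that the same configuration could be published alongside the list and reused by whoever runs the API.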

Or we could go one level higher and do something like:

   <http://education.data.gov.uk/id/school/phase/*>
     a api:ListSet ;
     rdfs:label "Schools By Phase of Education"@en ;
     api:pattern "http://education.data.gov.uk/id/school/phase/(nursery|primary|secondary)"^^xsd:string ;
     api:map [
       api:regexGroup 1 ;
       api:property rdf:type ;
       api:enumeration [
           api:token "nursery" ;
           api:resource <http://education.data.gov.uk/def/school/TypeOfEstablishment_EY_Setting> ;
         ],
         ...
     ] .
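
To make the list-set idea concrete, here is a rough sketch of how a server might resolve one list URI against that pattern. The constant and function names are hypothetical, the dots in the pattern are escaped (the configuration above leaves them bare), and only the "nursery" token is mapped because it is the only entry spelled out.

```python
import re

# Hypothetical configuration mirroring the api:ListSet description.
PATTERN = re.compile(
    r"http://education\.data\.gov\.uk/id/school/phase/(nursery|primary|secondary)"
)
PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # rdf:type
ENUMERATION = {
    # Only this token is given explicitly; the others stay unconfigured.
    "nursery": "http://education.data.gov.uk/def/school/TypeOfEstablishment_EY_Setting",
}


def where_clause(list_uri):
    """Return a SPARQL triple pattern for list_uri, or None if unmapped."""
    match = PATTERN.fullmatch(list_uri)
    if match is None:
        return None
    resource = ENUMERATION.get(match.group(1))
    if resource is None:
        return None
    return "?item <%s> <%s>" % (PROPERTY, resource)


print(where_clause("http://education.data.gov.uk/id/school/phase/nursery"))
```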

It's not at all clear to me what the best approach is here. I tend to  
think that although a higher-level language might make things simpler  
in some ways, providing SPARQL queries gives the most flexibility.  
Anyone have any thoughts?

# Pagination and Sorting #

Lists are often going to be very long, so we'll need some way to  
support paging through the results that come back. It might also be  
useful to provide different sort orders. For example:

   http://education.data.gov.uk/doc/school?sort=label&startIndex=21&itemsPerPage=20

should give the second page of (20) results, in label order.
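
As a sketch of what that could mean on the query side, the request parameters might map onto SPARQL solution modifiers roughly as follows. This assumes startIndex is 1-based (as in the example), that the sort name can double as a query variable name, and that the defaults are label order and 20 items; the function name is made up.

```python
from urllib.parse import urlsplit, parse_qs


def paging_clauses(url, default_sort="label", default_size=20):
    """Turn ?sort=&startIndex=&itemsPerPage= into SPARQL solution modifiers.

    startIndex is treated as 1-based, so OFFSET is startIndex - 1.
    """
    params = parse_qs(urlsplit(url).query)
    sort = params.get("sort", [default_sort])[0]
    start = int(params.get("startIndex", ["1"])[0])
    size = int(params.get("itemsPerPage", [str(default_size)])[0])
    if start < 1 or size < 1:
        raise ValueError("startIndex and itemsPerPage must be positive")
    if not sort.isidentifier():
        # Refuse anything that could not safely be used as a variable name.
        raise ValueError("unsupported sort key")
    return "ORDER BY ?%s LIMIT %d OFFSET %d" % (sort, size, start - 1)


print(paging_clauses(
    "http://education.data.gov.uk/doc/school?sort=label&startIndex=21&itemsPerPage=20"
))
```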

What I thought here is that we should assign *collections* URIs like:

   http://education.data.gov.uk/id/school

These are unordered and unpaginated. A request would result in a 303  
redirect to the document:

   http://education.data.gov.uk/doc/school

which is the same as:

   http://education.data.gov.uk/doc/school?sort=label&startIndex=1&itemsPerPage=20

(say) and is the first page of the (ordered, paginated) list. The RDF  
graph actually returned would be something like:

   <http://education.data.gov.uk/id/school>
     rdfs:label "Schools"@en ;
     foaf:isPrimaryTopicOf <http://education.data.gov.uk/doc/school> .

   <http://education.data.gov.uk/doc/school>
     rdfs:label "Schools (First 20, Ordered Alphabetically)"@en ;
     foaf:primaryTopic <http://education.data.gov.uk/id/school> ;
     xhv:next <http://education.data.gov.uk/doc/school?startIndex=21> ;
     ... other metadata ...
     api:items (
       <http://education.data.gov.uk/id/school/135160>
       <http://education.data.gov.uk/id/school/135441>
       <http://education.data.gov.uk/id/school/135868>
       ...
     ) .

   <http://education.data.gov.uk/id/school/135160>
     rdfs:label "# New Comm Pri @ Allaway Avenue" ;
     ... other triples ...

   ... statements about the other members of this list ...

Note here that the triples about the collection are curtailed to not  
include all the members of the collection (since to include them would  
kinda defeat the purpose of the pagination). If the collection were  
defined through a mechanism other than a list of members, then  
including that configuration information would be a good thing to do  
here.
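
The xhv:next link in the page description could be computed along these lines. This is only a sketch: `total` (the size of the whole collection) is an assumed input, since I haven't said how the server would know it, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def next_page(url, total, default_size=20):
    """Return the URL of the following page, or None on the last page.

    `total` is the number of items in the whole collection (an assumed
    input). startIndex is treated as 1-based.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    start = int(params.get("startIndex", 1))
    size = int(params.get("itemsPerPage", default_size))
    if start + size > total:
        return None  # the current page already reaches the end
    params["startIndex"] = str(start + size)
    return urlunsplit(parts._replace(query=urlencode(params)))


print(next_page("http://education.data.gov.uk/doc/school", total=45))
```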

One possibility is to curtail the information that's available about  
each item: we'd probably always want to include label and type, but  
maybe other things would be useful as well, like location in the case  
of a school. Then again, perhaps it's best just to include everything  
we can about those items (eg a labelled concise bounded description  
[2]); clients can always filter out what they don't need.

To facilitate ordering, we need a capability to label the different  
properties with short names; this is similar to (maybe the same as)  
the requirement for JSON renditions of RDF, so it would make sense to  
use the same kind of rules (ie default to the local name of the  
property, provide configuration to override).
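
A minimal sketch of that rule, assuming the local name is whatever follows the last '#' (or, failing that, the last '/') in the property URI. The `overrides` mapping is a hypothetical stand-in for the configuration.

```python
def short_name(property_uri, overrides=None):
    """Short name for a property: configured override, else local name."""
    if overrides and property_uri in overrides:
        return overrides[property_uri]
    # Local name = text after the last '#' or, failing that, the last '/'.
    for separator in ("#", "/"):
        if separator in property_uri:
            return property_uri.rsplit(separator, 1)[1]
    return property_uri


print(short_name("http://www.w3.org/2000/01/rdf-schema#label"))
```

So `sort=label` would resolve to rdfs:label by default, and the configuration would only be needed where two properties share a local name or the local name is unhelpful.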

# Representations #

There are four kinds of representations that are particularly useful  
for lists:

   * RDF (RDF/XML, Turtle), obviously, for semweb heads
   * a feed (Atom or RSS) for humans to subscribe to
   * HTML for humans to look at
   * JSON of some description for normal developers

The first two could both be covered by using RSS 1.0 to describe the list.  
Does anyone have any thoughts about whether that would be the best  
approach? I think the only thing that makes me hesitate is the use of  
rdf:Seq to list the items, which I gather has fallen out of vogue.

---

Anyway, hopefully Dave will set up a Google Code project where we can  
try to spec some of this out and maybe get some implementations in  
place.

Any thoughts welcome!

Jeni

[1]: http://spinrdf.org/sp.html
[2]: http://n2.talis.com/wiki/Bounded_Descriptions_in_RDF#Labelled_Concise_Bounded_Description
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Sunday, 13 December 2009 20:21:14 UTC