Re: APIs and Lists from Dave Reynolds on 2009-12-15 (public-lod@w3.org from December 2009)

From: Dave Reynolds <dave.e.reynolds@googlemail.com>
Date: Tue, 15 Dec 2009 12:07:09 +0000
To: Jeni Tennison <jeni@jenitennison.com>
CC: Linked Data community <public-lod@w3.org>, John Sheridan <John.Sheridan@nationalarchives.gsi.gov.uk>
Message-ID: <4B277BED.6070105@gmail.com>

Hi Jeni,

Sorry to be slow to reply to this one.

Jeni Tennison wrote:

> Dave (Reynolds) raised the point that lists are an integral part of most 
> APIs. This is another thing that we know we need to address in the UK 
> linked government data project, but are unsure as yet how best to do so.


> This is a bit of a brain dump of my current thinking, which is mostly 
> packed with uncertainty! I'd be very grateful for any thoughts, 
> guidance, pointers that you have.

I was getting bogged down trying to do my own dump inline, so instead 
I've put up a blog post [1] about a pattern we ended up using for this. 
This is in the spirit of sharing experiences rather than claiming it as 
the one true pattern.

> # Defining List Membership #

As noted in [1] we decided to use SPARQL SELECT to define the list 
membership.

This lets you order the lists, which we often found important (the 
rdf:Bag/rdfs:member pattern is, of course, unordered).

It also subsumes the other choices. If you want to define membership 
explicitly in the data store then the SELECT may be trivial:

    SELECT ?resource WHERE {
       <http://education.data.gov.uk/id/school/phase/nursery>
           rdfs:member ?resource . }


You can also define membership in terms of an OWL class description, in 
which case the final access becomes:

    SELECT ?resource WHERE { ?resource a <http://myowlclass> . }

So you are able to do the meat of the definition inline in the store or 
in some inference layer but are not *required* to do this.
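
To illustrate how a single SELECT-based mechanism can cover both cases 
and add ordering, here is a minimal Python sketch. The registry, the 
http://example.org/ list URI, the function name and the label-ordered 
variant are assumptions made for illustration; they are not the API 
described in [1]. The rdfs prefix declaration is added so the queries 
are complete.

```python
# Hypothetical registry: each list URI maps to the SPARQL SELECT that
# defines its membership. This is an illustration, not the API from [1].
RDFS = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"

LIST_DEFINITIONS = {
    # Membership stored explicitly in the data store.
    "http://education.data.gov.uk/id/school/phase/nursery": RDFS + (
        "SELECT ?resource WHERE {\n"
        "  <http://education.data.gov.uk/id/school/phase/nursery>\n"
        "      rdfs:member ?resource . }"
    ),
    # Membership via a class (possibly inferred). The list URI is made up;
    # the class URI is the placeholder used in the message.
    "http://example.org/list/by-class": (
        "SELECT ?resource WHERE { ?resource a <http://myowlclass> . }"
    ),
}


def ordered_by_label(select_query):
    """Return a variant of a simple SELECT that is ordered by rdfs:label.

    Assumes the query binds ?resource and ends with the closing brace of
    a single WHERE block. This is string handling, not a SPARQL parser.
    """
    body = select_query.rstrip()
    if not body.endswith("}"):
        raise ValueError("expected the query to end with '}'")
    inner = body[:-1].rstrip()
    prefix = "" if "PREFIX rdfs:" in body else RDFS
    return prefix + inner + " ?resource rdfs:label ?label . } ORDER BY ?label"
```

Calling ordered_by_label() on either definition yields a query whose 
results come back in label order, which is the property the unordered 
rdfs:member pattern cannot give you on its own.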

>   <http://education.data.gov.uk/id/school>
>     a api:List ;
>     rdfs:label "Schools"@en ;
>     api:itemType <http://education.data.gov.uk/def/school/School> .
> 
> or:
> 
>   <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
>     a api:List ;
>     rdfs:label "Schools in Worcester"@en ;
>     api:where "?item <http://education.data.gov.uk/def/school/districtAdministrative> <http://statistics.data.gov.uk/id/local-authority-district/47UE>" .
> 
> or use something like SPIN [1] to express the query as RDF.

In our case we defined the queries via an API but could also store them 
in an RDF configuration file as strings in SPARQL syntax. Exploding the 
query into RDF structures wasn't needed but YMMV.

> Or we could go one level higher and do something like:
> 
>   <http://education.data.gov.uk/id/school/phase/*>
>     a api:ListSet ;
>     rdfs:label "Schools By Phase of Education"@en ;
>     api:pattern "http://education.data.gov.uk/id/school/phase/(nursery|primary|secondary)"^^xsd:string ;
>     api:map [
>       api:regexGroup 1 ;
>       api:property rdf:type ;
>       api:enumeration [
>           api:token "nursery" ;
>           api:resource <http://education.data.gov.uk/def/school/TypeOfEstablishment_EY_Setting> ;
>         ],
>         ...
>     ] .
> 
> It's not at all clear to me what the best approach is here. I tend to 
> think that although a higher-level language might make things simpler in 
> some ways, providing SPARQL queries gives the most flexibility. 

Agreed, that's the way we went and it seemed OK as far as we pushed it.

> # Pagination and Sorting #
> 
> Lists are often going to be very long, so we'll need some way to support 
> paging through the results that come back. It might also be useful to 
> provide different sort orders. For example:
> 
>   http://education.data.gov.uk/doc/school?sort=label&startIndex=21&itemsPerPage=20
> 
> should give the second page of (20) results, in label order.
> 
> What I thought here is that we should assign *collections* URIs like:
> 
>   http://education.data.gov.uk/id/school
> 
> These are unordered and unpaginated. A request would result in a 303 
> redirect to the document:
> 
>   http://education.data.gov.uk/doc/school
> 
> which is the same as:
> 
>   http://education.data.gov.uk/doc/school?sort=label&startIndex=1&itemsPerPage=20
> 
> (say) and is the first page of the (ordered, paginated) list. The RDF 
> graph actually returned would be something like:
> 
>   <http://education.data.gov.uk/id/school>
>     rdfs:label "Schools"@en ;
>     foaf:isPrimaryTopicOf <http://education.data.gov.uk/doc/school> .
> 
>   <http://education.data.gov.uk/doc/school>
>     rdfs:label "Schools (First 20, Ordered Alphabetically)"@en ;
>     foaf:primaryTopic <http://education.data.gov.uk/id/school> ;
>     xhv:next <http://education.data.gov.uk/doc/school?startIndex=21> ;
>     ... other metadata ...
>     api:items (
>       <http://education.data.gov.uk/id/school/135160>
>       <http://education.data.gov.uk/id/school/135441>
>       <http://education.data.gov.uk/id/school/135868>
>       ...
>     ) .
> 
>   <http://education.data.gov.uk/id/school/135160>
>     rdfs:label "# New Comm Pri @ Allaway Avenue" ;
>     ... other triples ...
> 
>   ... statements about the other members of this list ...
> 
> Note here that the triples about the collection are curtailed to not 
> include all the members of the collection (since to include them would 
> kinda defeat the purpose of the pagination). If the collection were 
> defined through a mechanism other than a list of members, then including 
> that configuration information would be a good thing to do here.

As described in [1] we separated the notion of the overall set of 
results from that of a window (aka page) onto that set.
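
One way to picture the set/window split is as a solution modifier 
applied to the list's query. The sketch below is an assumption about 
how that could be done, not the implementation described in [1]; the 
function names are hypothetical, and the 1-based start index follows 
the startIndex parameter in the quoted URLs.

```python
def window_query(select_query, start_index=1, items_per_page=20):
    """Restrict an ordered SELECT to one window of the result set.

    start_index is 1-based, so it maps to a 0-based SPARQL OFFSET.
    The query should already carry an ORDER BY; without one, LIMIT and
    OFFSET do not select predictable subsets.
    """
    if start_index < 1 or items_per_page < 1:
        raise ValueError("start_index and items_per_page must be >= 1")
    offset = start_index - 1
    return f"{select_query.rstrip()}\nLIMIT {items_per_page} OFFSET {offset}"


def next_start_index(start_index, items_per_page, total):
    """Return the start index of the following window, or None at the end."""
    following = start_index + items_per_page
    return following if following <= total else None
```

With startIndex=21 and itemsPerPage=20 this appends 
"LIMIT 20 OFFSET 20" to the query, i.e. the second page of twenty.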

In the JSON version of the window we returned an array of JSON-ized 
resources so we got both ordering and the data itself in one package.
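
As a rough sketch of what such a JSON window could look like, the 
Python below wraps an ordered sequence of resource descriptions in a 
JSON object. The field names ("startIndex", "itemCount", "items") and 
the shape of each resource are assumptions for illustration only; the 
actual format is the one described in [1].

```python
import json


def json_window(resources, start_index):
    """Serialise one window as JSON; the array preserves result order.

    `resources` is a sequence of dicts, one per resource, already in the
    order produced by the list's SELECT query.
    """
    window = {
        "startIndex": start_index,    # hypothetical field name
        "itemCount": len(resources),  # hypothetical field name
        "items": list(resources),     # JSON arrays are ordered
    }
    return json.dumps(window, indent=2)


# Example resources, using school URIs from the quoted message.
page = json_window(
    [
        {"id": "http://education.data.gov.uk/id/school/135160",
         "label": "# New Comm Pri @ Allaway Avenue"},
        {"id": "http://education.data.gov.uk/id/school/135441"},
    ],
    start_index=1,
)
```

Because the ordering lives in the array itself, a client needs no 
separate index to reconstruct the list order.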

In the RDF version we returned a subgraph containing the descriptions of 
the resources along with a list of the resources as an indirect property 
hanging off the ResultSet resource.

> One possibility is to curtail the information that's available about 
> each item: we'd probably always want to include label and type, but 
> maybe other things would be useful as well, like location in the case of 
> a school. Then again, perhaps it's best just to include everything we 
> can about those items (eg a labelled concise bounded description [2]); 
> clients can always filter out what they don't need.

We found it useful both to have a default DESCRIBE (which in our case is 
indeed a concise bounded description) and to allow for subsets expressed 
via SPARQL CONSTRUCT templates.
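
The choice between the two can be sketched as follows. This is an 
illustrative assumption rather than the code behind [1]: the function 
name and the convention that templates refer to the resource as ?item 
are hypothetical. Note that what DESCRIBE returns is up to the 
endpoint; a concise bounded description is one common choice.

```python
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def description_query(resource_uri, template=None):
    """Build the query that fetches the description of one list item.

    With no template, fall back to DESCRIBE. With a template (triple
    patterns over ?item), return only the matching subset via CONSTRUCT.
    """
    if template is None:
        return f"DESCRIBE <{resource_uri}>"
    return (
        f"CONSTRUCT {{ {template} }}\n"
        f"WHERE {{ {template} FILTER (?item = <{resource_uri}>) }}"
    )


# A "label and type only" subset, as suggested in the quoted message.
LABEL_AND_TYPE = f"?item {RDFS_LABEL} ?label . ?item {RDF_TYPE} ?type ."
```

Because the same pattern is used in both the template and the WHERE 
clause, an item missing either a label or a type produces no triples; 
wrapping parts of the WHERE pattern in OPTIONAL would relax that.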


> # Representations #
> 
> There are four kinds of representations that are particularly useful for 
> lists:
> 
>   * RDF (RDF/XML, Turtle), obviously, for semweb heads
>   * a feed (Atom or RSS) for humans to subscribe to
>   * HTML for humans to look at
>   * JSON of some description for normal developers
> 
> The first two could be hit through using RSS 1.0 to describe the list. 
> Does anyone have any thoughts about whether that would be the best 
> approach? I think the only thing that makes me hesitate is the use of 
> rdf:Seq to list the items, which I gather has fallen out of vogue.

We found JSON arrays in the JSON very nice and used RDF 
collections/lists (i.e. not Seq) in the RDF. From the point of view of 
Turtle rendering and the RDF APIs, lists are generally fine; it's 
"only" the RDF/XML that is horrid.

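To make the contrast concrete, this sketch shows the same ordered list 
twice: as the compact Turtle collection syntax, and as the 
rdf:first/rdf:rest triples it stands for. The function names and 
blank-node labels are hypothetical; only the RDF vocabulary terms are 
standard.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def turtle_collection(uris):
    """Render an ordered sequence of URIs as a Turtle collection."""
    if not uris:
        return "()"
    return "( " + " ".join(f"<{u}>" for u in uris) + " )"


def list_triples(uris):
    """Expand the same list into rdf:first/rdf:rest triples.

    Returns (head, triples), where head is the first list node
    (rdf:nil for an empty list).
    """
    nil = RDF_NS + "nil"
    if not uris:
        return nil, []
    nodes = [f"_:b{i}" for i in range(len(uris))]
    triples = []
    for i, uri in enumerate(uris):
        rest = nodes[i + 1] if i + 1 < len(uris) else nil
        triples.append((nodes[i], RDF_NS + "first", uri))
        triples.append((nodes[i], RDF_NS + "rest", rest))
    return nodes[0], triples
```

Each item costs two triples in the underlying graph, which Turtle and 
most RDF APIs hide behind list syntax or helper classes.
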
Dave

[1] http://www.amberdown.net/2009/12/rdf-result-sets/

Received on Tuesday, 15 December 2009 12:07:53 UTC