Re: APIs and Lists from Herbert Van de Sompel on 2009-12-13 (public-lod@w3.org from December 2009)

From: Herbert Van de Sompel <hvdsomp@gmail.com>
Date: Sun, 13 Dec 2009 16:59:15 -0500
To: Jeni Tennison <jeni@jenitennison.com>
Cc: Linked Data community <public-lod@w3.org>, John Sheridan <John.Sheridan@nationalarchives.gsi.gov.uk>
Message-Id: <241BD758-7F01-4310-98F5-93AC41DE826E@gmail.com>

Hi Jeni

I wonder whether ORE Aggregations could be (part of) a solution:

http://www.openarchives.org/ore/1.0/toc

Greetings

Herbert Van de Sompel

Sent from my iPhone

On Dec 13, 2009, at 15:20, Jeni Tennison <jeni@jenitennison.com> wrote:

> Hi,
>
> Dave (Reynolds) raised the point that lists are an integral part of  
> most APIs. This is another thing that we know we need to address in  
> the UK linked government data project, but are unsure as yet how  
> best to do so.
>
> This is a bit of a brain dump of my current thinking, which is  
> mostly packed with uncertainty! I'd be very grateful for any  
> thoughts, guidance, pointers that you have.
>
> The situation is that we have our nice linked data all up and  
> available at suitable URIs, for example:
>
>  http://education.data.gov.uk/id/school/520965
>
> (somewhat password protected; if you ignore all the prompts you'll  
> be able to see the HTML page just without any styling) but we need  
> to somehow provide better mechanisms for people to navigate around it.
>
> The kind of thing I'd like to see is support for URLs like:
>
>  http://education.data.gov.uk/doc/school
>    - list of all schools
>  http://education.data.gov.uk/doc/school/phase/nursery
>    - list of nursery schools
>  http://education.data.gov.uk/doc/school/administrativeDistrict/47UE
>    - list of schools whose administrative district is Worcester
>
> and so on.
>
> # Defining List Membership #
>
> The first question is: How do we define which resources are members  
> of a list?
>
> I've discussed this quite briefly with Leigh and Ian (Davis). They  
> seem to favour explicitly incorporating information about the  
> membership of such lists within the triplestore itself. For example,  
> perhaps something like:
>
>  <http://education.data.gov.uk/id/school>
>    a rdf:Bag ;
>    rdfs:label "All Schools"@en ;
>    rdfs:member
>      <http://education.data.gov.uk/id/school/100000> ,
>      <http://education.data.gov.uk/id/school/100001> ,
>      <http://education.data.gov.uk/id/school/100002> ,
>      ...
>
>  <http://education.data.gov.uk/id/school/phase/nursery>
>    a rdf:Bag ;
>    rdfs:label "Nursery Schools"@en ;
>    rdfs:member
>      <http://education.data.gov.uk/id/school/500003> ,
>      <http://education.data.gov.uk/id/school/500004> ,
>      <http://education.data.gov.uk/id/school/500005> ,
>      ...
>
>  <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
>    a rdf:Bag ;
>    rdfs:label "Schools in Worcester"@en ;
>    rdfs:member
>      <http://education.data.gov.uk/id/school/116749> ,
>      <http://education.data.gov.uk/id/school/116750> ,
>      <http://education.data.gov.uk/id/school/116751> ,
>      ...
>
> This is reasonably nice in that you can make up whatever lists you  
> like without being tied to particular conventions in the list URI.  
> It also means that the information about what things are in what  
> lists is right there, and queryable, within the triplestore. So you  
> could find nursery schools in Worcester with:
>
>  SELECT ?school
>  WHERE {
>    <http://education.data.gov.uk/id/school/phase/nursery>
>      rdfs:member ?school .
>    <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
>      rdfs:member ?school .
>  }
>
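
As a rough illustration of the explicit-membership idea, here is a minimal Python sketch that treats the store as a set of (subject, predicate, object) tuples and intersects the members of two lists. The triple direction follows the Turtle above (the list is the subject of rdfs:member). The school URIs, the `triples` set and the helper names are all made up for the example; a real deployment would query the triplestore instead.

```python
# Hypothetical in-memory stand-in for the triplestore: (subject, predicate, object) tuples.
# The member URIs below are invented for illustration; they are not real data.
BASE = "http://education.data.gov.uk/id/school"
NURSERY = BASE + "/phase/nursery"
WORCESTER = BASE + "/administrativeDistrict/47UE"
MEMBER = "rdfs:member"

triples = {
    (NURSERY, MEMBER, BASE + "/1"),
    (NURSERY, MEMBER, BASE + "/2"),
    (WORCESTER, MEMBER, BASE + "/2"),
    (WORCESTER, MEMBER, BASE + "/3"),
}


def members(list_uri, store):
    """Return the set of resources that list_uri points at via rdfs:member."""
    return {o for (s, p, o) in store if s == list_uri and p == MEMBER}


def in_all_lists(list_uris, store):
    """Intersect the member sets of every list in list_uris."""
    sets = [members(uri, store) for uri in list_uris]
    return set.intersection(*sets) if sets else set()


nursery_in_worcester = in_all_lists([NURSERY, WORCESTER], triples)
```

The sketch also makes the maintenance worry concrete: every membership triple has to be present and current in the store for the intersection to be right.
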
> The difficulty lies in ensuring that the lists are correct to start  
> with, especially when the data might come from multiple sources with  
> their own generation routines, and that it remains up to date as the  
> data changes over time. In particular, I think this approach really  
> prevents you from layering an API over an existing set of data: the  
> publishers of the linked data have to also determine how the API  
> works, when otherwise those roles could be reasonably cleanly  
> separated.
>
> An alternative is to somehow define the lists in terms of a SPARQL  
> query or a higher-level declarative mechanism. For example, we might  
> do:
>
>  <http://education.data.gov.uk/id/school>
>    a api:List ;
>    rdfs:label "Schools"@en ;
>    api:itemType <http://education.data.gov.uk/def/school/School> .
>
> or:
>
>  <http://education.data.gov.uk/id/school/administrativeDistrict/47UE>
>    a api:List ;
>    rdfs:label "Schools in Worcester"@en ;
>    api:where "?item <http://education.data.gov.uk/def/school/districtAdministrative> <http://statistics.data.gov.uk/id/local-authority-district/47UE>" .
>
> or use something like SPIN [1] to express the query as RDF.
>
> Or we could go one level higher and do something like:
>
>  <http://education.data.gov.uk/id/school/phase/*>
>    a api:ListSet ;
>    rdfs:label "Schools By Phase of Education"@en ;
>    api:pattern "http://education.data.gov.uk/id/school/phase/(nursery|primary|secondary)"^^xsd:string ;
>    api:map [
>      api:regexGroup 1 ;
>      api:property rdf:type ;
>      api:enumeration [
>          api:token "nursery" ;
>          api:resource <http://education.data.gov.uk/def/school/TypeOfEstablishment_EY_Setting> ;
>        ],
>        ...
>    ] .
>
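
To make the api:ListSet idea a little more concrete, here is a minimal Python sketch of how a server might resolve a list URI against the pattern and enumeration above. The api: vocabulary is only a proposal in this message, so the function name and return shape are assumptions. Only the "nursery" token is mapped, because the other enumeration entries are elided in the example.

```python
import re

# Pattern taken from the api:pattern example; dots are escaped so they match literally.
LIST_PATTERN = re.compile(
    r"http://education\.data\.gov\.uk/id/school/phase/(nursery|primary|secondary)"
)

# Only the "nursery" entry appears in the message. The other tokens are elided
# there, so they are deliberately left unmapped in this sketch.
ENUMERATION = {
    "nursery": "http://education.data.gov.uk/def/school/TypeOfEstablishment_EY_Setting",
}


def list_constraint(list_uri):
    """Return (property, resource) for a list URI, or None if it cannot be resolved."""
    match = LIST_PATTERN.fullmatch(list_uri)
    if match is None:
        return None
    resource = ENUMERATION.get(match.group(1))  # corresponds to api:regexGroup 1
    if resource is None:
        return None
    return ("rdf:type", resource)  # corresponds to api:property rdf:type
```

The returned pair could then be turned into a single triple pattern such as `?item rdf:type <resource>` when building the query for the list.
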
> It's not at all clear to me what the best approach is here. I tend  
> to think that although a higher-level language might make things  
> simpler in some ways, providing SPARQL queries gives the most  
> flexibility. Anyone have any thoughts?
>
> # Pagination and Sorting #
>
> Lists are often going to be very long, so we'll need some way to  
> support paging through the results that come back. It might also be  
> useful to provide different sort orders. For example:
>
>  http://education.data.gov.uk/doc/school?sort=label&startIndex=21&itemsPerPage=20
>
> should give the second page of (20) results, in label order.
>
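
The paging arithmetic implied by startIndex and itemsPerPage can be sketched in a few lines of Python. This assumes startIndex is 1-based, as the example URL suggests (startIndex=21 with itemsPerPage=20 being the second page). The function names and the rule for when a next link exists are assumptions for illustration, not part of any agreed API.

```python
from urllib.parse import urlencode


def page_slice(items, start_index=1, items_per_page=20):
    """Return the page of items that begins at the 1-based start_index."""
    if start_index < 1 or items_per_page < 1:
        raise ValueError("start_index and items_per_page must be at least 1")
    begin = start_index - 1  # convert to a 0-based offset
    return items[begin:begin + items_per_page]


def next_page_url(base_url, total, sort, start_index=1, items_per_page=20):
    """Build the URL of the following page, or return None on the last page."""
    next_start = start_index + items_per_page
    if next_start > total:
        return None
    query = urlencode(
        {"sort": sort, "startIndex": next_start, "itemsPerPage": items_per_page}
    )
    return f"{base_url}?{query}"


# Made-up labels, already in label order, standing in for a sorted query result.
labels = [f"School {n:03d}" for n in range(1, 46)]
second_page = page_slice(labels, start_index=21, items_per_page=20)
```

In SPARQL terms the same request would presumably map onto ORDER BY with OFFSET startIndex - 1 and LIMIT itemsPerPage.
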
> What I thought here is that we should assign *collections* URIs like:
>
>  http://education.data.gov.uk/id/school
>
> These are unordered and unpaginated. A request would result in a 303  
> redirect to the document:
>
>  http://education.data.gov.uk/doc/school
>
> which is the same as:
>
>  http://education.data.gov.uk/doc/school?sort=label&startIndex=1&itemsPerPage=20
>
> (say) and is the first page of the (ordered, paginated) list. The  
> RDF graph actually returned would be something like:
>
>  <http://education.data.gov.uk/id/school>
>    rdfs:label "Schools"@en ;
>    foaf:isPrimaryTopicOf <http://education.data.gov.uk/doc/school> .
>
>  <http://education.data.gov.uk/doc/school>
>    rdfs:label "Schools (First 20, Ordered Alphabetically)"@en ;
>    foaf:primaryTopic <http://education.data.gov.uk/id/school> ;
>    xhv:next <http://education.data.gov.uk/doc/school?startIndex=21> ;
>    ... other metadata ...
>    api:items (
>      <http://education.data.gov.uk/id/school/135160>
>      <http://education.data.gov.uk/id/school/135441>
>      <http://education.data.gov.uk/id/school/135868>
>      ...
>    ) .
>
>  <http://education.data.gov.uk/id/school/135160>
>    rdfs:label "# New Comm Pri @ Allaway Avenue" ;
>    ... other triples ...
>
>  ... statements about the other members of this list ...
>
> Note here that the triples about the collection are curtailed to not  
> include all the members of the collection (since to include them  
> would kinda defeat the purpose of the pagination). If the collection  
> were defined through a mechanism other than a list of members, then  
> including that configuration information would be a good thing to do  
> here.
>
> One possibility is to curtail the information that's available about  
> each item: we'd probably always want to include label and type, but  
> maybe other things would be useful as well, like location in the  
> case of a school. Then again, perhaps it's best just to include  
> everything we can about those items (eg a labelled concise bounded  
> description [2]); clients can always filter out what they don't need.
>
> To facilitate ordering, we need a capability to label the different  
> properties with short names; this is similar to (maybe the same as)  
> the requirement for JSON renditions of RDF, so it would make sense  
> to use the same kind of rules (ie default to the local name of the  
> property, provide configuration to override).
>
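
The short-name rule described here (default to the local name of the property, with configuration to override) could look something like the following Python sketch. The function names and the shape of the overrides mapping are assumptions for illustration.

```python
def local_name(property_uri):
    """Return the part of the URI after its last '#' or '/'."""
    cut = max(property_uri.rfind("#"), property_uri.rfind("/"))
    return property_uri[cut + 1:]


def short_name(property_uri, overrides=None):
    """Return the configured short name if there is one, else the local name."""
    overrides = overrides or {}
    return overrides.get(property_uri, local_name(property_uri))


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

default_key = short_name(RDFS_LABEL)
configured_key = short_name(RDFS_LABEL, {RDFS_LABEL: "name"})
```

With this in place a request such as ?sort=label would be resolved by looking for the property whose short name is "label". Two properties with the same local name would clash, which is where the override configuration would be needed.
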
> # Representations #
>
> There are four kinds of representations that are particularly useful  
> for lists:
>
>  * RDF (RDF/XML, Turtle), obviously, for semweb heads
>  * a feed (Atom or RSS) for humans to subscribe to
>  * HTML for humans to look at
>  * JSON of some description for normal developers
>
> The first two could both be covered by using RSS 1.0 to describe the  
> list. Does anyone have any thoughts about whether that would be the  
> best approach? I think the only thing that makes me hesitate is the  
> use of rdf:Seq to list the items, which I gather has fallen out of  
> vogue.
>
> ---
>
> Anyway, hopefully Dave will set up a Google Code project where we  
> can try to spec some of this out and maybe get some implementations  
> in place.
>
> Any thoughts welcome!
>
> Jeni
>
> [1]: http://spinrdf.org/sp.html
> [2]: http://n2.talis.com/wiki/Bounded_Descriptions_in_RDF#Labelled_Concise_Bounded_Description
> -- 
> Jeni Tennison
> http://www.jenitennison.com
>
>
Received on Sunday, 13 December 2009 21:59:58 UTC