Re: Creating JSON from RDF

Hi Dave :)

On 13 Dec 2009, at 13:34, Dave Reynolds wrote:
> Jeni Tennison wrote:
>> As part of the linked data work the UK government is doing, we're  
>> looking at how to use the linked data that we have as the basis of  
>> APIs that are readily usable by developers who really don't want to  
>> learn about RDF or SPARQL.
>
> Wow! Talk about timing. We are looking at exactly the same issue as  
> part of the TSB work and were starting to look at JSON formats just  
> this last couple of days. We should combine forces.

Excellent :)

>> One thing that we want to do is provide JSON representations of  
>> both RDF graphs and SPARQL results. I wanted to run some ideas past  
>> this group as to how we might do that.
>
> I agree we want both graphs and SPARQL results but I think there is  
> another third case - lists of described objects.

I absolutely agree with you that lists of described objects is an  
essential part of an API. In fact, I was going to (and will!) write a  
separate message about possible approaches for creating such lists.

It seemed to me that lists could be represented with RDF like:

   <http://statistics.data.gov.uk/doc/local-authority?page=1>
     rdfs:label "Local Authorities - Page 1" ;
     xhv:next <http://statistics.data.gov.uk/doc/local-authority?page=2> ;
     ...
     api:contents (
       <http://statistics.data.gov.uk/id/local-authority/00QA>
       <http://statistics.data.gov.uk/id/local-authority/00QB>
       <http://statistics.data.gov.uk/id/local-authority/45UB>
       ...
     )

This is just RDF, and as such any rules that we create about mapping  
RDF graphs to JSON could apply. (I agree that the list page should  
include extra information about the items in the list, but that seems  
to me to be a separable issue.)
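
For what it's worth, a rough sketch of how that list page might come
out in the kind of JSON we're discussing (the property names here are
purely illustrative, nothing is settled):

   {
     "_about": "http://statistics.data.gov.uk/doc/local-authority?page=1",
     "label": "Local Authorities - Page 1",
     "next": "http://statistics.data.gov.uk/doc/local-authority?page=2",
     "contents": [
       "http://statistics.data.gov.uk/id/local-authority/00QA",
       "http://statistics.data.gov.uk/id/local-authority/00QB",
       "http://statistics.data.gov.uk/id/local-authority/45UB"
     ]
   }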

[snip]
>> RDFj [3] is closer to what I think is needed here. However, I don't  
>> think there's a need for setting 'context' given I'm not aiming for  
>> an interchange format, there are no clear rules about how to  
>> generate it from an arbitrary graph (basically there can't be  
>> without some additional configuration) and it's not clear how to  
>> deal with datatypes or languages.
>
> WRT 'context' you might not need it but I don't think it is
> harmful.  I think if we said to developers that there is some outer  
> wrapper like:
>
> {
>   "format" : "RDF-JSON",
>   "version" : "0.1",
>   "mapping" :  ... magic stuff ...
>   "data" : ... the bit you care about ...
> }
>
> The developers would be quite happy doing that one dereference and
> ignoring the mapping stuff, but it might allow inversion back to RDF
> for those few who do care, or come to care.

OK. I agree that the 'come to care' is important. What I don't want is  
for concerns about round-tripping JSON data to override concerns about  
the usability of the resulting format.

>> I suppose my first question is whether there are any other JSON- 
>> based formats that we should be aware of, that we could use or  
>> borrow ideas from?
>
> The one that most intrigued me as a possible starting point was the  
> Simile Exhibit JSON format [1]. It is developer friendly in much the  
> way that you talk about but it has the advantage of zero  
> configuration, some measure of invertibility, has an online
> translator [2] and is supported by the RPI Sparql proxy [3].
>
> I've some reservations about standardizing on it as is:
> - lack of documentation of the mapping
> - some inconsistencies in how references between resources are  
> encoded (at least judging by the output of Babel[2] on test cases)
> - handling of bNodes - I'd rather single referenced bNodes were  
> serialized as nested structures

I agree, that looks pretty good but needs a bit of work.

One thing it makes me think is that perhaps JSON Schema [1] could form  
the basis of the mechanism for expressing any extra stuff that's  
required about the properties.
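
For instance (only a sketch of the idea; none of these schema details
are worked out), a fragment of JSON Schema describing a couple of
properties might look like:

   {
     "description": "Local authority",
     "type": "object",
     "properties": {
       "prefLabel": { "type": "string" },
       "nick": { "type": "array", "items": { "type": "string" } },
       "homepage": { "type": "string", "format": "uri" }
     }
   }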

> One starting assumption to call out: I'd like to aim for a zero  
> configuration option and that explicit configuration is only used to  
> help tidy things up but isn't required to get started.

Agreed.

> In terms of details I was thinking of following the Simile
> convention on short-form naming: in the absence of clashes, use
> the rdfs:label, falling back to the local name, as the basis for the
> shortened property names. So knowing nothing else the bNode would be:
>
>  ...
>    "editor": {
>       "fullName": "Dave Beckett",
>       "homePage": "http://purl.org/net/dajobe/"
>    }
>
> In the event of clashes then fall back on a prefix based  
> disambiguation.

Agreed. I rather deliberately and unnecessarily chose to change the  
names for the example in order to demonstrate the principle.

>> Note that the "$" is taken from RDFj. I'm not convinced it's a good  
>> idea to use this symbol, rather than simply a property called  
>> "about" or "this" -- any opinions?
>
> I'd prefer "id" (though "about" is OK), "$" is too heavily overused  
> in javascript libraries.

I agree. From the brief survey of JSON APIs that I did just now, it  
seems as though prefixing a reserved property name with a '_' is the  
usual thing. I'd suggest '_about' because it's similar to RDFa and  
because '_id', to me at least, implies a local identifier rather than  
a URI.
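
In other words, each object would start something like (reusing the
example from my earlier message):

   {
     "_about": "http://www.w3.org/TR/rdf-syntax-grammar",
     "title": "RDF/XML Syntax Specification (Revised)",
     ...
   }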

>> Also note that I've made no distinction in the above between a URI  
>> and a literal, while RDFj uses <>s around URIs. My feeling is
>> that normal developers really don't care about the distinction  
>> between a URI literal and a pointer to a resource, and that they  
>> will base the treatment of the value of a property on the (name of)  
>> the property itself.
>
> Probably right.
>
> Actually, in your example isn't that value a resource anyway? To  
> make it a literal you'd have to have:
>
>  ex:homePage "http://purl.org/net/dajobe/"^^xsd:anyURI

Yes, it was a resource in the example. In RDFj, it would have been:

    {
      "$": "http://www.w3.org/TR/rdf-syntax-grammar",
      "title": "RDF/XML Syntax Specification (Revised)",
      "editor": {
        "name": "Dave Beckett",
        "homepage": "<http://purl.org/net/dajobe/>"
      }
    }

(note the <>s around the value for the home page). The point I was  
making was that I propose that it be impossible to tell whether the  
value was a literal or a reference to another resource.

>> So, the first piece of configuration that I think we need here is  
>> to map properties on to short names that make good JSON identifiers  
>> (ie name tokens without hyphens). Given that properties normally  
>> have lowercaseCamelCase local names, it should be possible to use  
>> that as a default. If you need something more readable, though, it  
>> seems like it should be possible to use a property of the property,  
>> such as:
>>  ex:fullName api:jsonName "name" .
>>  ex:homePage api:jsonName "homepage" .
>
> Suggest Simile approach and have api:jsonName or your API as an  
> optional extra for resolving problems rather than a requirement.

Agreed; I intended this to be the case, but didn't make it explicit.

>> However, in any particular graph, there may be properties that have  
>> been given the same JSON name (or, even more probably, local name).  
>> We could provide multiple alternative names that could be chosen  
>> between, but any mapping to JSON is going to need to give  
>> consistent results across a given dataset for people to rely on it  
>> as an API, and that means the mapping can't be based on what's  
>> present in the data. We could do something with prefixes, but I  
>> have a strong aversion to assuming global prefixes.
>> So I think this means that we need to provide configuration at an  
>> API level rather than at a global level: something that can be used  
>> consistently across a particular API to determine the token that's  
>> used for a given property. For example:
>>  <> a api:JSON ;
>>    api:mapping [
>>      api:property ex:fullName ;
>>      api:name "name" ;
>>    ] , [
>>      api:property ex:homePage ;
>>      api:name "homepage" ;
>>    ] .
>
> Are you thinking of this as something the publisher provides or the  
> API caller provides?
>
> If the former, then OK but as I say I think a zero config set of  
> default conventions is OK with the API to allow fine tuning.

I'm thinking of this as something that the publisher of the API  
creates (to describe/define the API). Note, though, that the publisher  
of the API might not be the publisher of the data, and that it would
be feasible to have a service that would allow
clients to supply a configuration, point at a datastore, and have the  
API just work.

>> The trouble is that if you determine whether something is an array  
>> or not based on the data that is actually available, you'll get  
>> situations where the value of a particular JSON property is  
>> sometimes an array and sometimes a string; that's bad for  
>> predictability for the people using the API. (RDF/JSON solves this  
>> by every value being an array, but that's counter-intuitive for  
>> normal developers.)
>> So I think a second API-level configuration that needs to be made  
>> is to indicate which properties should be arrays and which not:
>>  <> a api:API ;
>>    api:mapping [
>>      api:property foaf:nick ;
>>      api:name "nick" ;
>>      api:array true ;
>>    ] .
>
> So if this is not specified in the mapping then you get the  
> unpredictable behaviour but by providing a mapping spec you can  
> force arrays on single values but not force singletons on multi- 
> values. Is that right? If so OK.

I guess there are two choices if there were no specification:

   1. always give one value for the property; if there are several
      values in the graph, then provide "the first"
   2. give an array when there are multiple values and a singleton
      when there's only one

I did have another vague notion of providing two properties side by  
side, one singular and one plural, so you would have:

   {
     "nick": "JeniT"
   }

or

   {
     "nicks": ["wilding", "wilda"]
   }

side by side in the same list of objects. But of course that would  
require configuration anyway (to provide pluralised versions of the  
label), so I'm not particularly taken with it.

It does concern me that if an RDF graph contains descriptions of
several resources of the same type, we might get into a situation
where the default behaviour would be different for two of those
resources; we need a way of reconciling this (for example, if any
resource in the graph has multiple values for a property, then that
property always uses an array).
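
To illustrate with made-up data (the URIs and nicks here are
hypothetical; the point is just the array behaviour): if one resource
had a single nick and another had two, both would come out as arrays:

   [
     {
       "_about": "http://example.com/people#someone",
       "nick": ["JeniT"]
     },
     {
       "_about": "http://example.com/people#someoneElse",
       "nick": ["wilding", "wilda"]
     }
   ]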

> There is a related issue: how to represent RDF lists. There are  
> times you want ordered property values. At the RDF end the good way  
> to do that is to use lists (sorry "collections"). I'd argue that a  
> natural representation of:
>
>   <http://example.com/ourpaper>
>       ex:authors (
>              <http://example.com/people#Jeni>
>               <http://example.com/people#Dave>
>       ) .
>
> is
>
>  {
>      "id" : "http://example.com/ourpaper",
>      "authors" : [
>         "http://example.com/people#Jeni",
>         "http://example.com/people#Dave"
>      ]
>  }
>
> The problem is that this looks just the same as the multi-valued case.
>
> We could:
> (1) decide not to care, the mapping can't be inverted
> (2) keep this mapping but include context information in the outer  
> wrapper that allows the inversion (in uniform cases)
> (3) have a separate list notation:
>
>  {
>      "id" : "http://example.com/ourpaper",
>      "authors" : { "type" : "list", "value" : [
>         "http://example.com/people#Jeni",
>         "http://example.com/people#Dave"
>      ] }
>  }
>
> My preference is (2) because I think lists are really useful and  
> should be as simple as possible in the JSON translation but think  
> (3) is technically cleaner.

I'd prefer either (1) or (2) and not (3). Since I don't care about  
reconstructing the RDF, I'm quite happy with (1) but doing (2)  
certainly wouldn't hurt given that we want to supply some metadata  
about the mapping.
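
If we did go for (2), I imagine the wrapper you sketched earlier could
carry the list information. Something along these lines, perhaps (the
shape of the "mapping" entry is pure guesswork on my part):

   {
     "format": "RDF-JSON",
     "version": "0.1",
     "mapping": {
       "authors": { "property": "ex:authors", "list": true }
     },
     "data": {
       "_about": "http://example.com/ourpaper",
       "authors": [
         "http://example.com/people#Jeni",
         "http://example.com/people#Dave"
       ]
     }
   }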

[snip]
> Language codes are effectively open ended. I can't necessarily  
> predict what lang codes are going to be in my data and provide a  
> property mapping for every single one.

I know they're *potentially* open-ended; I think in practice, for a  
single API, they are probably not. And even in the case of data that  
does have multiple languages (eg DBPedia) it would be possible to  
create a list based on the IANA language subtag registry [2] if you  
were concerned.

> Plus when working with language-tagged data you often have code to  
> do a "best match" (not simple lookup) between the user's language  
> preferences and the available lang tags. That looks hard if each is  
> in a different property and the lang tags themselves are hidden in  
> the API configuration.
>
> I think we may need the long winded encoding available:
>
> {
>  "id" : "http://statistics.data.gov.uk/id/local-authority-district/00PB",
>  "prefLabel" : [
>    "The County Borough of Bridgend",
>    { "value" : "The County Borough of Bridgend", "lang" : "en" },
>    { "value" : "Pen-y-bont ar Ogwr", "lang" : "cy" }
>  ]
>  ...
>
> Then it would be up to the publisher whether to provide the simpler
> properties as well or instead. But those could be regarded as
> transformations of the RDF for convenience (much like choosing to
> include RDFS closure info).

As I say, I'm not convinced that this is a big enough issue to sweat  
over, but another possibility would be to perform some basic string  
manipulation to create separate properties as required. For example:

  {
    "_about" : "http://statistics.data.gov.uk/id/local-authority-district/00PB 
",
    "prefLabel": "The County Borough of Bridgend",
    "prefLabel_en": "The County Borough of Bridgend",
    "prefLabel_cy": "Pen-y-bont ar Ogwr"
  }

Note that the language of the value of the property without the  
language suffix is probably something that you'd want in the API  
configuration (and possibly overridable by the client).

> Turning to data types ...
>
> Your onsCode examples are a particular pattern for how to use  
> datatypes which are indeed a similar case to lang tags. But how are  
> you thinking of handling the common cases like the XSD types?
>
> I'm assuming that all the number formats would become JSON
> numbers rather than strings, right? That loses the distinction
> between say xsd:decimal and xsd:float but javascript doesn't care
> about that and if we are not doing an interchange format that's OK.

Right.
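
(For example, with made-up property names, "123"^^xsd:decimal and
"98.6"^^xsd:float would both simply appear as JSON numbers:

   {
     "population": 123,
     "temperature": 98.6
   }

with no record of which XSD type they came from.)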

> For things like xsd:dateTime then there seems a couple of options.  
> The Simile type option would be to have them as strings but define  
> the range of the property in some associated context/properties table.
>
> The other would be to use a structured representation:
>
>  {
>      "id" : "http://example.com/ourpaper",
>      "date" : { "type" : "date", "value" : "20091213"}
>     ...
>
> I'm guessing you would just have them as strings and let the  
> consumer figure out when they want to treat them as dates, is that  
> right?

That would be my preference, but I think the strings should
(unfortunately) use formats understood by the JavaScript Date.parse()
method [3]. So the above would be:

   {
     "_about": "http://example.com/ourpaper",
     "date": "13 Dec, 2009"
   }

(A slight aside for interest: the Google Visualisation Data API [4]  
departs from JSON in a number of ways, one of which is to include a  
Date(...) syntax for dates.)

>> The one that's required could be indicated through the  
>> configuration, for example:
>>  <> a api:API ;
>>    api:mapping [
>>      api:property foaf:maker ;
>>      api:name "maker" ;
>>      api:embed true ;
>>    ] .
>
> My zero-configuration default would be to nest single-referenced  
> bNodes and have everything else as top level resources with cross- 
> references, as above.

Works for me.
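
So, roughly (a sketch reusing the example from earlier; the
"publisher" property and the second resource are made up to show the
cross-reference), the editor bNode stays nested while named resources
come out as top-level objects and are referred to by their URIs:

   [
     {
       "_about": "http://www.w3.org/TR/rdf-syntax-grammar",
       "title": "RDF/XML Syntax Specification (Revised)",
       "editor": {
         "fullName": "Dave Beckett",
         "homePage": "http://purl.org/net/dajobe/"
       },
       "publisher": "http://www.w3.org/"
     },
     {
       "_about": "http://www.w3.org/",
       "label": "W3C"
     }
   ]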

>> The final thought that I had for representing RDF graphs as JSON  
>> was about suppressing properties. Basically I'm thinking that this  
>> configuration should work on any graph, most likely one generated  
>> from a DESCRIBE query. That being the case, it's likely that there  
>> will be properties that repeat information (because, for example,  
>> they are a super-property of another property). It will make a  
>> cleaner JSON API if those repeated properties aren't included. So  
>> something like:
>>  <> a api:API ;
>>    api:mapping [
>>      api:property admingeo:contains ;
>>      api:ignore true ;
>>    ] .
>
> Seems reasonable but seems a separate issue from the JSON encoding.

OK, there's enough else to specify, so I'm quite happy not to be too
concerned about this.

I don't know where the best place is to work on this: I guess at some  
point it would be good to set up a Wiki page or something that we  
could use as a hub for discussion?

Cheers,

Jeni

[1]: http://json-schema.org/
[2]: http://www.iana.org/assignments/language-subtag-registry
[3]: https://developer.mozilla.org/En/Core_JavaScript_1.5_Reference/Objects/Date/Parse
[4]: http://code.google.com/apis/visualization/documentation/dev/implementing_data_source.html#jsondatatable
-- 
Jeni Tennison
http://www.jenitennison.com
