Re: Creating JSON from RDF from Dave Reynolds on 2009-12-13 (public-lod@w3.org from December 2009)

From: Dave Reynolds <dave.e.reynolds@googlemail.com>
Date: Sun, 13 Dec 2009 13:34:50 +0000
To: Jeni Tennison <jeni@jenitennison.com>
CC: public-lod@w3.org, Mark Birbeck <mark.birbeck@webbackplane.com>, John Sheridan <John.Sheridan@nationalarchives.gsi.gov.uk>
Message-ID: <4B24ED7A.6080203@gmail.com>

Hi Jeni,

Jeni Tennison wrote:

> As part of the linked data work the UK government is doing, we're 
> looking at how to use the linked data that we have as the basis of APIs 
> that are readily usable by developers who really don't want to learn 
> about RDF or SPARQL.

Wow! Talk about timing. We are looking at exactly the same issue as part 
of the TSB work and were starting to look at JSON formats just this last 
couple of days. We should combine forces.

> One thing that we want to do is provide JSON representations of both RDF 
> graphs and SPARQL results. I wanted to run some ideas past this group as 
> to how we might do that.

I agree we want both graphs and SPARQL results but I think there is 
a third case: lists of described objects.

This seems to have been a common pattern in the apps that I've worked 
on. You want to find all objects (resources in RDF speak) that match 
some criteria, with some ordering, and get back a list of them and their 
associated properties. This is like a SPARQL DESCRIBE operating on each 
of an ordered list of resources found by a SPARQL SELECT.

The point is that this is not a graph because the top level list needs 
to be ordered. It is not a SPARQL result set because you want the 
descriptions to include any of the properties that are present in the 
data (potentially including bNode closure) without having to know all 
those and spell them out in the query. But it is a natural thing to want 
to return from a REST API.
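
A rough sketch of that pattern in Python, purely for illustration. The triples, the property names and the helper names (`describe`, `list_described`, `title_of`) are all made up here; a real implementation would sit on a SPARQL store rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical in-memory triples: (subject, property, value).
TRIPLES = [
    ("http://example.com/b", "title", "Beta"),
    ("http://example.com/a", "title", "Alpha"),
    ("http://example.com/a", "tag", "x"),
    ("http://example.com/c", "title", "Gamma"),
]

def describe(subject, triples):
    """Collect every property present for a subject, without naming them in advance."""
    desc = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            desc[p].append(o)
    return {"id": subject, **desc}

def list_described(triples, matches, sort_key):
    """The SELECT-like step picks and orders subjects; the DESCRIBE-like step fills each in."""
    subjects = sorted({s for s, p, o in triples if matches(s, p, o)}, key=sort_key)
    return [describe(s, triples) for s in subjects]

def title_of(subject):
    """Sort key for this example: the subject's title."""
    return next(o for s, p, o in TRIPLES if s == subject and p == "title")

# Everything that has a title, ordered by title, each with all its properties.
result = list_described(TRIPLES, lambda s, p, o: p == "title", title_of)
```

The top level is an ordered list, and each entry carries whatever properties happen to be in the data, which is the combination that neither a graph nor a SELECT result set gives directly.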

> To put this in context, what I think we should aim for is a pure 
> publishing format that is optimised for approachability for normal 
> developers, *not* an interchange format. RDF/JSON [1] and the SPARQL 
> results JSON format [2] aren't entirely satisfactory as far as I'm 
> concerned because of the way the objects of statements are represented 
> as JSON objects rather than as simple values. I still think we should 
> produce them (to wean people on to, and for those using more generic 
> tools), but I'd like to think about producing something that is a bit 
> more immediately approachable too.
> 
> RDFj [3] is closer to what I think is needed here. However, I don't 
> think there's a need for setting 'context' given I'm not aiming for an 
> interchange format, there are no clear rules about how to generate it 
> from an arbitrary graph (basically there can't be without some 
> additional configuration) and it's not clear how to deal with datatypes 
> or languages.

WRT 'context', you might not need it but I don't think it is harmful. 
I think if we said to developers that there is some outer wrapper like:

{
    "format" : "RDF-JSON",
    "version" : "0.1",
    "mapping" :  ... magic stuff ... ,
    "data" : ... the bit you care about ...
}

The developers would be quite happy doing that one dereference and 
ignoring the mapping stuff, but the mapping might allow inversion back 
to RDF for those few who do care, or come to care.
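
To show how little the wrapper costs a consumer, here is a minimal Python sketch. The wrapper keys are the ones proposed above; the empty "mapping" value and the `payload` helper name are placeholders of mine, since the shape of the mapping is not settled in this thread.

```python
import json

# Hypothetical response using the wrapper keys proposed above.
# The "mapping" content is a placeholder; its real shape is undecided.
response_text = json.dumps({
    "format": "RDF-JSON",
    "version": "0.1",
    "mapping": {},
    "data": {
        "id": "http://www.w3.org/TR/rdf-syntax-grammar",
        "title": "RDF/XML Syntax Specification (Revised)",
    },
})

def payload(text):
    """One dereference: return only the part most developers care about."""
    return json.loads(text)["data"]

doc = payload(response_text)
```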

> I suppose my first question is whether there are any other JSON-based 
> formats that we should be aware of, that we could use or borrow ideas from?

The one that most intrigued me as a possible starting point was the 
Simile Exhibit JSON format [1]. It is developer friendly in much the way 
that you talk about but it has the advantage of zero configuration, some 
measure of invertibility, has an online translator [2] and is supported 
by the RPI SPARQL proxy [3].

I've some reservations about standardizing on it as is:
  - lack of documentation of the mapping
  - some inconsistencies in how references between resources are encoded 
(at least judging by the output of Babel[2] on test cases)
  - handling of bNodes - I'd rather single referenced bNodes were 
serialized as nested structures

[There was another format we used in a project in my previous existence 
but I'm not sure if that was made public anywhere, will check.]

> Assuming there aren't, I wanted to discuss what generic rules we might 
> use, where configuration is necessary and how the configuration might be 
> done.

One starting assumption to call out: I'd like to aim for a zero 
configuration option, with explicit configuration used only to help 
tidy things up rather than being required to get started.

> # RDF Graphs #
> 
> Let's take as an example:
> 
>   <http://www.w3.org/TR/rdf-syntax-grammar>
>     dc:title "RDF/XML Syntax Specification (Revised)" ;
>     ex:editor [
>       ex:fullName "Dave Beckett" ;
>       ex:homePage <http://purl.org/net/dajobe/> ;
>     ] .
> 
> In JSON, I think we'd like to create something like:
> 
>   {
>     "$": "http://www.w3.org/TR/rdf-syntax-grammar",
>     "title": "RDF/XML Syntax Specification (Revised)",
>     "editor": {
>       "name": "Dave Beckett",
>       "homepage": "http://purl.org/net/dajobe/"
>     }
>   }

+1 on style

In terms of details I was thinking of following the Simile convention on 
short form naming, which, in the absence of clashes, uses the rdfs:label 
(falling back to the localname) as the basis for the shortened property 
names. So knowing nothing else the bNode would be:

   ...
     "editor": {
        "fullName": "Dave Beckett",
        "homePage": "http://purl.org/net/dajobe/"
     }

In the event of clashes, fall back on prefix-based disambiguation.
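
A Python sketch of that naming rule, under some assumptions of mine: the `prefix_name` form for clashes, the `ns` fallback prefix and the function names are invented for illustration, and I have not checked this against what Simile actually does.

```python
from collections import defaultdict

def local_name(uri):
    """Everything after the last '#' or '/' in the URI."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]

def short_names(properties, labels, prefixes):
    """Map property URIs to short JSON names.

    properties: iterable of property URIs
    labels:     URI -> rdfs:label (may be empty)
    prefixes:   namespace URI -> prefix, used only when names clash
    """
    # Prefer the rdfs:label, fall back to the local name.
    candidates = {p: labels.get(p) or local_name(p) for p in properties}
    by_name = defaultdict(list)
    for uri, name in candidates.items():
        by_name[name].append(uri)

    result = {}
    for name, uris in by_name.items():
        for uri in uris:
            if len(uris) == 1:
                result[uri] = name
            else:
                # Clash: disambiguate with the namespace prefix (assumed form).
                namespace = uri[: len(uri) - len(local_name(uri))]
                result[uri] = f"{prefixes.get(namespace, 'ns')}_{name}"
    return result

names = short_names(
    [
        "http://xmlns.com/foaf/0.1/name",
        "http://example.org/vocab#name",
        "http://example.org/vocab#homePage",
    ],
    labels={},
    prefixes={
        "http://xmlns.com/foaf/0.1/": "foaf",
        "http://example.org/vocab#": "ex",
    },
)
```

Note that the result depends on which properties are in the input set, so it is only stable across a dataset if the set of properties is fixed up front.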

> Note that the "$" is taken from RDFj. I'm not convinced it's a good idea 
> to use this symbol, rather than simply a property called "about" or 
> "this" -- any opinions?

I'd prefer "id" (though "about" is OK); "$" is too heavily overused in 
JavaScript libraries.

> Also note that I've made no distinction in the above between a URI and a 
> literal, while RDFj uses <>s around literals. My feeling is that normal 
> developers really don't care about the distinction between a URI literal 
> and a pointer to a resource, and that they will base the treatment of 
> the value of a property on the (name of) the property itself.

Probably right.

Actually, in your example isn't that value a resource anyway? To make it 
a literal you'd have to have:

   ex:homePage "http://purl.org/net/dajobe/"^^xsd:anyURI

> So, the first piece of configuration that I think we need here is to map 
> properties on to short names that make good JSON identifiers (ie name 
> tokens without hyphens). Given that properties normally have 
> lowercaseCamelCase local names, it should be possible to use that as a 
> default. If you need something more readable, though, it seems like it 
> should be possible to use a property of the property, such as:
> 
>   ex:fullName api:jsonName "name" .
>   ex:homePage api:jsonName "homepage" .

Suggest the Simile approach, with api:jsonName (or your API) as an 
optional extra for resolving problems rather than a requirement.

> However, in any particular graph, there may be properties that have been 
> given the same JSON name (or, even more probably, local name). We could 
> provide multiple alternative names that could be chosen between, but any 
> mapping to JSON is going to need to give consistent results across a 
> given dataset for people to rely on it as an API, and that means the 
> mapping can't be based on what's present in the data. We could do 
> something with prefixes, but I have a strong aversion to assuming global 
> prefixes.
> 
> So I think this means that we need to provide configuration at an API 
> level rather than at a global level: something that can be used 
> consistently across a particular API to determine the token that's used 
> for a given property. For example:
> 
>   <> a api:JSON ;
>     api:mapping [
>       api:property ex:fullName ;
>       api:name "name" ;
>     ] , [
>       api:property ex:homePage ;
>       api:name "homepage" ;
>     ] .

Are you thinking of this as something the publisher provides or the API 
caller provides?

If the former, then OK, but as I say I think a zero-config set of default 
conventions is fine, with the API configuration there to allow fine tuning.

> There are four more areas where I think there's configuration we need to 
> think about:
> 
>   * multi-valued properties
>   * typed and language-specific values
>   * nesting objects
>   * suppressing properties
> 
> ## Multi-valued Properties ##
> 
> First one first. It seems obvious that if you have a property with 
> multiple values, it should turn into a JSON array structure. For example:
> 
>   [] foaf:name "Anna Wilder" ;
>     foaf:nick "wilding", "wilda" ;
>     foaf:homepage <http://example.org/about> .
> 
> should become something like:
> 
>   {
>     "name": "Anna Wilder",
>     "nick": [ "wilding", "wilda" ],
>     "homepage": "http://example.org/about"
>   }

+1

> The trouble is that if you determine whether something is an array or 
> not based on the data that is actually available, you'll get situations 
> where the value of a particular JSON property is sometimes an array and 
> sometimes a string; that's bad for predictability for the people using 
> the API. (RDF/JSON solves this by every value being an array, but that's 
> counter-intuitive for normal developers.)
> 
> So I think a second API-level configuration that needs to be made is to 
> indicate which properties should be arrays and which not:
> 
>   <> a api:API ;
>     api:mapping [
>       api:property foaf:nick ;
>       api:name "nick" ;
>       api:array true ;
>     ] .

So if this is not specified in the mapping then you get the 
unpredictable behaviour, but by providing a mapping spec you can force 
arrays on single values (though not force singletons on multi-values). 
Is that right? If so, OK.
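
That reading can be sketched in a few lines of Python. The function name and the `force_array` parameter are mine, standing in for whatever api:array ends up meaning.

```python
def shape_values(name, values, force_array=frozenset()):
    """Decide how one property's values appear in the JSON object.

    values:      the list of values found for this property on one resource
    force_array: names of properties configured to always be arrays
    """
    if name in force_array or len(values) != 1:
        return list(values)   # configured as an array, or genuinely multi-valued
    return values[0]          # zero-config default: a single value stays bare

unconfigured = shape_values("nick", ["wilda"])
configured = shape_values("nick", ["wilda"], force_array={"nick"})
multi = shape_values("nick", ["wilding", "wilda"])
```

Without configuration "nick" flips between a string and an array depending on the data; with it, the consumer can rely on always getting an array.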

There is a related issue: how to represent RDF lists. There are times 
you want ordered property values. At the RDF end the good way to do that 
is to use lists (sorry, "collections"). I'd argue that a natural 
representation of:

    <http://example.com/ourpaper>
        ex:authors (
               <http://example.com/people#Jeni>
               <http://example.com/people#Dave>
        ) .

is

   {
       "id" : "http://example.com/ourpaper",
       "authors" : [
          "http://example.com/people#Jeni",
          "http://example.com/people#Dave"
       ]
   }

The problem is that this looks just the same as the multi-valued case.

We could:
(1) decide not to care, the mapping can't be inverted
(2) keep this mapping but include context information in the outer 
wrapper that allows the inversion (in uniform cases)
(3) have a separate list notation:

   {
       "id" : "http://example.com/ourpaper",
       "authors" : { "type" : "list", "value" : [
          "http://example.com/people#Jeni",
          "http://example.com/people#Dave"
       ] }
   }

My preference is (2) because I think lists are really useful and should 
be as simple as possible in the JSON translation, though I think (3) is 
technically cleaner.
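
For comparison, this is roughly what option (3) would ask of a consumer, sketched in Python. The `read_values` name and the (items, ordered) return shape are illustrative choices of mine.

```python
def read_values(value):
    """Return (items, ordered) for a property value under option (3)."""
    if isinstance(value, dict) and value.get("type") == "list":
        return list(value["value"]), True    # explicit RDF list: order matters
    if isinstance(value, list):
        return list(value), False            # plain multi-valued property
    return [value], False                    # single value

doc = {
    "id": "http://example.com/ourpaper",
    "authors": {"type": "list", "value": [
        "http://example.com/people#Jeni",
        "http://example.com/people#Dave",
    ]},
}
authors, ordered = read_values(doc["authors"])
```

The extra branch is the price of (3): every consumer has to look for the wrapper object, whereas under (2) the array is used directly and the ordering information lives in the outer context.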

> ## Typed Values and Languages ##
> 
> Typed values and values with languages are really the same problem.

Not sure I agree with this, see later.

> If 
> we have something like:
> 
>   <http://statistics.data.gov.uk/id/local-authority-district/00PB>
>     skos:prefLabel "The County Borough of Bridgend"@en ;
>     skos:prefLabel "Pen-y-bont ar Ogwr"@cy ;
>     skos:notation "00PB"^^geo:StandardCode ;
>     skos:notation "6405"^^transport:LocalAuthorityCode .
> 
> then we'd really want the JSON to look something like:
> 
>   {
>     "$": "http://statistics.data.gov.uk/id/local-authority-district/00PB",
>     "name": "The County Borough of Bridgend",
>     "welshName": "Pen-y-bont ar Ogwr",
>     "onsCode": "00PB",
>     "dftCode": "6405"
>   }
> 
> I think that for this to work, the configuration needs to be able to 
> filter values based on language or datatype to determine the JSON 
> property name. Something like:
> 
>   <> a api:JSON ;
>     api:mapping [
>       api:property skos:prefLabel ;
>       api:lang "en" ;
>       api:name "name" ;
>     ] , [
>       api:property skos:prefLabel ;
>       api:lang "cy" ;
>       api:name "welshName" ;
>     ] , [
>       api:property skos:notation ;
>       api:datatype geo:StandardCode ;
>       api:name "onsCode" ;
>     ] , [
>       api:property skos:notation ;
>       api:datatype transport:LocalAuthorityCode ;
>       api:name "dftCode" ;
>     ] .

Neat but ...

Language codes are effectively open ended. I can't necessarily predict 
what lang codes are going to be in my data and provide a property 
mapping for every single one.

Plus when working with language-tagged data you often have code to do a 
"best match" (not simple lookup) between the user's language preferences 
and the available lang tags. That looks hard if each is in a different 
property and the lang tags themselves are hidden in the API configuration.

I think we may need the long winded encoding available:

{
   "id" : "http://statistics.data.gov.uk/id/local-authority-district/00PB",
   "prefLabel" : [
     "The County Borough of Bridgend",
     { "value" : "The County Borough of Bridgend", "lang" : "en" },
     { "value" : "Pen-y-bont ar Ogwr", "lang" : "cy" }
   ]
   ...

Then it would be up to the publisher whether to provide the simpler 
properties as well or instead. But those could be regarded as 
transformations of the RDF for convenience (much like choosing to 
include RDFS closure info).
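
To illustrate why keeping the lang tags visible helps, here is a simplified "best match" in Python over the long-winded encoding. It truncates each preferred tag from the right until something matches, which is only a rough approximation of proper language-tag lookup; the `best_label` name and the fallback to an untagged string are my assumptions.

```python
def best_label(values, preferences):
    """Pick a label for the user's language preferences, most preferred first.

    values: a mix of plain strings and {"value": ..., "lang": ...} objects.
    """
    tagged = {}
    untagged = None
    for v in values:
        if isinstance(v, dict):
            tagged.setdefault(v["lang"].lower(), v["value"])
        elif untagged is None:
            untagged = v

    for pref in preferences:
        parts = pref.lower().split("-")
        while parts:
            candidate = "-".join(parts)   # e.g. "cy-gb", then "cy"
            if candidate in tagged:
                return tagged[candidate]
            parts.pop()
    return untagged                       # nothing matched: untagged value, if any

pref_label = [
    "The County Borough of Bridgend",
    {"value": "The County Borough of Bridgend", "lang": "en"},
    {"value": "Pen-y-bont ar Ogwr", "lang": "cy"},
]
welsh = best_label(pref_label, ["cy-GB", "en"])
fallback = best_label(pref_label, ["fr"])
```

With separate "name" and "welshName" properties this kind of matching would need the API configuration to hand, which is the difficulty described above.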

Turning to data types ...

Your onsCode examples are a particular pattern for how to use datatypes 
which are indeed a similar case to lang tags. But how are you thinking 
of handling the common cases like the XSD types?

I'm assuming that all the number formats would become JSON numbers 
rather than strings, right? That loses the distinction between say 
xsd:decimal and xsd:float but JavaScript doesn't care about that and if 
we are not doing an interchange format that's OK.
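
A small Python sketch of that assumption. The particular set of XSD types listed and the `to_json_value` name are illustrative only, and special values such as INF or NaN are not handled.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"
# Illustrative subsets; a fuller mapping would cover more XSD numeric types.
INTEGER_TYPES = {XSD + t for t in ("integer", "int", "long", "short")}
FLOAT_TYPES = {XSD + t for t in ("decimal", "float", "double")}

def to_json_value(lexical, datatype=None):
    """Turn a literal's lexical form into a JSON-friendly Python value."""
    if datatype in INTEGER_TYPES:
        return int(lexical)
    if datatype in FLOAT_TYPES:
        return float(lexical)   # decimal vs float distinction is lost here
    return lexical              # everything else stays a string

as_decimal = to_json_value("3.50", XSD + "decimal")
as_float = to_json_value("3.50", XSD + "float")
encoded = json.dumps({"n": to_json_value("42", XSD + "integer")})
```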

For things like xsd:dateTime there seem to be a couple of options. The 
Simile-style option would be to have them as strings but define the range 
of the property in some associated context/properties table.

The other would be to use a structured representation:

   {
       "id" : "http://example.com/ourpaper",
       "date" : { "type" : "date", "value" : "20091213"}
      ...

I'm guessing you would just have them as strings and let the consumer 
figure out when they want to treat them as dates, is that right?

> ## Nesting Objects ##
> 
> Regarding nested objects, I'm again inclined to view this as a 
> configuration option rather than something that is based on the 
> available data. For example, if we have:
> 
>   <http://example.org/about>
>     dc:title "Anna's Homepage"@en ;
>     foaf:maker <http://example.org/anna> .
> 
>   <http://example.org/anna>
>     foaf:name "Anna Wilder" ;
>     foaf:homepage <http://example.org/about> .
> 
> this could be expressed in JSON as either:
> 
>   {
>     "$": "http://example.org/about",
>     "title": "Anna's Homepage",
>     "maker": {
>       "$": "http://example.org/anna",
>       "name": "Anna Wilder",
>       "homepage": "http://example.org/about"
>     }
>   }
> 
> or:
> 
>   {
>     "$": "http://example.org/anna",
>     "name": "Anna Wilder",
>     "homepage": {
>       "$": "http://example.org/about",
>       "title": "Anna's Homepage",
>       "maker": "http://example.org/anna"
>     }
>   }

Or:

[
   {
     "id": "http://example.org/about",
     "title": "Anna's Homepage",
     "maker": "http://example.org/anna"
   },

   {
     "id": "http://example.org/anna",
     "name": "Anna Wilder",
     "homepage": "http://example.org/about"
   }
]

> The one that's required could be indicated through the configuration, 
> for example:
> 
>   <> a api:API ;
>     api:mapping [
>       api:property foaf:maker ;
>       api:name "maker" ;
>       api:embed true ;
>     ] .

My zero-configuration default would be to nest single-referenced bNodes 
and have everything else as top level resources with cross-references, 
as above.
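
A Python sketch of that default, with several simplifying assumptions that are mine: triples are plain tuples, bNodes are strings starting with "_:", property names are already shortened, each property is single-valued, and cycles of bNodes are not handled.

```python
def to_json_objects(triples):
    """Nest bNodes referenced exactly once; everything else is top level."""
    def is_bnode(term):
        return isinstance(term, str) and term.startswith("_:")

    # Count how many times each bNode appears as an object.
    ref_count = {}
    for s, p, o in triples:
        if is_bnode(o):
            ref_count[o] = ref_count.get(o, 0) + 1
    nested = {b for b, n in ref_count.items() if n == 1}

    def build(subject):
        # Nested bNodes get no "id"; top-level resources keep theirs.
        obj = {} if subject in nested else {"id": subject}
        for s, p, o in triples:
            if s == subject:
                obj[p] = build(o) if o in nested else o
        return obj

    # Top-level subjects in first-seen order.
    subjects = []
    for s, p, o in triples:
        if s not in nested and s not in subjects:
            subjects.append(s)
    return [build(s) for s in subjects]

result = to_json_objects([
    ("http://www.w3.org/TR/rdf-syntax-grammar", "title",
     "RDF/XML Syntax Specification (Revised)"),
    ("http://www.w3.org/TR/rdf-syntax-grammar", "editor", "_:b0"),
    ("_:b0", "fullName", "Dave Beckett"),
    ("_:b0", "homePage", "http://purl.org/net/dajobe/"),
])
```

For the RDF/XML spec example earlier in this message, the editor bNode ends up nested under "editor", while URI-named resources stay as separate top-level objects that refer to each other by id.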

> The final thought that I had for representing RDF graphs as JSON was 
> about suppressing properties. Basically I'm thinking that this 
> configuration should work on any graph, most likely one generated from a 
> DESCRIBE query. That being the case, it's likely that there will be 
> properties that repeat information (because, for example, they are a 
> super-property of another property). It will make a cleaner JSON API if 
> those repeated properties aren't included. So something like:
> 
>   <> a api:API ;
>     api:mapping [
>       api:property admingeo:contains ;
>       api:ignore true ;
>     ] .

Seems reasonable, but it looks like a separate issue from the JSON encoding.

> # SPARQL Results #
> 
> I'm inclined to think that creating JSON representations of SPARQL 
> results that are acceptable to normal developers is less important than 
> creating JSON representations of RDF graphs, for two reasons:
> 
>   1. SPARQL naturally gives short, usable, names to the properties in 
> JSON objects
>   2. You have to be using SPARQL to create them anyway, and if you're 
> doing that then you can probably grok the extra complexity of having 
> values that are objects

+1

> Nevertheless, there are two things that could be done to simplify the 
> SPARQL results format for normal developers.
> 
> One would be to just return an array of the results, rather than an 
> object that contains a results property that contains an object with a 
> bindings property that contains an array of the results. People who want 
> metadata can always request the standard SPARQL results JSON format.

This seems quite minor; it's very easy to do the deref.

> The second would be to always return simple values rather than objects. 
> For example, rather than:
> 
>   {
>     "head": {
>       "vars": [ "book", "title" ]
>     },
>     "results": {
>       "bindings": [
>         {
>           "book": {
>             "type": "uri",
>             "value": "http://example.org/book/book6"
>           },
>           "title": {
>             "type": "literal",
>             "value": "Harry Potter and the Half-Blood Prince"
>           }
>         },
>         {
>           "book": {
>             "type": "uri",
>             "value": "http://example.org/book/book5"
>           },
>           "title": {
>             "type": "literal",
>             "value": "Harry Potter and the Order of the Phoenix"
>           }
>         },
>         ...
>       ]
>     }
>   }
> 
> a normal developer would want to just get:
> 
>   [{
>     "book": "http://example.org/book/book6",
>     "title": "Harry Potter and the Half-Blood Prince"
>    },{
>      "book": "http://example.org/book/book5",
>      "title": "Harry Potter and the Order of the Phoenix"
>    },
>    ...
>   ]
> 
> I don't think we can do any configuration here. It means that 
> information about datatypes and languages isn't visible, but (a) I'm 
> pretty sure that 80% of the time that doesn't matter, (b) there's always 
> the full JSON version if people need it and (c) they could write SPARQL 
> queries that used the datatype/language to populate different 
> variables/properties if they wanted to.

+1
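
For what it's worth, the simplification is only a few lines on top of the standard results format. A Python sketch, where the `simplify` name is mine and the input follows the standard SPARQL results JSON structure quoted above:

```python
def simplify(sparql_json):
    """Flatten standard SPARQL JSON results into a list of plain objects.

    Datatype and language information is deliberately dropped.
    """
    return [
        {var: binding[var]["value"] for var in binding}
        for binding in sparql_json["results"]["bindings"]
    ]

standard = {
    "head": {"vars": ["book", "title"]},
    "results": {"bindings": [
        {
            "book": {"type": "uri", "value": "http://example.org/book/book6"},
            "title": {"type": "literal",
                      "value": "Harry Potter and the Half-Blood Prince"},
        },
        {
            "book": {"type": "uri", "value": "http://example.org/book/book5"},
            "title": {"type": "literal",
                      "value": "Harry Potter and the Order of the Phoenix"},
        },
    ]},
}
simple = simplify(standard)
```

Variables that are unbound in a given row are simply absent from that row's object, as they are in the standard format.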

> So there you are. I'd really welcome any thoughts or pointers about any 
> of this: things I've missed, vocabularies we could reuse, things that 
> you've already done along these lines, and so on. Reasons why none of 
> this is necessary are fine too, but I'll warn you in advance that I'm 
> unlikely to be convinced ;)

Thanks so much for getting this started and kicking off with such 
detailed suggestions.

Cheers,
Dave

[1] The data model is described at:
http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_Database
The JSON page is unhelpful!
http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_JSON_Format
But there is some documentation:
http://simile.mit.edu/wiki/Exhibit/Creating,_Importing,_and_Managing_Data
[2] http://simile.mit.edu/babel/
[3] http://data-gov.tw.rpi.edu/ws/sparqlproxy.php
Received on Sunday, 13 December 2009 13:35:34 UTC