Re: Creating JSON from RDF

Hi Jeni,

On Sat, Dec 12, 2009 at 9:42 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi,
>
> As part of the linked data work the UK government is doing, we're looking at
> how to use the linked data that we have as the basis of APIs that are
> readily usable by developers who really don't want to learn about RDF or
> SPARQL.

Great.


> One thing that we want to do is provide JSON representations of both RDF
> graphs and SPARQL results. I wanted to run some ideas past this group as to
> how we might do that.

Great again. :)

In the work I've been doing, I've concluded that in JSON-world, an RDF
graph should be a JSON object (as explained in RDFj [1], and as you
seem to concur), but also that SPARQL queries should return RDFj
objects too.

In other words, after a lot of playing around I concluded that there
was nothing to be gained from differentiating between representations
of graphs, and the results of queries.


> To put this in context, what I think we should aim for is a pure publishing
> format that is optimised for approachability for normal developers, *not* an
> interchange format. RDF/JSON and the SPARQL results JSON format
> aren't entirely satisfactory as far as I'm concerned because of the way the
> objects of statements are represented as JSON objects rather than as simple
> values. I still think we should produce them (to wean people on to, and for
> those using more generic tools), but I'd like to think about producing
> something that is a bit more immediately approachable too.

+72. :)

I would also put irJSON into this category, which I see is referred to
in a later post.


> RDFj is closer to what I think is needed here.

Good.


> However, I don't think
> there's a need for setting 'context' given I'm not aiming for an interchange
> format, there are no clear rules about how to generate it from an arbitrary
> graph (basically there can't be without some additional configuration) and
> it's not clear how to deal with datatypes or languages.

I probably didn't explain the use of 'context' well enough, but since
I think you do need it, I'll explain it further below.


> I suppose my first question is whether there are any other JSON-based
> formats that we should be aware of, that we could use or borrow ideas from?

I also did a thorough search, before devising RDFj.

In general I found many 'interchange formats', as you call them, but I
didn't find anything that came from the other direction, saying 'how
should we interpret JavaScript data as RDF'.

I think this approach is what makes RDFj different, because it is
trying as much as possible to leave the JS alone, and provide a layer
of /interpretation/. It's exactly how I approached RDFa -- I began
with HTML mark-up, such as <link>, <a> and <meta>, and asked myself
what an RDF interpretation of each 'pattern' would be.

(Which incidentally shows that there is plenty more work that could be
done on this, in RDFa; what is an RDF interpretation of @cite,
<blockquote>, and <img>, for example?)


> Assuming there aren't, I wanted to discuss what generic rules we might use,
> where configuration is necessary and how the configuration might be done.

Excellent. It may sound sacrilegious, but I happen to think that in
lots of ways RDFj is more important than RDFa. Consequently, I was
beginning to worry that everyone was quite happy with the 'interchange
formats', and didn't see the point of discussing a more 'natural' JSON
approach!


> # RDF Graphs #
>
> Let's take as an example:
>
>  <http://www.w3.org/TR/rdf-syntax-grammar>
>    dc:title "RDF/XML Syntax Specification (Revised)" ;
>    ex:editor [
>      ex:fullName "Dave Beckett" ;
>      ex:homePage <http://purl.org/net/dajobe/> ;
>    ] .
>
> In JSON, I think we'd like to create something like:
>
>  {
>    "$": "http://www.w3.org/TR/rdf-syntax-grammar",
>    "title": "RDF/XML Syntax Specification (Revised)",
>    "editor": {
>      "name": "Dave Beckett",
>      "homepage": "http://purl.org/net/dajobe/"
>    }
>  }

Definitely.

Key things are that -- other than the subject -- it's very familiar to
JS programmers. In particular, there's no verbose use of 'name' and
'value' properties to demarcate the predicates and objects, in the way
that 'interchange formats' do. Also, there are no explicit bnodes to
indicate that one statement's subject is another's object -- the
natural flow of JavaScript is used.

These were key design goals for RDFj.


> Note that the "$" is taken from RDFj. I'm not convinced it's a good idea to
> use this symbol, rather than simply a property called "about" or "this" --
> any opinions?

I agree, and in my RDFj description I do say that since '$' is used in
a lot of Ajax libraries, I should find something else.

However, in my view, the 'something else' shouldn't look like a
predicate, so I don't think 'about' or 'this' (or 'id' as someone
suggests later in the thread), should be used. (Note also that 'id' is
used in a related but slightly different way by Dojo.)

Also, the underscore is generally related to bnodes, so it might be
confusing on quick reads through. (We have a JSON audience and an RDF
audience, and need to make design decisions with both in mind.)

I've often thought about the empty string, '@' and other
possibilities, but haven't had a chance to try them out. E.g., the
empty string would simply look like this:

  {
    "": "http://www.w3.org/TR/rdf-syntax-grammar",
      "title": "RDF/XML Syntax Specification (Revised)",
      "editor": {
        "name": "Dave Beckett",
        "homepage": "http://purl.org/net/dajobe/"
      }
  }

Since I always tend to indent the predicates in RDFj anyway, just to
draw attention to them, the empty string is reasonably visible.
However, "@" would be even more obvious:

  {
    "@": "http://www.w3.org/TR/rdf-syntax-grammar",
      "title": "RDF/XML Syntax Specification (Revised)",
      "editor": {
        "name": "Dave Beckett",
        "homepage": "http://purl.org/net/dajobe/"
      }
  }

Anyway, it shouldn't be that difficult to come up with something.


> Also note that I've made no distinction in the above between a URI and a
> literal, while RDFj uses <>s around literals. My feeling is that normal
> developers really don't care about the distinction between a URI literal and
> a pointer to a resource, and that they will base the treatment of the value
> of a property on the (name of) the property itself.

That's true, but I think we gain a lot by making the distinction. I'd
also suggest that for JS authors it's not a difficult thing to grasp.

Also, it's not just URIs that would use a richer syntax; although it
hasn't yet been implemented in my parser, my plan for RDFj has always
been to use N3-like notation inside the string values, such as for
languages:

  {
    "name": [ "Ivan Herman", "Herman Iván@hu" ]
  }

My thinking was also that an RDFj 'processor' would tweak the objects,
to make some of this RDF metadata available to programmers. For
example:

  var foo = RDFJ.import({
    "name": [
      "Ivan Herman",
      "Herman Iván@hu"
    ]
  });

  assert(foo.name[0] === "Ivan Herman");
  assert(foo.name[1] === "Herman Iván@hu");

  assert(foo.name[0].value === "Ivan Herman");
  assert(foo.name[1].value === "Herman Iván");
  assert(foo.name[1].lang === "hu");

The same principle would apply to other data types and URIs.

(Note that there is nothing to stop you leaving off the angle
brackets, if you know that you'll manage all of the processing
yourself. The point is that RDFj intends to provide *both* a 'JSON as
RDF' technique, *and* an 'RDF as JSON' technique.)
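
To make that concrete, here's a rough sketch -- my own illustration,
not the as-yet-unimplemented RDFj parser -- of how a processor might
pull the URI, language and datatype information back out of a plain
string value:

  // Speculative sketch only: "<...>" is a URI, a trailing "@xx" is a
  // language tag, "^^..." is a datatype; anything else is a plain
  // literal. (A real parser would need escaping rules for values that
  // legitimately contain "@" or "^^".)
  function parseObject(raw) {
    if (raw.charAt(0) === "<" && raw.charAt(raw.length - 1) === ">") {
      return { type: "uri", value: raw.slice(1, -1) };
    }
    var caret = raw.lastIndexOf("^^");
    if (caret > 0) {
      return { type: "literal", value: raw.slice(0, caret),
               datatype: raw.slice(caret + 2) };
    }
    var at = raw.lastIndexOf("@");
    if (at > 0) {
      return { type: "literal", value: raw.slice(0, at),
               lang: raw.slice(at + 1) };
    }
    return { type: "literal", value: raw };
  }

  parseObject("<http://purl.org/net/dajobe/>");
  // => { type: "uri", value: "http://purl.org/net/dajobe/" }

  parseObject("Herman Iván@hu");
  // => { type: "literal", value: "Herman Iván", lang: "hu" }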


> So, the first piece of configuration that I think we need here is to map
> properties on to short names...

That's what 'tokens' in the 'context' object do, in RDFj.


> ... that make good JSON identifiers (ie name tokens
> without hyphens). Given that properties normally have
> lowercaseCamelCase local names, it should be possible
> to use that as a default.

I don't follow why you have this requirement (no hyphens) -- where
does it come from?

Anyway, in RDFj you don't need to abbreviate the predicates:

  {
    "http://xmlns.com/foaf/0.1/name": "Dave Beckett"
    "http://xmlns.com/foaf/0.1/homepage": "<http://purl.org/net/dajobe/>"
  }

But of course, you can abbreviate them if you want to.


> If you need
> something more readable, though, it seems like it should be possible to use
> a property of the property, such as:
>
>  ex:fullName api:jsonName "name" .
>  ex:homePage api:jsonName "homepage" .

The simplest technique for providing token mappings is to keep the
mappings separate from the graph itself. That's what RDF/XML does with
namespace prefixes, N3 does with @prefix, and RDFj does with the
'context' object:

  {
    context: {
      token: {
        "http://xmlns.com/foaf/0.1/homepage": "homepage",
        "http://xmlns.com/foaf/0.1/name": "name"
      }
    },
    "name": "Dave Beckett",
    "homepage": "<http://purl.org/net/dajobe/>"
  }

Of course, the token being mapped can be anything, and what's quite
handy about this is that JSON objects over which we have no control
can still be converted to RDF, simply by adding a context object.

For example, if some service returned:

  {
    "fullName": "Dave Beckett",
    "url": "<http://purl.org/net/dajobe/>"
  }

to convert this to RDF we assign it to a variable, and add a context:

  var foo = goGetSomeData( url );

  foo.context = {
    token: {
      "http://xmlns.com/foaf/0.1/homepage": "url",
      "http://xmlns.com/foaf/0.1/name": "fullName"
    }
  };

The foo object can now be interpreted as RDF, via RDFj.
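
To show what I mean by 'interpreted', here's a minimal sketch -- my
own reading of the idea, not the backplanejs implementation -- that
walks such an object, inverts the token map, and emits one triple per
property value:

  // Sketch only: handles a flat object with an optional context.token
  // map; nested objects, datatypes and so on are ignored for brevity.
  function interpret(obj) {
    var tokens = (obj.context && obj.context.token) || {};
    var uriFor = {};                       // token -> full predicate URI
    for (var uri in tokens) {
      uriFor[tokens[uri]] = uri;
    }

    var subject = obj["$"] || "_:b0";      // no "$" means a bnode subject
    var triples = [];
    for (var key in obj) {
      if (key === "$" || key === "context") continue;
      var values = obj[key] instanceof Array ? obj[key] : [ obj[key] ];
      for (var i = 0; i < values.length; i++) {
        triples.push({ s: subject, p: uriFor[key] || key, o: values[i] });
      }
    }
    return triples;
  }

  interpret(foo);
  // => [ { s: "_:b0", p: "http://xmlns.com/foaf/0.1/name",
  //        o: "Dave Beckett" },
  //      { s: "_:b0", p: "http://xmlns.com/foaf/0.1/homepage",
  //        o: "<http://purl.org/net/dajobe/>" } ]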


> However, in any particular graph, there may be properties that have been
> given the same JSON name (or, even more probably, local name). We could
> provide multiple alternative names that could be chosen between, but any
> mapping to JSON is going to need to give consistent results across a given
> dataset for people to rely on it as an API, and that means the mapping can't
> be based on what's present in the data. We could do something with prefixes,
> but I have a strong aversion to assuming global prefixes.

I'm not sure here whether the goal is to map /any/ API to RDF, but if
it is I think that's a separate problem to the 'JSON as RDF' question.

In passing, my approach to converting feeds -- for example a Twitter
feed -- into RDF is to make use of named graph support in SPARQL
queries, and then provide a few triples that describe how a URI that
appears in a SPARQL query -- as a named graph URI -- should be
processed to obtain triples. I call these 'named graph mappers' [2].
There's a lot more that can be done in this area, but the key thing is
that much of the information that you are referring to, that guides
the processing, should in my view be at the query level.


> So I think this means that we need to provide configuration at an API level
> rather than at a global level: something that can be used consistently
> across a particular API to determine the token that's used for a given
> property. For example:
>
>  <> a api:JSON ;
>    api:mapping [
>      api:property ex:fullName ;
>      api:name "name" ;
>    ] , [
>      api:property ex:homePage ;
>      api:name "homepage" ;
>    ] .

The advantage of the RDFj solution (using context.token) is that the
mappings travel with the data, i.e., it is independent of any API.


> There are four more areas where I think there's configuration we need to
> think about:
>
>  * multi-valued properties
>  * typed and language-specific values
>  * nesting objects
>  * suppressing properties
>
> ## Multi-valued Properties ##
>
> First one first. It seems obvious that if you have a property with multiple
> values, it should turn into a JSON array structure. For example:
>
>  [] foaf:name "Anna Wilder" ;
>    foaf:nick "wilding", "wilda" ;
>    foaf:homepage <http://example.org/about> .
>
> should become something like:
>
>  {
>    "name": "Anna Wilder",
>    "nick": [ "wilding", "wilda" ],
>    "homepage": "http://example.org/about"
>  }
>

Right. For those who haven't read the RDFj proposal [1], this example
is taken from there (although in my version I have angle brackets on
the resource -- see above).


> The trouble is that if you determine whether something is an array or not
> based on the data that is actually available, you'll get situations where
> the value of a particular JSON property is sometimes an array and sometimes
> a string; that's bad for predictability for the people using the API.
> (RDF/JSON solves this by every value being an array, but that's
> counter-intuitive for normal developers.)

I'm not sure I fully understand the problem here... sorry about
that. The difficulty I have is that I can read what you're saying in
two ways.

One interpretation is that, given the following JavaScript:

  {
    "name": [ "Ivan Herman", "Herman Iván@hu" ]
  }

there is no way to tell whether the RDF representation should be two
triples, where each object is a literal (N3):

  [
    foaf:name "Ivan Herman", "Herman Iván@hu"
  ] .

or one triple where the object is a JSON array (N3 again):

  [
    foaf:name "'Ivan Herman', 'Herman Iván@hu'"^^json:arrary
  ] .

I don't /think/ this is what you are saying, but if it is, I think the
first case is easily the most useful, and so we should just assume
that all arrays represent multiple objects of the same predicate
(i.e., it's like the comma in N3).

The second possible interpretation is that, when working with a JSON
object, developers need to know when a property can hold an array, and
when it will be a single value.

If that's what you mean, then I'll flag up the approach I've taken in
my RDFj processor, which is to *always* test whether a value is an
array: if it is, generate one triple per item, and if not, treat it
as a single value.

For programmers working with the JSON object it's much the same; if
we simply say that every property can hold either a single value or
an array, then it's pretty straightforward for them to deal with
that. This gives us great flexibility, since anything can have
multiple values as appropriate.
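
As a small illustration of my own (not anything in RDFj itself), the
consuming code just normalises every value to an array before working
on it:

  // Whatever shape the value arrives in, treat it as an array of values.
  function asArray(value) {
    if (value === undefined || value === null) return [];
    return value instanceof Array ? value : [ value ];
  }

  var person = {
    "name": "Anna Wilder",
    "nick": [ "wilding", "wilda" ],
    "homepage": "http://example.org/about"
  };

  // One piece of work (or one triple) per value, whether or not the
  // property happened to be an array in the incoming JSON.
  asArray(person.nick).forEach(function (nick) {
    console.log("nick: " + nick);
  });
  asArray(person.name).forEach(function (name) {
    console.log("name: " + name);
  });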

But I realise I could have misunderstood this particular point, so
apologies if so.


> So I think a second API-level configuration that needs to be made is to
> indicate which properties should be arrays and which not:
>
>  <> a api:API ;
>    api:mapping [
>      api:property foaf:nick ;
>      api:name "nick" ;
>      api:array true ;
>    ] .

I think this over-complicates things, and since most JS programmers
can work it out themselves (by testing the type of the data), I'm not
sure they will thank you for the extra information anyway. :)


> ## Typed Values and Languages ##
>
> Typed values and values with languages are really the same problem. If we
> have something like:
>
>  <http://statistics.data.gov.uk/id/local-authority-district/00PB>
>    skos:prefLabel "The County Borough of Bridgend"@en ;
>    skos:prefLabel "Pen-y-bont ar Ogwr"@cy ;
>    skos:notation "00PB"^^geo:StandardCode ;
>    skos:notation "6405"^^transport:LocalAuthorityCode .
>
> then we'd really want the JSON to look something like:
>
>  {
>    "$": "http://statistics.data.gov.uk/id/local-authority-district/00PB",
>    "name": "The County Borough of Bridgend",
>    "welshName": "Pen-y-bont ar Ogwr",
>    "onsCode": "00PB",
>    "dftCode": "6405"
>  }
>
> I think that for this to work, the configuration needs to be able to filter
> values based on language or datatype to determine the JSON property name.
> Something like:
>
>  <> a api:JSON ;
>    api:mapping [
>      api:property skos:prefLabel ;
>      api:lang "en" ;
>      api:name "name" ;
>    ] , [
>      api:property skos:prefLabel ;
>      api:lang "cy" ;
>      api:name "welshName" ;
>    ] , [
>      api:property skos:notation ;
>      api:datatype geo:StandardCode ;
>      api:name "onsCode" ;
>    ] , [
>      api:property skos:notation ;
>      api:datatype transport:LocalAuthorityCode ;
>      api:name "dftCode" ;
>    ] .

Of course there are many ways to skin a cat, so I don't want to rule
this out of court. But to me it's just way too RDF-like.

First, I think over the longer term we could actually get JS authors
to accept extra data being added to strings, like this:

  {
    "$": "http://statistics.data.gov.uk/id/local-authority-district/00PB",
      "name": [
        "The County Borough of Bridgend",
        "Pen-y-bont ar Ogwr@cy"
      ]
  }

You might respond that this is also RDF-like, but I think it's a
question of degree. I think there's a great deal of value in
JavaScript in being able to indicate what language something is in,
independent of RDFj or other solutions.

But also, as described earlier, at a programmatic level, we could
provide developers with extra properties, like this:

  var s = "Pen-y-bont ar Ogwr@cy";

  assert(s === "Pen-y-bont ar Ogwr@cy");
  assert(s.value === "Pen-y-bont ar Ogwr");
  assert(s.lang === "cy");


> ## Nesting Objects ##
>
> Regarding nested objects, I'm again inclined to view this as a configuration
> option rather than something that is based on the available data. For
> example, if we have:
>
>  <http://example.org/about>
>    dc:title "Anna's Homepage"@en ;
>    foaf:maker <http://example.org/anna> .
>
>  <http://example.org/anna>
>    foaf:name "Anna Wilder" ;
>    foaf:homepage <http://example.org/about> .
>
> this could be expressed in JSON as either:
>
>  {
>    "$": "http://example.org/about",
>    "title": "Anna's Homepage",
>    "maker": {
>      "$": "http://example.org/anna",
>      "name": "Anna Wilder",
>      "homepage": "http://example.org/about"
>    }
>  }
>
> or:
>
>  {
>    "$": "http://example.org/anna",
>    "name": "Anna Wilder",
>    "homepage": {
>      "$": "http://example.org/about",
>      "title": "Anna's Homepage",
>      "maker": "http://example.org/anna"
>    }
>  }
>
> The one that's required could be indicated through the configuration, for
> example:
>
>  <> a api:API ;
>    api:mapping [
>      api:property foaf:maker ;
>      api:name "maker" ;
>      api:embed true ;
>    ] .

I realise that the two serialisations of RDFj would not be the same,
but I'm not seeing what difference that would make.

Are you thinking that someone might write some code that relies on the
structure of the object, and then gets thrown by a change in
structure?

I guess that's true, but in my work on RDFj, I had come to the
conclusion that people would write processors that deal with little
blocks of the data, and then call those as and when. So in your
example, if we wrote a processor that handled the object attached to
the predicate 'maker' and another processor for the predicate
'homepage', then it shouldn't really matter in which order the data
appeared, the correct processors would just be called.
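
As a rough sketch of that idea (mine, purely for illustration):
register a little handler per property, walk whatever structure comes
back, and the nesting order no longer matters:

  var handlers = {
    "maker":    function (v) { console.log("maker: "    + (v["$"] || v)); },
    "homepage": function (v) { console.log("homepage: " + (v["$"] || v)); }
  };

  // Walk the object; call a handler wherever its property appears, and
  // recurse into nested objects so either of the two structures works.
  function process(obj) {
    for (var key in obj) {
      var value = obj[key];
      if (handlers.hasOwnProperty(key)) handlers[key](value);
      if (value && typeof value === "object") process(value);
    }
  }

  process({
    "$": "http://example.org/about",
    "title": "Anna's Homepage",
    "maker": {
      "$": "http://example.org/anna",
      "name": "Anna Wilder",
      "homepage": "http://example.org/about"
    }
  });
  // maker: http://example.org/anna
  // homepage: http://example.org/about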

But also -- and this may be a key difference in our view on a possible
architecture -- I place all RDFj /received/ into a triple store,
alongside any other triples, including RDFa-generated ones, and then
the author retrieves RDFj from this triple store, and consequently can
structure the data in whatever way they prefer.


> The final thought that I had for representing RDF graphs as JSON was about
> suppressing properties. Basically I'm thinking that this configuration
> should work on any graph, most likely one generated from a DESCRIBE query.
> That being the case, it's likely that there will be properties that repeat
> information (because, for example, they are a super-property of another
> property). It will make a cleaner JSON API if those repeated properties
> aren't included. So something like:
>
>  <> a api:API ;
>    api:mapping [
>      api:property admingeo:contains ;
>      api:ignore true ;
>    ] .

I think we need to be thinking about how to get closer to SPARQL here,
though. (Actually this point is the same for a number of the other
circumstances.)

My fear here is that we're either duplicating the functionality that
can be provided within a SPARQL query, or we're adding a layer above
it, when actually the information should be expressed at the query
layer.

I'm not suggesting that JavaScript authors should have to get involved
with SPARQL. But if we imagine SPARQL recast for JavaScript, then I
think it's there that the kinds of things you want should be
described, and not in the API.

Perhaps we should look at the query side, and then move some of your
constraints into that?


> # SPARQL Results #
>
> I'm inclined to think that creating JSON representations of SPARQL results
> that are acceptable to normal developers is less important than creating
> JSON representations of RDF graphs, for two reasons:
>
>  1. SPARQL naturally gives short, usable, names to the properties in JSON
> objects
>  2. You have to be using SPARQL to create them anyway, and if you're doing
> that then you can probably grok the extra complexity of having values that
> are objects

I think the two things are inseparable, but that's probably because,
as I say, I put all data into a JavaScript triple store, and then
query it with a SPARQL-ish JavaScript syntax.

Currently I get back simple JSON objects whose properties are named
after the variables used in the query, but my plan is to converge the
query results with my RDFj work, so that it's RDFj in, and RDFj out.

I've said this before, I know, but what's great about this technique
is that the query engine becomes a 'JSON-object creator', and I think
this is a very powerful programming paradigm.
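
To illustrate -- this is a self-contained toy, not my actual query
engine or the jSPARQL syntax -- a pattern with '?variable'
placeholders is matched against a store, and each match comes back as
a plain object keyed by the variable names:

  // Toy matcher: one triple pattern, "?x" marks a variable to capture.
  function query(triples, pattern) {
    var results = [];
    for (var i = 0; i < triples.length; i++) {
      var t = triples[i], binding = {}, ok = true;
      [ "s", "p", "o" ].forEach(function (k) {
        var v = pattern[k];
        if (typeof v === "string" && v.charAt(0) === "?") {
          binding[v.slice(1)] = t[k];      // variable: capture the value
        } else if (v !== t[k]) {
          ok = false;                      // constant: must match exactly
        }
      });
      if (ok) results.push(binding);
    }
    return results;
  }

  var triples = [
    { s: "http://www.w3.org/TR/rdf-syntax-grammar",
      p: "http://purl.org/dc/elements/1.1/title",
      o: "RDF/XML Syntax Specification (Revised)" }
  ];

  query(triples, { s: "?doc",
                   p: "http://purl.org/dc/elements/1.1/title",
                   o: "?title" });
  // => [ { doc: "http://www.w3.org/TR/rdf-syntax-grammar",
  //        title: "RDF/XML Syntax Specification (Revised)" } ]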


> Nevertheless, there are two things that could be done to simplify the SPARQL
> results format for normal developers.
>
> One would be to just return an array of the results, rather than an object
> that contains a results property that contains an object with a bindings
> property that contains an array of the results. People who want metadata can
> always request the standard SPARQL results JSON format.
>
> The second would be to always return simple values rather than objects. For
> example, rather than:
>
>  {
>    "head": {
>      "vars": [ "book", "title" ]
>    },
>    "results": {
>      "bindings": [
>        {
>          "book": {
>            "type": "uri",
>            "value": "http://example.org/book/book6"
>          },
>          "title": {
>            "type": "literal",
>            "value": "Harry Potter and the Half-Blood Prince"
>          }
>        },
>        {
>          "book": {
>            "type": "uri",
>            "value": "http://example.org/book/book5"
>          },
>          "title": {
>            "type": "literal",
>            "value": "Harry Potter and the Order of the Phoenix"
>          }
>        },
>        ...
>      ]
>    }
>  }
>
> a normal developer would want to just get:
>
>  [{
>    "book": "http://example.org/book/book6",
>    "title": "Harry Potter and the Half-Blood Prince"
>   },{
>     "book": "http://example.org/book/book5",
>     "title": "Harry Potter and the Order of the Phoenix"
>   },
>   ...
>  ]

Yes, that's what I do in my query engine. As I say, this makes
querying a triple-store into a 'dynamic object creation' mechanism,
and I'm convinced that JS programmers will grok this pretty easily.

But also, since the input to the triple-store can be RDFj, and this
output can also be RDFj, it makes it very easy to move data around.


> I don't think we can do any configuration here. It means that information
> about datatypes and languages isn't visible...

I think it can be, using the techniques I've explained above.
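
For example (purely illustrative), a simplified result row could still
carry the language and datatype using the same N3-like notation
described above:

  [{
    "authority": "<http://statistics.data.gov.uk/id/local-authority-district/00PB>",
    "label": [ "The County Borough of Bridgend@en", "Pen-y-bont ar Ogwr@cy" ],
    "onsCode": "00PB^^geo:StandardCode"
  }]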


> ... but (a) I'm pretty sure that
> 80% of the time that doesn't matter, (b) there's always the full JSON
> version if people need it and (c) they could write SPARQL queries that used
> the datatype/language to populate different variables/properties if they
> wanted to.

I think we should strive to preserve all of the information.


> So there you are. I'd really welcome any thoughts or pointers about any of
> this: things I've missed, vocabularies we could reuse, things that you've
> already done along these lines, and so on. Reasons why none of this is
> necessary are fine too, but I'll warn you in advance that I'm unlikely to be
> convinced ;)

+94!

I really agree with you, Jeni, and I really think this whole space is
incredibly important for semweb applications.

My feeling is that RDFj is largely there, but that the place where
most of the issues you raise should be resolved is in a 'JSON query'
layer.

I've been working on something I've called jSPARQL, which uses JSON
objects to express queries, but it needs quite a bit more work to get
to something that would feel 'comfortable' to a JavaScript programmer
-- perhaps you'd be interested in helping to get that into shape?

Regards,

Mark

[1] <http://code.google.com/p/backplanejs/wiki/Rdfj>
[2] <http://code.google.com/p/backplanejs/wiki/CreateNamedGraphMapper>

--
Mark Birbeck, webBackplane

mark.birbeck@webBackplane.com

http://webBackplane.com/mark-birbeck

webBackplane is a trading name of Backplane Ltd. (company number
05972288, registered office: 2nd Floor, 69/85 Tabernacle Street,
London, EC2A 4RR)
