Re: Input needed from RDF group on JSON-LD skolemization from Dave Longley on 2013-07-10 (public-linked-json@w3.org from July 2013)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Tue, 09 Jul 2013 21:44:12 -0400
To: David Booth <david@dbooth.org>
CC: Markus Lanthaler <markus.lanthaler@gmx.net>, public-linked-json@w3.org
Message-ID: <51DCBC6C.7070509@digitalbazaar.com>

On 07/09/2013 04:27 PM, David Booth wrote:
> Hi Dave,
>
> Thanks for the example.  Responses below . . .
>
> On 07/09/2013 02:20 PM, Dave Longley wrote:
>> On 07/04/2013 12:28 PM, David Booth wrote:
>>> [ . .  ] the other option is for JSON-LD to prohibit blank nodes as
>>> properties.  Authors could simply use relative IRIs instead.
>>
>> So I don't consider this situation to be all that different from the one
>> where an author elects not to provide any mappings at all for certain
>> keys in their JSON. We currently allow this to happen -- and it's an
>> important use case for at least two reasons:
>>
>> 1. It allows authors to slowly transition over to using JSON-LD --
>> mapping only those keys in their data that they are ready to, that they
>> are confident will be mapped to the correct URL. Also note that JSON
>> developers know nothing about owl:sameAs and we don't need to introduce
>> them to another level of complexity right out of the gate.
>
> Right, but you are assuming that they understand the notion of 
> "unstable" or "private" (not in a security sense).  Thus it seems 
> perfectly reasonable to generate URIs that are explicitly marked as 
> unstable, so that downstream consumers will not complain if they change:
>
>   ...
>   "@context": {
>     "@vocab": "UNSTABLE/"
>   }
>   ...

This seems antithetical to Linked Data -- creating URLs that are 
intended to be thrown away or have their semantics change. That being 
said, I'd actually be ok with that approach for the practicality of it. 
I just can't seem to reconcile why you would be, given your arguments 
regarding "anti-web" practices. This seems more "anti-web" to me than 
the alternative approach.

In any event, that value for @vocab would have to be changed to an 
absolute IRI (but could obviously still include the string "UNSTABLE"), 
if I'm not mistaken. Unfortunately, this brings along the need for 
out-of-band information, which would otherwise have been eliminated by 
switching from JSON to JSON-LD.

>
>>
>> 2. It allows authors to use their APIs both as JSON and as JSON-LD. This
>> covers two main uses: preventing existing consumers of JSON APIs from
>> being messed with whilst allowing servers to upgrade and consolidate
>> code paths, and allowing servers to include data that is intended to be
>> "private" (not in a security sense) to one particular use of their API
>> (eg: for an HTML interface to their data) without exposing it as valid
>> data otherwise.
>>
>> The point of all this is that sometimes authors would prefer data to be
>> "lost" in some scenarios, and not in others.
>
> If the author wants the data to be lost, it should be omitted entirely 
> or encrypted -- not included using blank nodes.

Well, it's more like the author would prefer it to be lost, but doesn't 
really care if it isn't so long as it doesn't come back to haunt them 
later (data pollution). My concern with minting URLs for this data is 
that it gives it an air of permanence that, otherwise, IMO, would not 
exist. I imagine that a significant number of JSON developers would feel 
similarly, but that's subjective.

>
> There is an important difference between stating that an identifier or 
> data element is unstable (and hence downstream consumers should not 
> rely on it to remain the same) and intentionally making it difficult 
> for downstream consumers to use the data.

Well, it's arguable whether it's significantly more difficult for 
JSON-LD consumers, but perhaps for RDF consumers (using another syntax) 
it would be.

>   If the data is included in the JSON-LD document, and marked as 
> UNSTABLE, it is the downstream consumer's business what they try to do 
> with that data -- not the author's business.

Well, the use case is that the author is trying to mix data for 
downstream consumers and for private-use via the same API (because it's 
very convenient to do so). When this data is just JSON, they really 
don't care whether or not someone accesses/uses that private data. There 
are no ramifications for a downstream consumer using "private" JSON. 
There could be, however, ramifications for a downstream consumer using 
JSON-LD. Once someone else starts making assertions about "unstable" 
URLs, or if enough people do, this could become a real problem for the 
author, who never intended to have to support that data. They may cause 
anger or frustration by changing the meaning of certain URLs (or by 
removing them) even though they told downstream consumers that the URLs 
were unstable. This could put an author in an uncomfortable position and 
people won't say: "Oh, well, those URLs were unstable from the 
beginning". Well, I'm sure someone will say that, but good luck with 
that argument. The author is more likely to be coerced into doing 
something they didn't want so as to avoid destroying now useful data -- 
despite the fact that the bad behavior didn't begin with them. It's 
easier to avoid or fix this with the other approach, IMO.

>   The author *should* make clear that unstable data is not supported, 
> but the author should not make it gratuitously more difficult for 
> downstream consumers to use that data.
>
> Blank nodes are *not* the right mechanism to use to prevent downstream 
> consumers from using the data.  They do not prevent it from being 
> used, they just make it harder.
>
>> If the above option were
>> available, it would allow authors to continue this useful practice
>> whilst having the default behavior produce fully compliant RDF.
>>
>> For a more concrete example:
>>
>> Suppose a server has been serving this JSON for a while:
>>
>> {
>>    "foo": "bar",
>>    "about": {
>>      "id": "1",
>>      "name": "Phillip J. Fry"
>>    },
>>    "website_status": {
>>      "editor": {
>>        "id": "1",
>>        "changes": 4
>>      },
>>      "ad636ee3fb": true
>>    }
>> }
>>
>> Clients that are consuming this data as JSON really only look at "foo"
>> and maybe "about", except for the particular website client WC, which
>> also makes use of "website_status". The author has communicated,
>> out-of-band, that anything starting with "website_" is unstable data
>> that should be ignored by consumers of the API.
>>
>> Now, the author of this data would like to make it consumable as RDF, so
>> a change is made to include a @context that appropriately maps "foo",
>> "about", "id", and "name" to URLs/aliases. Now any RDF clients (that
>> understand JSON-LD) can understand the meaning of those keys. However,
>> the author still only uses "website_status" on their local website and
>> doesn't want to have to deal with keeping it stable for any clients.
>> JSON clients are aware of this but so are RDF clients, as
>> "website_status" has no meaning to them; it is dropped by JSON-LD
>> processors. No out-of-band communication is necessary for the RDF 
>> clients.
>>
>> Now, suppose the author would like to make the "changes" data found in
>> "website_status" available to RDF clients without changing their
>> existing JSON structure. They would prefer not to leak indexed hashes of
>> private information (that appear as hex JSON keys above) as stable
>> predicates in their data. The meaning of those hash predicates or their
>> range may change in the future. They'd also prefer not to leak
>> "website_status". They may decide to update WC so it can consume RDF, at
>> which time perhaps they'd want access to that information, but that's
>> not in the plan right now. For now, they'd simply like RDF clients to
>> take advantage of the "changes" data.
>>
>> Can they do this with minimal work on their end?
>
> Sure, and they can do it by adding a context, without the use of blank 
> node properties and without changing the JSON content:
>
>   {
>     "@context": {
>       "foo":  "http://example/stable/foo",
>       "about":  "http://example/stable/about",
>       "changes":  "http://example/stable/changes",
>       "@vocab": "UNSTABLE/"
>       # Or:  "@vocab": "http://example/UNSTABLE/"
>     },
>      "foo": "bar",
>      "about": {
>        "@id": "1",
>        "name": "Phillip J. Fry"
>      },
>      "website_status": {
>        "editor": {
>          "@id": "1",
>          "changes": 4
>        },
>        "ad636ee3fb": true
>      }
>   }
>
> How would this be done if blank nodes were permitted as properties? 
> (You neglected to show that.)
>
> I was unable to determine from the JSON-LD spec whether @vocab could 
> be used to specify a blank node prefix.  (Can it?)

Yes, it can. I'm sorry I left that out -- it was the same idea from an 
earlier example by Markus. Also, he recently fixed the spec to clarify this.
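
To make the idea concrete, here is a rough sketch of how keys from the 
example would map if @vocab were a blank node prefix. This is a 
deliberately simplified stand-in for term expansion, not the actual 
JSON-LD algorithm; the function name expand_key and the use of "_:" as 
the @vocab value are assumptions made for illustration.

```python
# Simplified, hypothetical term expansion -- NOT the real JSON-LD algorithm.
# The stable URLs are taken from the example context quoted above; the
# "_:" value for @vocab is the blank node prefix under discussion.
context = {
    "foo": "http://example/stable/foo",
    "about": "http://example/stable/about",
    "changes": "http://example/stable/changes",
    "@vocab": "_:",
}


def expand_key(key, ctx):
    """Return the identifier a key maps to, or None if it would be dropped."""
    if key.startswith("@"):
        return key  # keywords such as @id are left alone
    if key in ctx:
        return ctx[key]  # explicitly mapped term
    vocab = ctx.get("@vocab")
    if vocab is None:
        return None  # no mapping at all: the key is dropped
    return vocab + key  # fall back to the vocabulary prefix


keys = ["foo", "about", "changes", "website_status", "editor", "ad636ee3fb"]
expanded = {k: expand_key(k, context) for k in keys}
for key, value in expanded.items():
    print(key, "->", value)
```

With this context the explicitly mapped keys get their stable URLs, 
while "website_status", "editor" and the hash key only get 
document-local identifiers. The JSON content itself is unchanged.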

>   If not, then it seems to me that to achieve this with blank nodes, 
> the JSON content would have to be changed, whereas it does not have to 
> be changed if URIs are used.  That would be a *major* advantage of 
> URIs over blank nodes in this example.
>
> SIDE NOTE: I also did not see anything in the JSON-LD spec that would 
> prohibit @vocab from specifying a relative IRI such as
>
>   "@vocab": "UNSTABLE/"
>
> Section B.7 says "If the context definition has a @vocab key, its 
> value MUST be a absolute IRI, a compact IRI, a term, or null.":
> http://json-ld.org/spec/latest/json-ld/#context-definitions
>
> However, I notice that the playground expects absolute IRIs, so I 
> don't know if I missed something or the playground is wrong:
> http://json-ld.org/playground/

@vocab must be an absolute IRI (or be mapped to one via the @context) or 
a blank node identifier. Nearly every place where an absolute IRI can be 
used in JSON-LD, so can a blank node identifier -- and because the check 
we use to detect either is the same (we simply look for a colon), we had 
a few bugs in the spec where we neglected to mention blank node 
identifiers. Using one for @vocab was one of the remaining bugs, 
hopefully the last.
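
Roughly, the check amounts to something like the sketch below. The 
function name classify_vocab_value and the returned labels are made up 
for illustration, and this ignores the case where the value is a term 
mapped via the @context; it is not the code of any particular processor.

```python
def classify_vocab_value(value):
    """Classify a candidate @vocab value using the simple colon check.

    Hypothetical helper: a value with no colon is treated as relative,
    a value starting with "_:" as a blank node identifier, and anything
    else containing a colon as an absolute (or compact) IRI.
    """
    if not isinstance(value, str) or ":" not in value:
        return "relative"
    if value.startswith("_:"):
        return "blank node identifier"
    return "absolute IRI"


for candidate in ["UNSTABLE/", "http://example/UNSTABLE/", "_:"]:
    print(repr(candidate), "->", classify_vocab_value(candidate))
```

Under this check "UNSTABLE/" is rejected as relative, which matches 
what the playground does, while both "http://example/UNSTABLE/" and 
"_:" pass.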

>
>>
>> If the author could map any non-specifically-mapped predicate to a blank
>> node, then the author could easily achieve most of the above goals. This
>> would allow the deeply-embedded "changes" data to be seen and output by
>> a JSON-LD processor. If a JSON-LD processor, by default, dropped blank
>> node predicates, they could achieve even more -- as most RDF clients
>> would ignore the data that the author would prefer to be ignored. But if
>> it can't be ignored, that's not so bad because at least it is only blank
>> node data -- there are not mappings to URLs that the author really
>> doesn't want. If a JSON-LD processor had an option for keeping those
>> blank nodes, then their potential future plans (updating X to an RDF
>> client) could also work out, as they'd know to set the special option to
>> keep the data they want -- just for their website.
>
> That's getting pretty contrived, to say that you want some clients to 
> drop the information and others to retain the information.  I think it 
> is the clients' business to decide what they wish to do with the 
> information -- whether to keep it or drop it.

Well, it's more like you tell clients that certain data can (or should) 
be ignored. They don't have to ignore it, but like I said above, if it's 
JSON this is a non-issue. When it's Linked Data, it could become a 
problem -- and we're discussing transitioning JSON APIs to JSON-LD APIs.
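
As a sketch of what "can be ignored" might look like on the consuming 
side, assume the example has been converted to triples with @vocab set 
to "_:". The triples, the base URL "http://example/" and the option 
name keep_blank_predicates below are all assumptions for illustration, 
not an existing processor API.

```python
# Hypothetical triples for the example document, assuming "_:" as @vocab
# and "http://example/" as the base URL for the relative @id "1".
triples = [
    ("_:b0", "http://example/stable/foo", "bar"),
    ("_:b0", "http://example/stable/about", "http://example/1"),
    ("http://example/1", "_:name", "Phillip J. Fry"),
    ("_:b0", "_:website_status", "_:b1"),
    ("_:b1", "_:editor", "http://example/1"),
    ("http://example/1", "http://example/stable/changes", 4),
    ("_:b1", "_:ad636ee3fb", True),
]


def drop_blank_predicates(triples, keep_blank_predicates=False):
    """Drop triples whose predicate is a blank node, unless asked to keep them."""
    if keep_blank_predicates:
        return list(triples)
    return [t for t in triples if not t[1].startswith("_:")]


default_view = drop_blank_predicates(triples)
website_view = drop_blank_predicates(triples, keep_blank_predicates=True)
print(len(default_view), "triples by default,", len(website_view), "with the option set")
```

By default only the triples with stable predicates survive, including 
the deeply-embedded "changes" value; the author's own website client 
could set the option to keep everything.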

>
>>
>> If there is no way to map predicates to blank nodes, then the author has
>> to consider other options. If the author uses relative URLs, they'd
>> expose predicates that were never intended to be exposed and that have
>> semantics that may change. The author wants to be able to innovate and
>> play with that particular data before (if ever) it is linked to a stable
>> URL.
>
> Yes, but that is the whole point of marking certain APIs, names or 
> data elements as "unstable" or "private".  Developers already 
> understand that concept.  Blank nodes are not needed or intended for that.

I think this may be trickier than you're suggesting when it comes to 
Linked Data. I could be wrong. When other people start making assertions 
about data that lives out on the Web, i.e. there's a URL for it, I think 
there's an expectation for that data to have greater persistence than if 
there's no such URL. It doesn't matter if an author warns people that it 
might go away when that warning is ignored by enough people.

>
>> Instead of engaging in what they would consider data pollution, the
>> author may instead elect to go through a costly API upgrade path that
>> may break existing JSON clients.
>
> Any developer knows that if they rely on an API that is explicitly 
> marked as "private" or "unstable" they do so at their own risk.

Again, I'm more uneasy about this when the identifiers for the data 
become URLs/IRIs -- and when JSON developers are being told about 
"Linked Data" and the benefits of what all that entails. I'd rather see 
document-local identifiers used in this case.

>
>>
>> I think there are use cases where authors simply aren't "ready" to
>> publish *all* their data or would like to reuse the same APIs for
>> different purposes. By disallowing blank node predicates we make their
>> lives more difficult.
>
> Based on the examples shown, I do not see it as being significantly 
> more difficult.  AFAICT there would not be much difference in what the 
> author would have to do, whether converting properties to blank nodes 
> or converting them to relative or unstable URIs.

The real problem is in what the author may be coerced into doing later.

>
>> Perhaps some of these practices can be described
>> as "anti-web" (hiding/siloing information), but I think that there are
>> practical uses for them and that a blind opposition to "anti-web"
>> practices is not a good policy. This is particularly true for cases
>> where an author is actually trying to become less "anti-web", but they
>> can't easily get there because it's all or nothing.
>>
>
> But as I showed above, I don't see it as all or nothing, as the author 
> can achieve a similar result with URIs.
>
> I still have not yet seen an example in which blank node properties 
> really seem to be needed.  AFAICT the use of URIs has the net benefits 
> of: (a) being friendlier to downstream RDF processing; (b) resulting 
> in standard RDF; and (c) avoiding information loss.
>
> David
>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
Received on Wednesday, 10 July 2013 01:44:57 UTC