Re: Input needed from RDF group on JSON-LD skolemization

On 07/04/2013 12:28 PM, David Booth wrote:
> On 07/04/2013 04:16 AM, Markus Lanthaler wrote:
>> On Thursday, July 04, 2013 3:54 AM, David Booth wrote:
>>>>> Regarding stability, AFAICT relative IRIs would be nearly as stable
>>>>> as any versioned IRI: the IRI may change if the author decides to
>>>>> version it, but aside from that it is exactly the same every time
>>>>> the data is generated, even if other data elements are added, etc.
>>>>> That is far
>>>>
>>>> I completely disagree. While technically you are right, the whole
>>>> point of using a bnode is to convey it is in fact *not stable* and is
>>>> not intended to be.
>>>
>>> Again, you may think of blank nodes that way if you wish, but that is
>>> not why they were invented.
>>
>> Just out of curiosity, why have they then been invented if not to
>> provide a way to express some facts about an "entity" that is unknown?
>
> They were invented to allow an RDF author to indicate that an entity 
> is known to exist, and allow facts to be expressed about it.  The fact 
> that bnodes lack a stable identifier was an (unfortunate) by-product 
> -- not the purpose of their invention.
>
>>
>>
>>>> The point is that I don't want them to be stable. I explicitly want to
>>>> prevent that people start to rely on them.
>>>
>>> I suppose that would make sense if your goal is to annoy downstream
>>> consumers of your data, but that's rather anti-social. Making it hard
>>> for others to refer to resources mentioned in your data is widely
>>> viewed as a *negative* -- not a positive -- and it goes against the
>>> philosophy of the web.
>>
>> That might be true... but exactly the same applies to bnode subjects
>> and objects. Arguably even more so to subjects. So why do you think
>> predicates are so special?
>
> Yes, it does apply to subjects and objects also.  Blank node 
> predicates are special because they are not a part of standard RDF.  
> And they are not a part of standard RDF because enough of the working 
> group thought it would not be a good idea to allow blank nodes as 
> predicates, just as enough of the working group thought it would not 
> be a good idea to allow literals as subjects.  That could change, of 
> course, but it cannot change anytime soon, because the RDF working 
> group charter explicitly states that blank node predicates are out of 
> scope.
>
>>
>>
>>>> OK, so what if we added a "generalizedRDF" flag to the toRDF
>>>> algorithm which, when set to false, would filter all quads where a
>>>> bnode is in predicate position? I would prefer the default value to
>>>> be set to true but could, if there's a good argument, also live with
>>>> a false.
>>>>
>>>> Would that address your concerns?
>>>
>>> Well, no.  An option for extended RDF would be fine (defaulting to
>>> standard RDF), but discarding triples would not be fine, because it
>>> would involve unnecessary information loss.  That would bring us back
>>> to figuring out how to avoid that information loss. Skolemization
>>> would be one way to do it, but the use of relative URIs seems like a
>>> better option because it is so much simpler and it gives the
>>> additional benefits (which I understand you do not see as benefits)
>>> of more stable identifiers that could eventually be made
>>> dereferenceable.
>>
>> You can't have a syntax which sometimes allows bnode predicates and
>> sometimes doesn't. The only option in that case is to raise an error
>> when converting to RDF saying that information may be lost because
>> some generated triples contain bnode predicates. That would be
>> acceptable for me but I fear it won't satisfy you either.
>
> Right.  So the other option is for JSON-LD to prohibit blank nodes as 
> properties.  Authors could simply use relative IRIs instead.

So I don't consider this situation to be all that different from the one 
where an author elects not to provide any mappings at all for certain 
keys in their JSON. We currently allow this to happen -- and it's an 
important use case for at least two reasons:

1. It allows authors to slowly transition over to using JSON-LD -- 
mapping only those keys in their data that they are ready to map, i.e. 
the keys they are confident will map to the correct URL. Also note that 
JSON developers know nothing about owl:sameAs and we don't need to 
introduce them to another level of complexity right out of the gate.

2. It allows authors to use their APIs both as JSON and as JSON-LD. This 
covers two main uses: keeping existing consumers of the JSON API from 
being broken while servers upgrade and consolidate code paths, and 
letting servers include data that is intended to be "private" (not in a 
security sense) to one particular use of their API (e.g. an HTML 
interface to their data) without exposing it as meaningful data 
otherwise.

The point of all this is that sometimes authors would prefer data to be 
"lost" in some scenarios and not in others. If the option discussed 
above were available (a generalized-RDF flag, with triples that have 
blank node predicates dropped unless it is set), it would allow authors 
to continue this useful practice whilst having the default behavior 
produce fully compliant RDF.

For a more concrete example:

Suppose a server has been serving this JSON for a while:

{
   "foo": "bar",
   "about": {
     "id": "1",
     "name": "Phillip J. Fry"
   },
   "website_status": {
     "editor": {
       "id": "1",
       "changes": 4
     },
     "ad636ee3fb": true
   }
}

Clients that are consuming this data as JSON really only look at "foo" 
and maybe "about", except for the particular website client WC, which 
also makes use of "website_status". The author has communicated, 
out-of-band, that anything starting with "website_" is unstable data 
that should be ignored by consumers of the API.

Now, the author of this data would like to make it consumable as RDF, so 
a change is made to include a @context that appropriately maps "foo", 
"about", "id", and "name" to URLs/aliases. Any RDF client (that 
understands JSON-LD) can then understand the meaning of those keys. 
However, the author still only uses "website_status" on their local 
website and doesn't want to have to deal with keeping it stable for any 
clients. JSON clients know this through the out-of-band convention, and 
RDF clients are effectively aware of it too: "website_status" has no 
mapping, so it is simply dropped by JSON-LD processors. No out-of-band 
communication is necessary for the RDF clients.
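
To make that change concrete, a context along these lines would do it. 
The http://example.com/vocab# URL is just a placeholder for whatever 
vocabulary the author actually publishes, and aliasing "id" to @id is 
one reasonable choice:

{
  "@context": {
    "id": "@id",
    "foo": "http://example.com/vocab#foo",
    "about": "http://example.com/vocab#about",
    "name": "http://example.com/vocab#name"
  },
  "foo": "bar",
  "about": {
    "id": "1",
    "name": "Phillip J. Fry"
  },
  "website_status": {
    "editor": {
      "id": "1",
      "changes": 4
    },
    "ad636ee3fb": true
  }
}

Since "website_status", "editor", "changes", and "ad636ee3fb" have no 
mappings and don't look like IRIs, a JSON-LD processor drops that whole 
branch during expansion/RDF conversion, while plain JSON clients see 
exactly the same document they always have.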

Now, suppose the author would like to make the "changes" data found in 
"website_status" available to RDF clients without changing their 
existing JSON structure. They would prefer not to leak indexed hashes of 
private information (that appear as hex JSON keys above) as stable 
predicates in their data. The meaning of those hash predicates or their 
range may change in the future. They'd also prefer not to leak 
"website_status". They may decide to update WC so it can consume RDF, at 
which time perhaps they'd want access to that information, but that's 
not in the plan right now. For now, they'd simply like RDF clients to 
take advantage of the "changes" data.

Can they do this with minimal work on their end?

If the author could map any key without a specific mapping to a blank 
node, then the author could easily achieve most of the above goals. This 
would allow the deeply-embedded "changes" data to be seen and output by 
a JSON-LD processor. If a JSON-LD processor, by default, dropped triples 
with blank node predicates, they could achieve even more -- as most RDF 
clients would then ignore the data that the author would prefer to be 
ignored. But if it can't be ignored, that's not so bad, because at least 
it is only blank node data -- there are no mappings to URLs that the 
author really doesn't want. If a JSON-LD processor had an option for 
keeping those blank nodes, then the potential future plan of updating WC 
to an RDF client could also work out, as the author would know to set 
the special option to keep the data they want -- just for their website.
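
As a sketch of what that might look like -- assuming, purely for 
illustration, that a context were allowed to send otherwise-unmapped 
keys to blank nodes, e.g. via something like "@vocab": "_:" (not settled 
syntax, just one way to spell the idea) -- the context could grow to 
this, with the JSON body unchanged:

{
  "@context": {
    "@vocab": "_:",
    "id": "@id",
    "foo": "http://example.com/vocab#foo",
    "about": "http://example.com/vocab#about",
    "name": "http://example.com/vocab#name",
    "changes": "http://example.com/vocab#changes"
  }
}

With that, "website_status", "editor", and the hex key expand to blank 
node predicates while "changes" expands to a real URL, so the triple 
saying that node "1" made 4 changes survives even when a processor drops 
blank-node-predicate triples by default; turning the keep-generalized-RDF 
option on is what WC would do if it is ever updated to consume RDF.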

If there is no way to map predicates to blank nodes, then the author has 
to consider other options. If the author uses relative URLs, they'd 
expose predicates that were never intended to be exposed and that have 
semantics that may change. The author wants to be able to innovate and 
play with that particular data before (if ever) it is linked to a stable 
URL. Instead of engaging in what they would consider data pollution, the 
author may instead elect to go through a costly API upgrade path that 
may break existing JSON clients.
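
For comparison, here is roughly what the relative-URL route would 
produce. Assuming -- again purely for illustration -- that the document 
lives at http://api.example.com/fry.json, that unmapped keys were 
expanded as IRIs relative to that URL, and that "changes" keeps the 
vocabulary mapping from the earlier sketch, the "website_status" branch 
would come out of expansion along these lines:

{
  "http://api.example.com/website_status": [
    {
      "http://api.example.com/editor": [
        {
          "@id": "http://api.example.com/1",
          "http://example.com/vocab#changes": [ { "@value": 4 } ]
        }
      ],
      "http://api.example.com/ad636ee3fb": [ { "@value": true } ]
    }
  ]
}

Those http://api.example.com/... predicates look authoritative and 
dereferenceable even though the author never meant to commit to them, 
which is exactly the kind of pollution the author would rather avoid.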

I think there are use cases where authors simply aren't "ready" to 
publish *all* their data or would like to reuse the same APIs for 
different purposes. By disallowing blank node predicates we make their 
lives more difficult. Perhaps some of these practices can be described 
as "anti-web" (hiding/siloing information), but I think that there are 
practical uses for them and that a blind opposition to "anti-web" 
practices is not a good policy. This is particularly true for cases 
where an author is actually trying to become less "anti-web", but they 
can't easily get there because it's all or nothing.

-- 
Dave Longley
CTO
Digital Bazaar, Inc.

Received on Tuesday, 9 July 2013 18:21:00 UTC