Re: Bug with @reverse and blank nodes? from Gregg Kellogg on 2013-10-20 (public-linked-json@w3.org from October 2013)

From: Gregg Kellogg <gregg@greggkellogg.com>
Date: Sun, 20 Oct 2013 13:10:43 -0700
To: Markus Lanthaler <markus.lanthaler@gmx.net>
Cc: "'Pierre-Antoine Champin'" <pierre-antoine.champin@liris.cnrs.fr>, <public-linked-json@w3.org>, "'Dave Longley'" <dlongley@digitalbazaar.com>
Message-Id: <470E776C-6806-42EF-8FFF-605CC06BE63F@greggkellogg.com>
On Oct 20, 2013, at 12:23 PM, Markus Lanthaler <markus.lanthaler@gmx.net> wrote:

> On Sunday, October 20, 2013 6:57 PM, Gregg Kellogg wrote:
> On Oct 20, 2013, at 2:21 AM, Pierre-Antoine Champin wrote:
>>> Hi,
>>> I add a surprise today using @reverse with  blank node:
>>> 
>>> 
>>> {
>>>  "@context": {
>>>    "foo": "http://example.org/foo",
>>> 
>>>    "bar": { "@reverse": "http://example.org/", "@type": "@id" }
>>>  },
>>>  "foo": "Foo",
>>>  "bar": "http://example.org/origin"
>>> 
>>> }
>>> 
>>> produces the following graph:
>>> 
>>> 
>>> <http://example.org/origin> <http://example.org/> _:b1 .
>>> 
>>> _:b0 <http://example.org/foo> "Foo" .
>>> 
>>> with *two different bnodes*!...
>>> 
>>> I have the same behaviour with PyLD and the online playground, so may be
>>> this is a deliberate feature of the JSON-LD algorithm (I didn't go and
> check
>>> the API document, I admit...). But if it is so there should be a big
> warning
>>> sign in the syntax document, because this is *very* counter-intuitive.
> 
> No, it's a bug. Good catch
> 
> 
> [...]
>> I think the problem comes when in the flattening algorithm, which results
> in
>> the following node map:
>> 
>> {
>>  "@default": {
>>    "_:b0": {"@id": "_:b0"},
>>    "http://example.org/origin": {
>>      "@id": "http://example.org/origin",
>>      "http://example.org/": [{"@id": "_:b1"}]
>>    },
>>    "_:b1": {"@id": "_:b1"}
>>  }
>> }
>> 
>> I believe it comes down to step 6.1 of the Node Map Generation algorithm
>> [1], which says to generate a new blanknode identifier. It was previously
>> invoked from step 6.8.3.1.2 because of the @reverse keyword. Perhaps there
>> is some flag that needs to be passed so that a new blanknode lebel is not
>> generated? Markus or Dave might have some other opinions on this.
> 
> Exactly. The problem is that the blank node gets relabeled twice in one
> instance, but not in the other. The recursiveness of the algorithm makes it
> difficult to fix this. The best thing would be to label all blank nodes
> during already during expansion as we did before when we still had property
> generators but that's a too big change at this point in time. 
> 
> The alternative (ugly hack) would be to always relabel the blank node twice
> and remember the previous bnode ID so that it can be used to create
> "referenced node" that will then be relabeled the same way.
> 
> The following changes would be necessary to make this work (all in the node
> map generation algorithm):
> 
> 6.1) If element has an @id member, set id *and old id* to its value and
> remove the member from element. If id is a blank node identifier, replace it
> with a newly generated blank node identifier passing id for identifier.
> 6.2) Otherwise, set *old id* to the result of the Generate Blank Node
> Identifier algorithm passing null for identifier. *Then, set id to the
> result of the Generate Blank Node Identifier algorithm passing old id for
> identifier.*
> 
> 6.8.1) Create a JSON object referenced node with a single member @id whose
> value is *old* id.
> 
> Unfortunately that would change the output of most flattening/toRdf tests.
> 
> Hopefully someone has a better idea to fix this without changing a whole
> lot.

I've long been uncomfortable with the need of the algorithm to absolutely re-create the same blank node identifiers, as this seems to ignore the whole concept of a blank node, and imposes extremely strict processing rules that prohibit, for example, parallel processing of large nodes. In the long run, it might be worth describing a bnode bijection algorithm for JSON-LD, similar to that used for graph isomorphism. However, that's probably best left for another day.

A way I could address this in my implementation would be to have Generate Blank Node Identifier return an object rather than a string, such than when the object is serialized (when transforming from the native Ruby representation to a JSON character-sequence) the label is rendered. In this way, the Generate Blank Node Identifier algorithm could accept either a string or a previously generated object representation. We could then insert between steps one and two for the Generate Blank Node Identifier algorithm the following:

[[[
Otherwise, if _identifier_ is the result of  the same  _generator instance_, return _identifier_.
]]]

(_generator instance_ as a replacement for "Between its executions", as noted below).

The entire algorithm, including the bits to return an object representation, might look like the following:

[[[
The algorithm takes a single input variable identifier which may be null. Between its executions, the algorithm needs to keep an identifier map to relabel existing blank node identifiers consistently and a counter to generate new blank node identifiers. ***The resulting _identifier_ is an implementation-specific object tagged with the specific _generator instance_, which when serialized to JSON results in the string value of the identifier.*** The _counter_ is initialized to 0 by default.

1) If _identifier_ is not null and has an entry in the _identifier map_, return the mapped identifier.
2) Otherwise, if _identifier_ is the result of  the same  _generator instance_, return _identifier_.
3) Otherwise, generate a new blank node identifier by concatenating the string _:b and counter.
4) Increment counter by 1.
5) If identifier is not null, create a new entry for identifier in identifier map and set its value to the new blank node identifier.
6) Return the new blank node identifier.
]]]

(Note step 2 and the highlighted bit from the opening paragraph). This would avoid changing any other algorithm, should not cause any existing test result to change, and could be considered an editorial clarification made for PR. Of course, we should also add one or more tests that illustrate this particular behavior.

We might want to clarify the second sentence of the opening paragraph about what we mean by "Between its executions", such that a given execution defines a _generator instance_.

The approach would avoid wholesale re-numbering of existing test cases and I think is a better solution in general. Thoughts?

Gregg

> --
> Markus Lanthaler
> @markuslanthaler
>
Received on Sunday, 20 October 2013 20:11:13 UTC