JSON-LD bnode canonical naming algorithm from Dave Longley on 2011-05-28 (public-linked-json@w3.org from May 2011)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Sat, 28 May 2011 14:43:58 -0400
To: public-linked-json@w3.org
Message-ID: <4DE1426E.5040105@digitalbazaar.com>
There's been some talk about trying to canonically-name bnodes during 
JSON-LD normalization. Below are some preliminary thoughts on an 
algorithm for resolving bnode-equality so that JSON-LD normalizers 
produce the same names for bnodes. There is currently no proof for 
correctness and there are likely some errors. It's also fairly complex, 
there are probably some ways to simplify it. It should also be noted 
that it isn't a streaming algorithm (rather it is in-memory or 
database-backed), but hopefully this doesn't matter for nearly all 
normalization use cases.

Anyway, I'm throwing this out there to to get the ball rolling... so 
here are the changes to JSON-LD normalization to support named blank 
nodes (bnodes):

Run the current normalization algorithm without throwing errors for 
named bnodes.

Now, to normalize bnode names:

1. Collect all bnodes in a list, named and unnamed.
2. For each unnamed bnode, generate a unique name for it.
3. Collect all direct references to all bnodes in a map of lists.

At this point, there should be unique names for every bnode and a map of 
all direct references to each one. But now the names need to be 
canonicalized.

Since bnode names cannot be used to identify them, we must now 
canonically sort them based on information other than their names and 
other than other bnodes. We begin by sorting them according to all of 
their attributes (properties and references) that do not involve other 
bnodes. If we find equivalent bnodes after comparing all of their 
properties, we will have to do some deeper graph traversals to resolve 
equality. Once all the bnodes are sorted, they will be renamed in their 
sort order which will give them normalized names. Once they have been 
given canonical names, they can be merged into a list with the other 
named nodes in the graph and normalization will be complete.

The algorithm for deterministically sorting bnodes based on their 
relationships to named-nodes and literals is called "BasicSort" here:

BasicSort Algorithm (when comparing two bnodes):
1. Compare the number of properties. The bnode with fewer properties 
comes first.
2. Compare the number of references (list length in the reference map). 
The bnode with fewer references comes first.
3. Compare alphabetically sorted-properties. The bnode with the 
alphabetically-first property comes first.
4. For each property, compare object values.
5. The bnode with a string object comes first.
5.a. The bnode with the alphabetically-first string comes first.
6. The bnode with a @literal comes first.
6.a. The bnode with the alphabetically-first @literal comes first.
6.b. The bnode with the alphabetically-first @datatype comes first.
6.c. The bnode with a @language comes first.
6.d. The bnode with the alphabetically-first @language comes first.
7. The bnode with a non-bnode @iri comes first.
7.a. The bnode with the alphabetically-first non-bnode @iri comes first.
8. Compare alphabetically-sorted non-bnode references. The bnode with 
the alphabetically-first reference comes first.

(End BasicSort Algorithm)

At this point the comparisons require deeper traversing of the graph to 
resolve bnode equality. We build "paths" that attempt to differentiate 
one node from another. A "path" is built by traversing up a bnode's 
reference tree. If a cycle is encountered, the path stops. There can be 
no repeats of the same node in a path and the bnode the path is for 
cannot itself appear in the path. The path also stops when a node is 
encountered that has no references to it. It may be possible to assign a 
number as a weight for nodes for doing these calculations. That might 
make the algorithm speedier or easier to implement, but it is an 
implementation detail that isn't required in the spec.

Each bnode will have a set of paths. Each path can be described as a 
list of node names and properties. For instance, if a bnode was 
referenced by "_:a6" at property "property1", the first element in that 
path would be "(:_a6, property1)". This continues up the reference tree 
for every reference. An example path might be:

"(:_a6,property1),(http://foo,property7),(:_a3,property3)"

(This might be a good target for algorithm complexity reduction, when 
comparing two bnodes we shouldn't need to generate all of the paths for 
a given node reference that is the same between the two bnodes being 
compared ... since those paths would be identical).

Once the paths have been generated for two bnodes, the sets can be 
compared against one another to further differentiate the bnodes.

How to compare path sets:

1. Match up equal paths and remove them from further comparison.
2. The bnode with the fewest number of shortest paths comes first (if 
the bnodes each have 2 paths of length 2, check paths of length 3, and 
so on).
3. Compare each path against each other path of the same length.
3.a. The bnode with the path with the IRI node comes first.
3.b. The bnode with the path with the alphabetically-first IRI comes first.
3.c. The bnode with the path with the alphabetically-first property 
comes first.
3.d. The bnode with the path with the lesser bnode from the BasicSort 
Algorithm comes first.

At this point, any two bnodes that have not been differentiated from 
each other are equivalent. All equal-bnodes are placed into "buckets" 
and will be named last. The buckets should be arranged in sort-order, 
with the lesser bnode buckets first in the list.

All of the bnodes that have been differientiated should have new 
canonical names generated and assigned to them, replacing their old 
names in all of their references. When a new canonical name is 
generated, if it already exists on one of the bnodes to be named, the 
existing bnode must be given a temporary name from a different namespace 
from the canonical namespace.

Next, all of the equivalent bnodes must be named.

For each bucket of equal-bnodes, for each bnode, find the 
alphabetically-first and unique named-node reference. In other words, 
get the alphabetically-first name of the direct reference for each of 
the bnodes in the bucket. The reference must be to an IRI-node or a 
bnode that has already been named. If a unique name is found that is 
alphabetically first amongst all bnodes in the bucket, then the bnode 
with that reference is assigned the next canonical name and removed from 
the bucket. If a unique name is not found, repeat until all of the 
references for all of the bnodes in the bucket have been exhausted or a 
bnode has been canonically-named. If all of the references have been 
exhausted, then assign the next name to the first bnode in the bucket. 
This is OK to do because there should be no remaining deterministic ways 
to pick the next name, which means that any choice made is the same as 
any other. Therefore, any JSON-LD normalizer will make "the same" 
decision. Repeat until the bucket is empty. Repeat until the bucket-list 
is empty.

Now put all of the nodes in a sorted array. Normalization is complete.

The specific algorithm for generating the exact canonical names (eg: 
_:a1, _:a2, etc.) is not mentioned here but should be trivial.

-Dave

-- 
Dave Longley
CTO
Digital Bazaar, Inc.
Received on Sunday, 29 May 2011 21:58:32 UTC