Re: JSON-LD bnode canonical naming algorithm from Gregg Kellogg on 2011-05-30 (public-linked-json@w3.org from May 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Mon, 30 May 2011 14:51:09 -0400
To: Dave Longley <dlongley@digitalbazaar.com>
CC: "public-linked-json@w3.org" <public-linked-json@w3.org>
Message-ID: <42805D3E-A1E1-4600-A53E-F0F44565FBC7@kellogg-assoc.com>
I think we need to be sure we understand the use cases for JSON-LD normalization, and indeed normalization requirements for any RDF serialization. I'm worried that this algorithm is far too complex for what is intended to be a simple format. (No reflection on the author, BNodes are just inherently problematic for this purpose).

The question is, is there a legitimate need to sign an arbitrary graph? A graph author can easily avoid BNode normalization problems by either skolemizing BNodes using internal identifiers, or simply author using URIs for nodes having multiple references.

If I were to want to normalize a remote graph, is it necessary to ensure that a third party will normalize the same way? I could always re-create the normalization myself, if I were to internalize the remote graph first, and be able to count on my own internal identifiers.

As I understand it, skolemization rules do not require that two different entities skolemize in the same way, only that a skolemized BNode can be recognized by the generating entity for the purpose of performing updates.

Gregg Kellogg

Sent from my iPad

On May 29, 2011, at 14:59, "Dave Longley" <dlongley@digitalbazaar.com> wrote:

> There's been some talk about trying to canonically-name bnodes during 
> JSON-LD normalization. Below are some preliminary thoughts on an 
> algorithm for resolving bnode-equality so that JSON-LD normalizers 
> produce the same names for bnodes. There is currently no proof for 
> correctness and there are likely some errors. It's also fairly complex, 
> there are probably some ways to simplify it. It should also be noted 
> that it isn't a streaming algorithm (rather it is in-memory or 
> database-backed), but hopefully this doesn't matter for nearly all 
> normalization use cases.
> 
> Anyway, I'm throwing this out there to to get the ball rolling... so 
> here are the changes to JSON-LD normalization to support named blank 
> nodes (bnodes):
> 
> Run the current normalization algorithm without throwing errors for 
> named bnodes.
> 
> Now, to normalize bnode names:
> 
> 1. Collect all bnodes in a list, named and unnamed.
> 2. For each unnamed bnode, generate a unique name for it.
> 3. Collect all direct references to all bnodes in a map of lists.
> 
> At this point, there should be unique names for every bnode and a map of 
> all direct references to each one. But now the names need to be 
> canonicalized.
> 
> Since bnode names cannot be used to identify them, we must now 
> canonically sort them based on information other than their names and 
> other than other bnodes. We begin by sorting them according to all of 
> their attributes (properties and references) that do not involve other 
> bnodes. If we find equivalent bnodes after comparing all of their 
> properties, we will have to do some deeper graph traversals to resolve 
> equality. Once all the bnodes are sorted, they will be renamed in their 
> sort order which will give them normalized names. Once they have been 
> given canonical names, they can be merged into a list with the other 
> named nodes in the graph and normalization will be complete.
> 
> The algorithm for deterministically sorting bnodes based on their 
> relationships to named-nodes and literals is called "BasicSort" here:
> 
> BasicSort Algorithm (when comparing two bnodes):
> 1. Compare the number of properties. The bnode with fewer properties 
> comes first.
> 2. Compare the number of references (list length in the reference map). 
> The bnode with fewer references comes first.
> 3. Compare alphabetically sorted-properties. The bnode with the 
> alphabetically-first property comes first.
> 4. For each property, compare object values.
> 5. The bnode with a string object comes first.
> 5.a. The bnode with the alphabetically-first string comes first.
> 6. The bnode with a @literal comes first.
> 6.a. The bnode with the alphabetically-first @literal comes first.
> 6.b. The bnode with the alphabetically-first @datatype comes first.
> 6.c. The bnode with a @language comes first.
> 6.d. The bnode with the alphabetically-first @language comes first.
> 7. The bnode with a non-bnode @iri comes first.
> 7.a. The bnode with the alphabetically-first non-bnode @iri comes first.
> 8. Compare alphabetically-sorted non-bnode references. The bnode with 
> the alphabetically-first reference comes first.
> 
> (End BasicSort Algorithm)
> 
> At this point the comparisons require deeper traversing of the graph to 
> resolve bnode equality. We build "paths" that attempt to differentiate 
> one node from another. A "path" is built by traversing up a bnode's 
> reference tree. If a cycle is encountered, the path stops. There can be 
> no repeats of the same node in a path and the bnode the path is for 
> cannot itself appear in the path. The path also stops when a node is 
> encountered that has no references to it. It may be possible to assign a 
> number as a weight for nodes for doing these calculations. That might 
> make the algorithm speedier or easier to implement, but it is an 
> implementation detail that isn't required in the spec.
> 
> Each bnode will have a set of paths. Each path can be described as a 
> list of node names and properties. For instance, if a bnode was 
> referenced by "_:a6" at property "property1", the first element in that 
> path would be "(:_a6, property1)". This continues up the reference tree 
> for every reference. An example path might be:
> 
> "(:_a6,property1),(http://foo,property7),(:_a3,property3)"
> 
> (This might be a good target for algorithm complexity reduction, when 
> comparing two bnodes we shouldn't need to generate all of the paths for 
> a given node reference that is the same between the two bnodes being 
> compared ... since those paths would be identical).
> 
> Once the paths have been generated for two bnodes, the sets can be 
> compared against one another to further differentiate the bnodes.
> 
> How to compare path sets:
> 
> 1. Match up equal paths and remove them from further comparison.
> 2. The bnode with the fewest number of shortest paths comes first (if 
> the bnodes each have 2 paths of length 2, check paths of length 3, and 
> so on).
> 3. Compare each path against each other path of the same length.
> 3.a. The bnode with the path with the IRI node comes first.
> 3.b. The bnode with the path with the alphabetically-first IRI comes first.
> 3.c. The bnode with the path with the alphabetically-first property 
> comes first.
> 3.d. The bnode with the path with the lesser bnode from the BasicSort 
> Algorithm comes first.
> 
> At this point, any two bnodes that have not been differentiated from 
> each other are equivalent. All equal-bnodes are placed into "buckets" 
> and will be named last. The buckets should be arranged in sort-order, 
> with the lesser bnode buckets first in the list.
> 
> All of the bnodes that have been differientiated should have new 
> canonical names generated and assigned to them, replacing their old 
> names in all of their references. When a new canonical name is 
> generated, if it already exists on one of the bnodes to be named, the 
> existing bnode must be given a temporary name from a different namespace 
> from the canonical namespace.
> 
> Next, all of the equivalent bnodes must be named.
> 
> For each bucket of equal-bnodes, for each bnode, find the 
> alphabetically-first and unique named-node reference. In other words, 
> get the alphabetically-first name of the direct reference for each of 
> the bnodes in the bucket. The reference must be to an IRI-node or a 
> bnode that has already been named. If a unique name is found that is 
> alphabetically first amongst all bnodes in the bucket, then the bnode 
> with that reference is assigned the next canonical name and removed from 
> the bucket. If a unique name is not found, repeat until all of the 
> references for all of the bnodes in the bucket have been exhausted or a 
> bnode has been canonically-named. If all of the references have been 
> exhausted, then assign the next name to the first bnode in the bucket. 
> This is OK to do because there should be no remaining deterministic ways 
> to pick the next name, which means that any choice made is the same as 
> any other. Therefore, any JSON-LD normalizer will make "the same" 
> decision. Repeat until the bucket is empty. Repeat until the bucket-list 
> is empty.
> 
> Now put all of the nodes in a sorted array. Normalization is complete.
> 
> The specific algorithm for generating the exact canonical names (eg: 
> _:a1, _:a2, etc.) is not mentioned here but should be trivial.
> 
> -Dave
> 
> -- 
> Dave Longley
> CTO
> Digital Bazaar, Inc.
> 
> 
>
Received on Monday, 30 May 2011 18:51:46 UTC