- From: Dave Longley <dlongley@digitalbazaar.com>
- Date: Sat, 28 May 2011 14:43:58 -0400
- To: public-linked-json@w3.org
There's been some talk about trying to canonically-name bnodes during JSON-LD normalization. Below are some preliminary thoughts on an algorithm for resolving bnode-equality so that JSON-LD normalizers produce the same names for bnodes. There is currently no proof for correctness and there are likely some errors. It's also fairly complex, there are probably some ways to simplify it. It should also be noted that it isn't a streaming algorithm (rather it is in-memory or database-backed), but hopefully this doesn't matter for nearly all normalization use cases. Anyway, I'm throwing this out there to to get the ball rolling... so here are the changes to JSON-LD normalization to support named blank nodes (bnodes): Run the current normalization algorithm without throwing errors for named bnodes. Now, to normalize bnode names: 1. Collect all bnodes in a list, named and unnamed. 2. For each unnamed bnode, generate a unique name for it. 3. Collect all direct references to all bnodes in a map of lists. At this point, there should be unique names for every bnode and a map of all direct references to each one. But now the names need to be canonicalized. Since bnode names cannot be used to identify them, we must now canonically sort them based on information other than their names and other than other bnodes. We begin by sorting them according to all of their attributes (properties and references) that do not involve other bnodes. If we find equivalent bnodes after comparing all of their properties, we will have to do some deeper graph traversals to resolve equality. Once all the bnodes are sorted, they will be renamed in their sort order which will give them normalized names. Once they have been given canonical names, they can be merged into a list with the other named nodes in the graph and normalization will be complete. The algorithm for deterministically sorting bnodes based on their relationships to named-nodes and literals is called "BasicSort" here: BasicSort Algorithm (when comparing two bnodes): 1. Compare the number of properties. The bnode with fewer properties comes first. 2. Compare the number of references (list length in the reference map). The bnode with fewer references comes first. 3. Compare alphabetically sorted-properties. The bnode with the alphabetically-first property comes first. 4. For each property, compare object values. 5. The bnode with a string object comes first. 5.a. The bnode with the alphabetically-first string comes first. 6. The bnode with a @literal comes first. 6.a. The bnode with the alphabetically-first @literal comes first. 6.b. The bnode with the alphabetically-first @datatype comes first. 6.c. The bnode with a @language comes first. 6.d. The bnode with the alphabetically-first @language comes first. 7. The bnode with a non-bnode @iri comes first. 7.a. The bnode with the alphabetically-first non-bnode @iri comes first. 8. Compare alphabetically-sorted non-bnode references. The bnode with the alphabetically-first reference comes first. (End BasicSort Algorithm) At this point the comparisons require deeper traversing of the graph to resolve bnode equality. We build "paths" that attempt to differentiate one node from another. A "path" is built by traversing up a bnode's reference tree. If a cycle is encountered, the path stops. There can be no repeats of the same node in a path and the bnode the path is for cannot itself appear in the path. The path also stops when a node is encountered that has no references to it. It may be possible to assign a number as a weight for nodes for doing these calculations. That might make the algorithm speedier or easier to implement, but it is an implementation detail that isn't required in the spec. Each bnode will have a set of paths. Each path can be described as a list of node names and properties. For instance, if a bnode was referenced by "_:a6" at property "property1", the first element in that path would be "(:_a6, property1)". This continues up the reference tree for every reference. An example path might be: "(:_a6,property1),(http://foo,property7),(:_a3,property3)" (This might be a good target for algorithm complexity reduction, when comparing two bnodes we shouldn't need to generate all of the paths for a given node reference that is the same between the two bnodes being compared ... since those paths would be identical). Once the paths have been generated for two bnodes, the sets can be compared against one another to further differentiate the bnodes. How to compare path sets: 1. Match up equal paths and remove them from further comparison. 2. The bnode with the fewest number of shortest paths comes first (if the bnodes each have 2 paths of length 2, check paths of length 3, and so on). 3. Compare each path against each other path of the same length. 3.a. The bnode with the path with the IRI node comes first. 3.b. The bnode with the path with the alphabetically-first IRI comes first. 3.c. The bnode with the path with the alphabetically-first property comes first. 3.d. The bnode with the path with the lesser bnode from the BasicSort Algorithm comes first. At this point, any two bnodes that have not been differentiated from each other are equivalent. All equal-bnodes are placed into "buckets" and will be named last. The buckets should be arranged in sort-order, with the lesser bnode buckets first in the list. All of the bnodes that have been differientiated should have new canonical names generated and assigned to them, replacing their old names in all of their references. When a new canonical name is generated, if it already exists on one of the bnodes to be named, the existing bnode must be given a temporary name from a different namespace from the canonical namespace. Next, all of the equivalent bnodes must be named. For each bucket of equal-bnodes, for each bnode, find the alphabetically-first and unique named-node reference. In other words, get the alphabetically-first name of the direct reference for each of the bnodes in the bucket. The reference must be to an IRI-node or a bnode that has already been named. If a unique name is found that is alphabetically first amongst all bnodes in the bucket, then the bnode with that reference is assigned the next canonical name and removed from the bucket. If a unique name is not found, repeat until all of the references for all of the bnodes in the bucket have been exhausted or a bnode has been canonically-named. If all of the references have been exhausted, then assign the next name to the first bnode in the bucket. This is OK to do because there should be no remaining deterministic ways to pick the next name, which means that any choice made is the same as any other. Therefore, any JSON-LD normalizer will make "the same" decision. Repeat until the bucket is empty. Repeat until the bucket-list is empty. Now put all of the nodes in a sorted array. Normalization is complete. The specific algorithm for generating the exact canonical names (eg: _:a1, _:a2, etc.) is not mentioned here but should be trivial. -Dave -- Dave Longley CTO Digital Bazaar, Inc.
Received on Sunday, 29 May 2011 21:58:32 UTC