- From: Gregg Kellogg <gregg@kellogg-assoc.com>
- Date: Mon, 30 May 2011 14:51:09 -0400
- To: Dave Longley <dlongley@digitalbazaar.com>
- CC: "public-linked-json@w3.org" <public-linked-json@w3.org>
I think we need to be sure we understand the use cases for JSON-LD normalization, and indeed normalization requirements for any RDF serialization. I'm worried that this algorithm is far too complex for what is intended to be a simple format. (No reflection on the author, BNodes are just inherently problematic for this purpose). The question is, is there a legitimate need to sign an arbitrary graph? A graph author can easily avoid BNode normalization problems by either skolemizing BNodes using internal identifiers, or simply author using URIs for nodes having multiple references. If I were to want to normalize a remote graph, is it necessary to ensure that a third party will normalize the same way? I could always re-create the normalization myself, if I were to internalize the remote graph first, and be able to count on my own internal identifiers. As I understand it, skolemization rules do not require that two different entities skolemize in the same way, only that a skolemized BNode can be recognized by the generating entity for the purpose of performing updates. Gregg Kellogg Sent from my iPad On May 29, 2011, at 14:59, "Dave Longley" <dlongley@digitalbazaar.com> wrote: > There's been some talk about trying to canonically-name bnodes during > JSON-LD normalization. Below are some preliminary thoughts on an > algorithm for resolving bnode-equality so that JSON-LD normalizers > produce the same names for bnodes. There is currently no proof for > correctness and there are likely some errors. It's also fairly complex, > there are probably some ways to simplify it. It should also be noted > that it isn't a streaming algorithm (rather it is in-memory or > database-backed), but hopefully this doesn't matter for nearly all > normalization use cases. > > Anyway, I'm throwing this out there to to get the ball rolling... so > here are the changes to JSON-LD normalization to support named blank > nodes (bnodes): > > Run the current normalization algorithm without throwing errors for > named bnodes. > > Now, to normalize bnode names: > > 1. Collect all bnodes in a list, named and unnamed. > 2. For each unnamed bnode, generate a unique name for it. > 3. Collect all direct references to all bnodes in a map of lists. > > At this point, there should be unique names for every bnode and a map of > all direct references to each one. But now the names need to be > canonicalized. > > Since bnode names cannot be used to identify them, we must now > canonically sort them based on information other than their names and > other than other bnodes. We begin by sorting them according to all of > their attributes (properties and references) that do not involve other > bnodes. If we find equivalent bnodes after comparing all of their > properties, we will have to do some deeper graph traversals to resolve > equality. Once all the bnodes are sorted, they will be renamed in their > sort order which will give them normalized names. Once they have been > given canonical names, they can be merged into a list with the other > named nodes in the graph and normalization will be complete. > > The algorithm for deterministically sorting bnodes based on their > relationships to named-nodes and literals is called "BasicSort" here: > > BasicSort Algorithm (when comparing two bnodes): > 1. Compare the number of properties. The bnode with fewer properties > comes first. > 2. Compare the number of references (list length in the reference map). > The bnode with fewer references comes first. > 3. Compare alphabetically sorted-properties. The bnode with the > alphabetically-first property comes first. > 4. For each property, compare object values. > 5. The bnode with a string object comes first. > 5.a. The bnode with the alphabetically-first string comes first. > 6. The bnode with a @literal comes first. > 6.a. The bnode with the alphabetically-first @literal comes first. > 6.b. The bnode with the alphabetically-first @datatype comes first. > 6.c. The bnode with a @language comes first. > 6.d. The bnode with the alphabetically-first @language comes first. > 7. The bnode with a non-bnode @iri comes first. > 7.a. The bnode with the alphabetically-first non-bnode @iri comes first. > 8. Compare alphabetically-sorted non-bnode references. The bnode with > the alphabetically-first reference comes first. > > (End BasicSort Algorithm) > > At this point the comparisons require deeper traversing of the graph to > resolve bnode equality. We build "paths" that attempt to differentiate > one node from another. A "path" is built by traversing up a bnode's > reference tree. If a cycle is encountered, the path stops. There can be > no repeats of the same node in a path and the bnode the path is for > cannot itself appear in the path. The path also stops when a node is > encountered that has no references to it. It may be possible to assign a > number as a weight for nodes for doing these calculations. That might > make the algorithm speedier or easier to implement, but it is an > implementation detail that isn't required in the spec. > > Each bnode will have a set of paths. Each path can be described as a > list of node names and properties. For instance, if a bnode was > referenced by "_:a6" at property "property1", the first element in that > path would be "(:_a6, property1)". This continues up the reference tree > for every reference. An example path might be: > > "(:_a6,property1),(http://foo,property7),(:_a3,property3)" > > (This might be a good target for algorithm complexity reduction, when > comparing two bnodes we shouldn't need to generate all of the paths for > a given node reference that is the same between the two bnodes being > compared ... since those paths would be identical). > > Once the paths have been generated for two bnodes, the sets can be > compared against one another to further differentiate the bnodes. > > How to compare path sets: > > 1. Match up equal paths and remove them from further comparison. > 2. The bnode with the fewest number of shortest paths comes first (if > the bnodes each have 2 paths of length 2, check paths of length 3, and > so on). > 3. Compare each path against each other path of the same length. > 3.a. The bnode with the path with the IRI node comes first. > 3.b. The bnode with the path with the alphabetically-first IRI comes first. > 3.c. The bnode with the path with the alphabetically-first property > comes first. > 3.d. The bnode with the path with the lesser bnode from the BasicSort > Algorithm comes first. > > At this point, any two bnodes that have not been differentiated from > each other are equivalent. All equal-bnodes are placed into "buckets" > and will be named last. The buckets should be arranged in sort-order, > with the lesser bnode buckets first in the list. > > All of the bnodes that have been differientiated should have new > canonical names generated and assigned to them, replacing their old > names in all of their references. When a new canonical name is > generated, if it already exists on one of the bnodes to be named, the > existing bnode must be given a temporary name from a different namespace > from the canonical namespace. > > Next, all of the equivalent bnodes must be named. > > For each bucket of equal-bnodes, for each bnode, find the > alphabetically-first and unique named-node reference. In other words, > get the alphabetically-first name of the direct reference for each of > the bnodes in the bucket. The reference must be to an IRI-node or a > bnode that has already been named. If a unique name is found that is > alphabetically first amongst all bnodes in the bucket, then the bnode > with that reference is assigned the next canonical name and removed from > the bucket. If a unique name is not found, repeat until all of the > references for all of the bnodes in the bucket have been exhausted or a > bnode has been canonically-named. If all of the references have been > exhausted, then assign the next name to the first bnode in the bucket. > This is OK to do because there should be no remaining deterministic ways > to pick the next name, which means that any choice made is the same as > any other. Therefore, any JSON-LD normalizer will make "the same" > decision. Repeat until the bucket is empty. Repeat until the bucket-list > is empty. > > Now put all of the nodes in a sorted array. Normalization is complete. > > The specific algorithm for generating the exact canonical names (eg: > _:a1, _:a2, etc.) is not mentioned here but should be trivial. > > -Dave > > -- > Dave Longley > CTO > Digital Bazaar, Inc. > > >
Received on Monday, 30 May 2011 18:51:46 UTC