- From: David Booth <david@dbooth.org>
- Date: Thu, 25 Sep 2014 15:03:01 -0400
- To: Sandro Hawke <sandro@w3.org>, semantic-web@w3.org
On 09/23/2014 07:37 PM, Sandro Hawke wrote: > On 09/23/2014 05:34 PM, David Booth wrote: >> BTW, I want to draw attention to the fact that the need for defining >> an RDF-specific PATCH operation is *entirely* a consequence of RDF's >> allowance of unrestricted blank nodes. I do not think that blank >> nodes should be eliminated from RDF, but I am convinced that RDF's >> current treatment of blank nodes is a significant design flaw that has >> *many* downstream effects that are ultimately detrimental to RDF's >> adoption. The need for RDF PATCH is another example. >> >> Unix/linux diff and patch utilities have been used successfully for >> *decades*, with many other information representations. Imagine how >> simple and easy it would be if we could just generate canonical >> N-Triples and use standard diff and patch against that! But we can't, >> because blank nodes are unstable across RDF serializations and no >> canonical way to generate them has been standardized. This, in turn >> is because generating a canonical form of unrestricted RDF is a hard >> problem (NP-complete), because of blank nodes. The problem is *much* >> easier if the use of blank nodes is limited to *implicit* blank nodes >> -- those that are generated implicitly by the use of square brackets >> "[]" or parentheses "()" for lists in Turtle -- and indeed this is the >> vast majority of blank node use. (See "Everything You Always Wanted >> to Know About Blank Nodes", by Hogan, Arenas, Mallea and Polleres: >> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387 >> ) >> >> For this reason the use of "Well Behaved RDF" was proposed, which >> limits the use of blank nodes to implicit blank nodes: >> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf >> I don't know if Well Behaved RDF is the best solution to this problem. >> Maybe someone will come along with a better idea. But I am convinced >> that the current treatment of blank nodes in RDF is a serious problem >> that we should fix in order to make RDF simpler to use, understand and >> adopt. >> >> I really don't like having to make excuses for RDF when it cannot be >> used in a similar way as nearly every other information representation >> -- such as being able to easily compare two RDF documents for >> "equality" (which in RDF becomes a complex graph isomorphism problem) >> or generate a simple diff and patch -- all because of RDF's >> unrestricted treatment of blank nodes. >> >> Clearly this is not something that the Linked Data Platform working >> group can fix. But I think it is important to bring it to people's >> attention, in the hope that we will someday soon have the creativity >> and gumption to fix it. >> >> I should also acknowledge that there are some who do not feel that >> RDF's treatment of blank nodes is a problem. Fine. It may not be a >> problem to an elite few who are well steeped in the subtleties of >> description logic, model theory and RDF Semantics, and who don't mind >> having to use RDF-specific tools instead of generic tools. But having >> tried for over 10 years to explain RDF to a wider audience of regular >> software developers, I am convinced that subtleties like RDF's >> treatment of blank nodes *are* a problem to a much wider audience of >> *potential* RDF users who would be more inclined to adopt RDF if it >> didn't have complexities like this. As it is they are more likely to >> stick with JSON or XML, whose complexities they already know, rather >> than venturing into the obscure and esoteric world of RDF. >> >> RDF tools are not as mature as those for XML or even JSON, which is >> much younger than RDF. I believe blank nodes are one specific reason >> they're not. The fact that we still don't even have a simple, >> standard way to compare RDF documents and compute diffs and patches, >> is a perfect example. >> >> David >> >> > > I agree that it makes sense to have good terminology for graphs that can > be serialized in Turtle without blank node labels, and perhaps to focus > diff on these nice graphs. (When I clicked on your Well-Behaved RDF > link, my PDF viewer remembered I was on page six. :-) ) > > How would you name the blank nodes when serializing this in n-triples or > n-quads? If you consider the graph that is formed when a blank node has an arc to another blank node -- the blank node graph -- then a Well Behaved RDF graph is a forest of blank node trees. Deterministic labeling of blank nodes in each tree can be very fast -- basically a bottom-up process that propagates up through the tree. The paper by Hogan, Arenas, Mallea and Polleres discusses this further: http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387 > I guess given triple S P O, where O is a blank node, you'd > name O based on the hash of the names of S and P? Something like > > o.asNTriplesTerm = "_:a"+sha1hex(s.asNTriplesTerm+" "+p.asNTriplesTerm)? > > > Or maybe leave off the NTriples baggage, the "<"...">" and "_:". Maybe: > > o.id = sha1hex(s.id+" "+p.id) > > You'd have to evaluate these in the right order, but that would be the > order you parsed them from Turtle, so it should be fast and > deterministic. Hm. Can we avoid the repeated hashing? I suppose > so, with some kind of path expression: > > Given: <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> 444 ] ) ] ) ] > > The subject of <p4> 444 would have its id determined by hashing (s1, > p1, p2, 4, p3, 3). > > (I'm treating integer list positions as if they were properties.) With a > little cleverness, caching those intermediate hashes, I think that could > run almost as fast as computing a hash of the turtle file (ie very fast). I think what you're describing is essentially a bottom-up recursive labeling of the blank nodes in each blank node tree. This makes sense, and is how I have always envisioned doing it. Using hashed names for the generated URIs instead of using a concatenation of path components is an optimization that shortens the generated URIs. Someone else -- I think somewhere in Europe -- described such an algorithm in detail a few years ago in a blog post, but I don't remember who it was and I cannot seem to find the link. Maybe someone else would remember it and supply a link? I also see this paper, which describes an algorithm for hashing and canonicalizing RDF graphs in N3, though I have not read it: http://www.it.uc3m.es/jaf/papers/2010/jcss-hash-n3/fisteus-hash-n3.pdf David > > Wow, this might work. Very good idea, David. Beyond the fact that > it only works on this kind of graph, does anyone see any problem with it? > > -- Sandro >
Received on Thursday, 25 September 2014 19:03:33 UTC