- From: David Booth <david@dbooth.org>
- Date: Wed, 01 Oct 2014 17:09:46 -0400
- To: Alan Ruttenberg <alanruttenberg@gmail.com>
- CC: Tim Berners-Lee <timbl@w3.org>, Sandro Hawke <sandro@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
Hi Alan, Yes, in the general case of *unrestricted* blank node use, deterministic naming of blank nodes would be as difficult as the graph isomorphism problem. That is exactly the motivation behind "Well Behaved RDF"[1]: to modestly restrict the use of blank nodes to avoid that complexity problem, while retaining the most important and common uses of blank nodes. Doing so would change the complexity from a graph isomorphism problem to a tree problem, enabling deterministic naming of blank nodes to be practical. David 1. Well Behaved RDF: http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf On 10/01/2014 11:03 AM, Alan Ruttenberg wrote: > Wouldn't deterministic naming of blank nodes mean there was an effective > solution to the graph isomorphism problem, which is known to at least be > outside P? > > http://mathworld.wolfram.com/GraphIsomorphism.html > > -Alan > > > On Fri, Sep 26, 2014 at 6:48 PM, David Booth <david@dbooth.org > <mailto:david@dbooth.org>> wrote: > > On 09/26/2014 06:20 PM, Tim Berners-Lee wrote: > > > On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org > <mailto:sandro@w3.org> > <mailto:sandro@w3.org <mailto:sandro@w3.org>>> wrote: > > On 09/23/2014 05:34 PM, David Booth wrote: > > BTW, I want to draw attention to the fact that the need > for defining > an RDF-specific PATCH operation is *entirely* a > consequence of RDF's > allowance of unrestricted blank nodes. I do not think > that blank > nodes should be eliminated from RDF, but I am convinced > that RDF's > current treatment of blank nodes is a significant design > flaw that > has *many* downstream effects that are ultimately > detrimental to > RDF's adoption. The need for RDF PATCH is another example. > > Unix/linux diff and patch utilities have been used > successfully for > *decades*, with many other information representations. > Imagine how > simple and easy it would be if we could just generate > canonical > N-Triples and use standard diff and patch against that! > But we can't, > because blank nodes are unstable across RDF > serializations and no > canonical way to generate them has been standardized. > This, in turn > is because generating a canonical form of unrestricted > RDF is a hard > problem (NP-complete), because of blank nodes. The > problem is *much* > easier if the use of blank nodes is limited to > *implicit* blank nodes > -- those that are generated implicitly by the use of > square brackets > "[]" or parentheses "()" for lists in Turtle -- and > indeed this is > the vast majority of blank node use. (See "Everything > You Always > Wanted to Know About Blank Nodes", by Hogan, Arenas, > Mallea and > Polleres: > http://www.__websemanticsjournal.org/index.__php/ps/article/viewFile/365/__387 > <http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387> > ) > > For this reason the use of "Well Behaved RDF" was > proposed, which > limits the use of blank nodes to implicit blank nodes: > http://dbooth.org/2013/well-__behaved-rdf/Booth-well-__behaved-rdf.pdf > <http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf> > I don't know if Well Behaved RDF is the best solution to > this > problem. Maybe someone will come along with a better > idea. But I am > convinced that the current treatment of blank nodes in > RDF is a > serious problem that we should fix in order to make RDF > simpler to > use, understand and adopt. > > I really don't like having to make excuses for RDF when > it cannot be > used in a similar way as nearly every other information > representation -- such as being able to easily compare > two RDF > documents for "equality" (which in RDF becomes a complex > graph > isomorphism problem) or generate a simple diff and patch > -- all > because of RDF's unrestricted treatment of blank nodes. > > Clearly this is not something that the Linked Data > Platform working > group can fix. But I think it is important to bring it > to people's > attention, in the hope that we will someday soon have > the creativity > and gumption to fix it. > > I should also acknowledge that there are some who do not > feel that > RDF's treatment of blank nodes is a problem. Fine. It > may not be a > problem to an elite few who are well steeped in the > subtleties of > description logic, model theory and RDF Semantics, and > who don't mind > having to use RDF-specific tools instead of generic > tools. But > having tried for over 10 years to explain RDF to a wider > audience of > regular software developers, I am convinced that > subtleties like > RDF's treatment of blank nodes *are* a problem to a much > wider > audience of *potential* RDF users who would be more > inclined to adopt > RDF if it didn't have complexities like this. As it is > they are more > likely to stick with JSON or XML, whose complexities > they already > know, rather than venturing into the obscure and > esoteric world of RDF. > > RDF tools are not as mature as those for XML or even > JSON, which is > much younger than RDF. I believe blank nodes are one > specific reason > they're not. The fact that we still don't even have a > simple, > standard way to compare RDF documents and compute diffs > and patches, > is a perfect example. > > David > > > > I agree that it makes sense to have good terminology for > graphs that > can be serialized in Turtle without blank node labels, and > perhaps to > focus diff on these nice graphs. (When I clicked on your > Well-Behaved RDF link, my PDF viewer remembered I was on > page six. :-) ) > > How would you name the blank nodes when serializing this in > n-triples > or n-quads? I guess given triple S P O, where O is a blank > node, > you'd name O based on the hash of the names of S and P? > Something like > > o.asNTriplesTerm = "_:a"+sha1hex(s.__asNTriplesTerm+" > "+p.asNTriplesTerm)? > > > Or maybe leave off the NTriples baggage, the "<"...">" and > "_:". Maybe: > > o.id <http://o.id> = sha1hex(s.id <http://s.id>+" "+p.id > <http://p.id>) > > You'd have to evaluate these in the right order, but that > would be the > order you parsed them from Turtle, so it should be fast and > deterministic. Hm. Can we avoid the repeated hashing? > I suppose > so, with some kind of path expression: > > Given: <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> > 444 ] ) ] ) ] > > The subject of <p4> 444 would have its id determined by > hashing > (s1, p1, p2, 4, p3, 3). > > (I'm treating integer list positions as if they were > properties.) With > a little cleverness, caching those intermediate hashes, I > think that > could run almost as fast as computing a hash of the turtle > file (ie > very fast). > > Wow, this might work. Very good idea, David. Beyond > the fact > that it only works on this kind of graph, does anyone see > any problem > with it? > > > Hmmm , suppose a blank node can be identified by more than one path, > then the hashed path would not be unique and so you couldn't use > them to > compare blank nodes > > > If the algorithm deterministically chose which path to use in > generating the hash -- say, the first alphabetically -- then it > would always generate the same hashes for the same (isomorphic) > graphs. Even if the graph changed slightly, such an algorithm could > often choose the same path, so it would be fairly robust to minor > changes in the graph, which is a very desirable property. And in > the cases where it did choose a different path, you would end up > with two URIs for what should have been considered the same node, > and owl:sameAs could be used to indicate their equivalence, once > their equivalence is detected through application-specific means. > > David > >
Received on Wednesday, 1 October 2014 21:10:16 UTC