- From: Alan Ruttenberg <alanruttenberg@gmail.com>
- Date: Wed, 1 Oct 2014 11:03:51 -0400
- To: David Booth <david@dbooth.org>
- Cc: Tim Berners-Lee <timbl@w3.org>, Sandro Hawke <sandro@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
- Message-ID: <CAFKQJ8mH197XDYDydiHfWD+8ZOhASETCh5qe6XoG3LNPTqYxwA@mail.gmail.com>
Wouldn't deterministic naming of blank nodes mean there was an effective solution to the graph isomorphism problem, which is known to at least be outside P? http://mathworld.wolfram.com/GraphIsomorphism.html -Alan On Fri, Sep 26, 2014 at 6:48 PM, David Booth <david@dbooth.org> wrote: > On 09/26/2014 06:20 PM, Tim Berners-Lee wrote: > >> >> On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org >> <mailto:sandro@w3.org>> wrote: >> >> On 09/23/2014 05:34 PM, David Booth wrote: >>> >>>> BTW, I want to draw attention to the fact that the need for defining >>>> an RDF-specific PATCH operation is *entirely* a consequence of RDF's >>>> allowance of unrestricted blank nodes. I do not think that blank >>>> nodes should be eliminated from RDF, but I am convinced that RDF's >>>> current treatment of blank nodes is a significant design flaw that >>>> has *many* downstream effects that are ultimately detrimental to >>>> RDF's adoption. The need for RDF PATCH is another example. >>>> >>>> Unix/linux diff and patch utilities have been used successfully for >>>> *decades*, with many other information representations. Imagine how >>>> simple and easy it would be if we could just generate canonical >>>> N-Triples and use standard diff and patch against that! But we can't, >>>> because blank nodes are unstable across RDF serializations and no >>>> canonical way to generate them has been standardized. This, in turn >>>> is because generating a canonical form of unrestricted RDF is a hard >>>> problem (NP-complete), because of blank nodes. The problem is *much* >>>> easier if the use of blank nodes is limited to *implicit* blank nodes >>>> -- those that are generated implicitly by the use of square brackets >>>> "[]" or parentheses "()" for lists in Turtle -- and indeed this is >>>> the vast majority of blank node use. (See "Everything You Always >>>> Wanted to Know About Blank Nodes", by Hogan, Arenas, Mallea and >>>> Polleres: >>>> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/ >>>> 387 >>>> ) >>>> >>>> For this reason the use of "Well Behaved RDF" was proposed, which >>>> limits the use of blank nodes to implicit blank nodes: >>>> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf >>>> I don't know if Well Behaved RDF is the best solution to this >>>> problem. Maybe someone will come along with a better idea. But I am >>>> convinced that the current treatment of blank nodes in RDF is a >>>> serious problem that we should fix in order to make RDF simpler to >>>> use, understand and adopt. >>>> >>>> I really don't like having to make excuses for RDF when it cannot be >>>> used in a similar way as nearly every other information >>>> representation -- such as being able to easily compare two RDF >>>> documents for "equality" (which in RDF becomes a complex graph >>>> isomorphism problem) or generate a simple diff and patch -- all >>>> because of RDF's unrestricted treatment of blank nodes. >>>> >>>> Clearly this is not something that the Linked Data Platform working >>>> group can fix. But I think it is important to bring it to people's >>>> attention, in the hope that we will someday soon have the creativity >>>> and gumption to fix it. >>>> >>>> I should also acknowledge that there are some who do not feel that >>>> RDF's treatment of blank nodes is a problem. Fine. It may not be a >>>> problem to an elite few who are well steeped in the subtleties of >>>> description logic, model theory and RDF Semantics, and who don't mind >>>> having to use RDF-specific tools instead of generic tools. But >>>> having tried for over 10 years to explain RDF to a wider audience of >>>> regular software developers, I am convinced that subtleties like >>>> RDF's treatment of blank nodes *are* a problem to a much wider >>>> audience of *potential* RDF users who would be more inclined to adopt >>>> RDF if it didn't have complexities like this. As it is they are more >>>> likely to stick with JSON or XML, whose complexities they already >>>> know, rather than venturing into the obscure and esoteric world of RDF. >>>> >>>> RDF tools are not as mature as those for XML or even JSON, which is >>>> much younger than RDF. I believe blank nodes are one specific reason >>>> they're not. The fact that we still don't even have a simple, >>>> standard way to compare RDF documents and compute diffs and patches, >>>> is a perfect example. >>>> >>>> David >>>> >>>> >>>> >>> I agree that it makes sense to have good terminology for graphs that >>> can be serialized in Turtle without blank node labels, and perhaps to >>> focus diff on these nice graphs. (When I clicked on your >>> Well-Behaved RDF link, my PDF viewer remembered I was on page six. :-) >>> ) >>> >>> How would you name the blank nodes when serializing this in n-triples >>> or n-quads? I guess given triple S P O, where O is a blank node, >>> you'd name O based on the hash of the names of S and P? Something like >>> >>> o.asNTriplesTerm = "_:a"+sha1hex(s.asNTriplesTerm+" >>> "+p.asNTriplesTerm)? >>> >>> >>> Or maybe leave off the NTriples baggage, the "<"...">" and "_:". >>> Maybe: >>> >>> o.id = sha1hex(s.id+" "+p.id) >>> >>> You'd have to evaluate these in the right order, but that would be the >>> order you parsed them from Turtle, so it should be fast and >>> deterministic. Hm. Can we avoid the repeated hashing? I suppose >>> so, with some kind of path expression: >>> >>> Given: <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> 444 ] ) ] ) ] >>> >>> The subject of <p4> 444 would have its id determined by hashing >>> (s1, p1, p2, 4, p3, 3). >>> >>> (I'm treating integer list positions as if they were properties.) With >>> a little cleverness, caching those intermediate hashes, I think that >>> could run almost as fast as computing a hash of the turtle file (ie >>> very fast). >>> >>> Wow, this might work. Very good idea, David. Beyond the fact >>> that it only works on this kind of graph, does anyone see any problem >>> with it? >>> >>> >> Hmmm , suppose a blank node can be identified by more than one path, >> then the hashed path would not be unique and so you couldn't use them to >> compare blank nodes >> > > If the algorithm deterministically chose which path to use in generating > the hash -- say, the first alphabetically -- then it would always generate > the same hashes for the same (isomorphic) graphs. Even if the graph > changed slightly, such an algorithm could often choose the same path, so it > would be fairly robust to minor changes in the graph, which is a very > desirable property. And in the cases where it did choose a different path, > you would end up with two URIs for what should have been considered the > same node, and owl:sameAs could be used to indicate their equivalence, once > their equivalence is detected through application-specific means. > > David > >
Received on Wednesday, 1 October 2014 15:04:51 UTC