Re: deterministic naming of blank nodes from David Booth on 2014-10-01 (semantic-web@w3.org from October 2014)

From: David Booth <david@dbooth.org>
Date: Wed, 01 Oct 2014 17:09:46 -0400
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: Tim Berners-Lee <timbl@w3.org>, Sandro Hawke <sandro@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <542C6D9A.9020505@dbooth.org>
Hi Alan,

Yes, in the general case of *unrestricted* blank node use, deterministic 
naming of blank nodes would be as difficult as the graph isomorphism 
problem.  That is exactly the motivation behind "Well Behaved RDF"[1]: 
to modestly restrict the use of blank nodes to avoid that complexity 
problem, while retaining the most important and common uses of blank 
nodes.  Doing so would change the complexity from a graph isomorphism 
problem to a tree problem, enabling deterministic naming of blank nodes 
to be practical.

David

1. Well Behaved RDF:
http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf

On 10/01/2014 11:03 AM, Alan Ruttenberg wrote:
> Wouldn't deterministic naming of blank nodes mean there was an effective
> solution to the graph isomorphism problem, which is known to at least be
> outside P?
>
> http://mathworld.wolfram.com/GraphIsomorphism.html
>
> -Alan
>
>
> On Fri, Sep 26, 2014 at 6:48 PM, David Booth <david@dbooth.org
> <mailto:david@dbooth.org>> wrote:
>
>     On 09/26/2014 06:20 PM, Tim Berners-Lee wrote:
>
>
>         On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org
>         <mailto:sandro@w3.org>
>         <mailto:sandro@w3.org <mailto:sandro@w3.org>>> wrote:
>
>             On 09/23/2014 05:34 PM, David Booth wrote:
>
>                 BTW, I want to draw attention to the fact that the need
>                 for defining
>                 an RDF-specific PATCH operation is *entirely* a
>                 consequence of RDF's
>                 allowance of unrestricted blank nodes.  I do not think
>                 that blank
>                 nodes should be eliminated from RDF, but I am convinced
>                 that RDF's
>                 current treatment of blank nodes is a significant design
>                 flaw that
>                 has *many* downstream effects that are ultimately
>                 detrimental to
>                 RDF's adoption. The need for RDF PATCH is another example.
>
>                 Unix/linux diff and patch utilities have been used
>                 successfully for
>                 *decades*, with many other information representations.
>                 Imagine how
>                 simple and easy it would be if we could just generate
>                 canonical
>                 N-Triples and use standard diff and patch against that!
>                 But we can't,
>                 because blank nodes are unstable across RDF
>                 serializations and no
>                 canonical way to generate them has been standardized.
>                 This, in turn
>                 is because generating a canonical form of unrestricted
>                 RDF is a hard
>                 problem (NP-complete), because of blank nodes.  The
>                 problem is *much*
>                 easier if the use of blank nodes is limited to
>                 *implicit* blank nodes
>                 -- those that are generated implicitly by the use of
>                 square brackets
>                 "[]" or parentheses "()" for lists in Turtle -- and
>                 indeed this is
>                 the vast majority of blank node use.  (See "Everything
>                 You Always
>                 Wanted to Know About Blank Nodes", by Hogan, Arenas,
>                 Mallea and
>                 Polleres:
>                 http://www.__websemanticsjournal.org/index.__php/ps/article/viewFile/365/__387
>                 <http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387>
>                 )
>
>                 For this reason the use of "Well Behaved RDF" was
>                 proposed, which
>                 limits the use of blank nodes to implicit blank nodes:
>                 http://dbooth.org/2013/well-__behaved-rdf/Booth-well-__behaved-rdf.pdf
>                 <http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf>
>                 I don't know if Well Behaved RDF is the best solution to
>                 this
>                 problem. Maybe someone will come along with a better
>                 idea.  But I am
>                 convinced that the current treatment of blank nodes in
>                 RDF is a
>                 serious problem that we should fix in order to make RDF
>                 simpler to
>                 use, understand and adopt.
>
>                 I really don't like having to make excuses for RDF when
>                 it cannot be
>                 used in a similar way as nearly every other information
>                 representation -- such as being able to easily compare
>                 two RDF
>                 documents for "equality" (which in RDF becomes a complex
>                 graph
>                 isomorphism problem) or generate a simple diff and patch
>                 -- all
>                 because of RDF's unrestricted treatment of blank nodes.
>
>                 Clearly this is not something that the Linked Data
>                 Platform working
>                 group can fix.  But I think it is important to bring it
>                 to people's
>                 attention, in the hope that we will someday soon have
>                 the creativity
>                 and gumption to fix it.
>
>                 I should also acknowledge that there are some who do not
>                 feel that
>                 RDF's treatment of blank nodes is a problem.  Fine.  It
>                 may not be a
>                 problem to an elite few who are well steeped in the
>                 subtleties of
>                 description logic, model theory and RDF Semantics, and
>                 who don't mind
>                 having to use RDF-specific tools instead of generic
>                 tools.  But
>                 having tried for over 10 years to explain RDF to a wider
>                 audience of
>                 regular software developers, I am convinced that
>                 subtleties like
>                 RDF's treatment of blank nodes *are* a problem to a much
>                 wider
>                 audience of *potential* RDF users who would be more
>                 inclined to adopt
>                 RDF if it didn't have complexities like this. As it is
>                 they are more
>                 likely to stick with JSON or XML, whose complexities
>                 they already
>                 know, rather than venturing into the obscure and
>                 esoteric world of RDF.
>
>                 RDF tools are not as mature as those for XML or even
>                 JSON, which is
>                 much younger than RDF.  I believe blank nodes are one
>                 specific reason
>                 they're not.  The fact that we still don't even have a
>                 simple,
>                 standard way to compare RDF documents and compute diffs
>                 and patches,
>                 is a perfect example.
>
>                 David
>
>
>
>             I agree that it makes sense to have good terminology for
>             graphs that
>             can be serialized in Turtle without blank node labels, and
>             perhaps to
>             focus diff on these nice graphs.    (When I clicked on your
>             Well-Behaved RDF link, my PDF viewer remembered I was on
>             page six. :-)   )
>
>             How would you name the blank nodes when serializing this in
>             n-triples
>             or n-quads?   I guess given triple S P O, where O is a blank
>             node,
>             you'd name O based on the hash of the names of S and P?
>               Something like
>
>                  o.asNTriplesTerm = "_:a"+sha1hex(s.__asNTriplesTerm+"
>                  "+p.asNTriplesTerm)?
>
>
>             Or maybe leave off the NTriples baggage, the "<"...">" and
>             "_:".    Maybe:
>
>             o.id <http://o.id> = sha1hex(s.id <http://s.id>+" "+p.id
>             <http://p.id>)
>
>             You'd have to evaluate these in the right order, but that
>             would be the
>             order you parsed them from Turtle, so it should be fast and
>             deterministic.   Hm.   Can we avoid the repeated hashing?
>               I suppose
>             so, with some kind of path expression:
>
>                  Given:  <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4>
>             444 ] ) ] ) ]
>
>                  The subject of <p4> 444 would have its id determined by
>             hashing
>                  (s1, p1, p2, 4, p3, 3).
>
>             (I'm treating integer list positions as if they were
>             properties.) With
>             a little cleverness, caching those intermediate hashes, I
>             think that
>             could run almost as fast as computing a hash of the turtle
>             file (ie
>             very fast).
>
>             Wow, this might work.    Very good idea, David.    Beyond
>             the fact
>             that it only works on this kind of graph, does anyone see
>             any problem
>             with it?
>
>
>         Hmmm , suppose a blank node can be identified by more than one path,
>         then the hashed path would not be unique and so you couldn't use
>         them to
>         compare blank nodes
>
>
>     If the algorithm deterministically chose which path to use in
>     generating the hash -- say, the first alphabetically -- then it
>     would always generate the same hashes for the same (isomorphic)
>     graphs.  Even if the graph changed slightly, such an algorithm could
>     often choose the same path, so it would be fairly robust to minor
>     changes in the graph, which is a very desirable property.  And in
>     the cases where it did choose a different path, you would end up
>     with two URIs for what should have been considered the same node,
>     and owl:sameAs could be used to indicate their equivalence, once
>     their equivalence is detected through application-specific means.
>
>     David
>
>
Received on Wednesday, 1 October 2014 21:10:16 UTC