Re: deterministic naming of blank nodes from Alan Ruttenberg on 2014-10-01 (semantic-web@w3.org from October 2014)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Wed, 1 Oct 2014 11:03:51 -0400
To: David Booth <david@dbooth.org>
Cc: Tim Berners-Lee <timbl@w3.org>, Sandro Hawke <sandro@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <CAFKQJ8mH197XDYDydiHfWD+8ZOhASETCh5qe6XoG3LNPTqYxwA@mail.gmail.com>
Wouldn't deterministic naming of blank nodes mean there was an effective
solution to the graph isomorphism problem, which is known to at least be
outside P?

http://mathworld.wolfram.com/GraphIsomorphism.html

-Alan


On Fri, Sep 26, 2014 at 6:48 PM, David Booth <david@dbooth.org> wrote:

> On 09/26/2014 06:20 PM, Tim Berners-Lee wrote:
>
>>
>> On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org
>> <mailto:sandro@w3.org>> wrote:
>>
>>  On 09/23/2014 05:34 PM, David Booth wrote:
>>>
>>>> BTW, I want to draw attention to the fact that the need for defining
>>>> an RDF-specific PATCH operation is *entirely* a consequence of RDF's
>>>> allowance of unrestricted blank nodes.  I do not think that blank
>>>> nodes should be eliminated from RDF, but I am convinced that RDF's
>>>> current treatment of blank nodes is a significant design flaw that
>>>> has *many* downstream effects that are ultimately detrimental to
>>>> RDF's adoption. The need for RDF PATCH is another example.
>>>>
>>>> Unix/linux diff and patch utilities have been used successfully for
>>>> *decades*, with many other information representations. Imagine how
>>>> simple and easy it would be if we could just generate canonical
>>>> N-Triples and use standard diff and patch against that! But we can't,
>>>> because blank nodes are unstable across RDF serializations and no
>>>> canonical way to generate them has been standardized.  This, in turn
>>>> is because generating a canonical form of unrestricted RDF is a hard
>>>> problem (NP-complete), because of blank nodes.  The problem is *much*
>>>> easier if the use of blank nodes is limited to *implicit* blank nodes
>>>> -- those that are generated implicitly by the use of square brackets
>>>> "[]" or parentheses "()" for lists in Turtle -- and indeed this is
>>>> the vast majority of blank node use.  (See "Everything You Always
>>>> Wanted to Know About Blank Nodes", by Hogan, Arenas, Mallea and
>>>> Polleres:
>>>> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/
>>>> 387
>>>> )
>>>>
>>>> For this reason the use of "Well Behaved RDF" was proposed, which
>>>> limits the use of blank nodes to implicit blank nodes:
>>>> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
>>>> I don't know if Well Behaved RDF is the best solution to this
>>>> problem. Maybe someone will come along with a better idea.  But I am
>>>> convinced that the current treatment of blank nodes in RDF is a
>>>> serious problem that we should fix in order to make RDF simpler to
>>>> use, understand and adopt.
>>>>
>>>> I really don't like having to make excuses for RDF when it cannot be
>>>> used in a similar way as nearly every other information
>>>> representation -- such as being able to easily compare two RDF
>>>> documents for "equality" (which in RDF becomes a complex graph
>>>> isomorphism problem) or generate a simple diff and patch -- all
>>>> because of RDF's unrestricted treatment of blank nodes.
>>>>
>>>> Clearly this is not something that the Linked Data Platform working
>>>> group can fix.  But I think it is important to bring it to people's
>>>> attention, in the hope that we will someday soon have the creativity
>>>> and gumption to fix it.
>>>>
>>>> I should also acknowledge that there are some who do not feel that
>>>> RDF's treatment of blank nodes is a problem.  Fine.  It may not be a
>>>> problem to an elite few who are well steeped in the subtleties of
>>>> description logic, model theory and RDF Semantics, and who don't mind
>>>> having to use RDF-specific tools instead of generic tools.  But
>>>> having tried for over 10 years to explain RDF to a wider audience of
>>>> regular software developers, I am convinced that subtleties like
>>>> RDF's treatment of blank nodes *are* a problem to a much wider
>>>> audience of *potential* RDF users who would be more inclined to adopt
>>>> RDF if it didn't have complexities like this. As it is they are more
>>>> likely to stick with JSON or XML, whose complexities they already
>>>> know, rather than venturing into the obscure and esoteric world of RDF.
>>>>
>>>> RDF tools are not as mature as those for XML or even JSON, which is
>>>> much younger than RDF.  I believe blank nodes are one specific reason
>>>> they're not.  The fact that we still don't even have a simple,
>>>> standard way to compare RDF documents and compute diffs and patches,
>>>> is a perfect example.
>>>>
>>>> David
>>>>
>>>>
>>>>
>>> I agree that it makes sense to have good terminology for graphs that
>>> can be serialized in Turtle without blank node labels, and perhaps to
>>> focus diff on these nice graphs.    (When I clicked on your
>>> Well-Behaved RDF link, my PDF viewer remembered I was on page six. :-)
>>>  )
>>>
>>> How would you name the blank nodes when serializing this in n-triples
>>> or n-quads?   I guess given triple S P O, where O is a blank node,
>>> you'd name O based on the hash of the names of S and P?   Something like
>>>
>>>     o.asNTriplesTerm = "_:a"+sha1hex(s.asNTriplesTerm+"
>>>     "+p.asNTriplesTerm)?
>>>
>>>
>>> Or maybe leave off the NTriples baggage, the "<"...">" and "_:".
>>> Maybe:
>>>
>>>     o.id = sha1hex(s.id+" "+p.id)
>>>
>>> You'd have to evaluate these in the right order, but that would be the
>>> order you parsed them from Turtle, so it should be fast and
>>> deterministic.   Hm.   Can we avoid the repeated hashing?   I suppose
>>> so, with some kind of path expression:
>>>
>>>     Given:  <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> 444 ] ) ] ) ]
>>>
>>>     The subject of <p4> 444 would have its id determined by hashing
>>>     (s1, p1, p2, 4, p3, 3).
>>>
>>> (I'm treating integer list positions as if they were properties.) With
>>> a little cleverness, caching those intermediate hashes, I think that
>>> could run almost as fast as computing a hash of the turtle file (ie
>>> very fast).
>>>
>>> Wow, this might work.    Very good idea, David.    Beyond the fact
>>> that it only works on this kind of graph, does anyone see any problem
>>> with it?
>>>
>>>
>> Hmmm , suppose a blank node can be identified by more than one path,
>> then the hashed path would not be unique and so you couldn't use them to
>> compare blank nodes
>>
>
> If the algorithm deterministically chose which path to use in generating
> the hash -- say, the first alphabetically -- then it would always generate
> the same hashes for the same (isomorphic) graphs.  Even if the graph
> changed slightly, such an algorithm could often choose the same path, so it
> would be fairly robust to minor changes in the graph, which is a very
> desirable property.  And in the cases where it did choose a different path,
> you would end up with two URIs for what should have been considered the
> same node, and owl:sameAs could be used to indicate their equivalence, once
> their equivalence is detected through application-specific means.
>
> David
>
>
Received on Wednesday, 1 October 2014 15:04:51 UTC