Re: deterministic naming of blank nodes from Sandro Hawke on 2014-10-07 (semantic-web@w3.org from October 2014)

From: Sandro Hawke <sandro@w3.org>
Date: Tue, 07 Oct 2014 13:38:28 -0400
To: Tim Berners-Lee <timbl@w3.org>
CC: David Booth <david@dbooth.org>, SW-forum Web <semantic-web@w3.org>
Message-ID: <54342514.7090406@w3.org>
On 09/26/2014 06:20 PM, Tim Berners-Lee wrote:
>
> On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org 
> <mailto:sandro@w3.org>> wrote:
>
>> On 09/23/2014 05:34 PM, David Booth wrote:
>>> BTW, I want to draw attention to the fact that the need for defining 
>>> an RDF-specific PATCH operation is *entirely* a consequence of RDF's 
>>> allowance of unrestricted blank nodes.  I do not think that blank 
>>> nodes should be eliminated from RDF, but I am convinced that RDF's 
>>> current treatment of blank nodes is a significant design flaw that 
>>> has *many* downstream effects that are ultimately detrimental to 
>>> RDF's adoption. The need for RDF PATCH is another example.
>>>
>>> Unix/linux diff and patch utilities have been used successfully for 
>>> *decades*, with many other information representations.  Imagine how 
>>> simple and easy it would be if we could just generate canonical 
>>> N-Triples and use standard diff and patch against that!  But we 
>>> can't, because blank nodes are unstable across RDF serializations 
>>> and no canonical way to generate them has been standardized.  This, 
>>> in turn is because generating a canonical form of unrestricted RDF 
>>> is a hard problem (NP-complete), because of blank nodes.  The 
>>> problem is *much* easier if the use of blank nodes is limited to 
>>> *implicit* blank nodes -- those that are generated implicitly by the 
>>> use of square brackets "[]" or parentheses "()" for lists in Turtle 
>>> -- and indeed this is the vast majority of blank node use.  (See 
>>> "Everything You Always Wanted to Know About Blank Nodes", by Hogan, 
>>> Arenas, Mallea and Polleres:
>>> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387 
>>> )
>>>
>>> For this reason the use of "Well Behaved RDF" was proposed, which 
>>> limits the use of blank nodes to implicit blank nodes:
>>> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
>>> I don't know if Well Behaved RDF is the best solution to this 
>>> problem. Maybe someone will come along with a better idea.  But I am 
>>> convinced that the current treatment of blank nodes in RDF is a 
>>> serious problem that we should fix in order to make RDF simpler to 
>>> use, understand and adopt.
>>>
>>> I really don't like having to make excuses for RDF when it cannot be 
>>> used in a similar way as nearly every other information 
>>> representation -- such as being able to easily compare two RDF 
>>> documents for "equality" (which in RDF becomes a complex graph 
>>> isomorphism problem) or generate a simple diff and patch -- all 
>>> because of RDF's unrestricted treatment of blank nodes.
>>>
>>> Clearly this is not something that the Linked Data Platform working 
>>> group can fix.  But I think it is important to bring it to people's 
>>> attention, in the hope that we will someday soon have the creativity 
>>> and gumption to fix it.
>>>
>>> I should also acknowledge that there are some who do not feel that 
>>> RDF's treatment of blank nodes is a problem. Fine.  It may not be a 
>>> problem to an elite few who are well steeped in the subtleties of 
>>> description logic, model theory and RDF Semantics, and who don't 
>>> mind having to use RDF-specific tools instead of generic tools.  But 
>>> having tried for over 10 years to explain RDF to a wider audience of 
>>> regular software developers, I am convinced that subtleties like 
>>> RDF's treatment of blank nodes *are* a problem to a much wider 
>>> audience of *potential* RDF users who would be more inclined to 
>>> adopt RDF if it didn't have complexities like this.  As it is they 
>>> are more likely to stick with JSON or XML, whose complexities they 
>>> already know, rather than venturing into the obscure and esoteric 
>>> world of RDF.
>>>
>>> RDF tools are not as mature as those for XML or even JSON, which is 
>>> much younger than RDF.  I believe blank nodes are one specific 
>>> reason they're not.  The fact that we still don't even have a 
>>> simple, standard way to compare RDF documents and compute diffs and 
>>> patches, is a perfect example.
>>>
>>> David
>>>
>>>
>>
>> I agree that it makes sense to have good terminology for graphs that 
>> can be serialized in Turtle without blank node labels, and perhaps to 
>> focus diff on these nice graphs. (When I clicked on your Well-Behaved 
>> RDF link, my PDF viewer remembered I was on page six.  :-)   )
>>
>> How would you name the blank nodes when serializing this in n-triples 
>> or n-quads?   I guess given triple S P O, where O is a blank node, 
>> you'd name O based on the hash of the names of S and P?   Something like
>>
>>     o.asNTriplesTerm = "_:a"+sha1hex(s.asNTriplesTerm+"
>>     "+p.asNTriplesTerm)?
>>
>>
>> Or maybe leave off the NTriples baggage, the "<"...">" and "_:".    
>> Maybe:
>>
>>     o.id = sha1hex(s.id+" "+p.id)
>>
>> You'd have to evaluate these in the right order, but that would be 
>> the order you parsed them from Turtle, so it should be fast and 
>> deterministic.   Hm.   Can we avoid the repeated hashing?   I suppose 
>> so, with some kind of path expression:
>>
>>     Given:  <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> 444 ] ) ] ) ]
>>
>>     The subject of <p4> 444 would have its id determined by hashing
>>     (s1, p1, p2, 4, p3, 3).
>>
>> (I'm treating integer list positions as if they were properties.)   
>> With a little cleverness, caching those intermediate hashes, I think 
>> that could run almost as fast as computing a hash of the turtle file 
>> (ie very fast).
>>
>> Wow, this might work.    Very good idea, David.    Beyond the fact 
>> that it only works on this kind of graph, does anyone see any problem 
>> with it?
>>
>
> Hmmm , suppose a blank node can be identified by more than one path, 
> then the hashed path would not be unique and so you couldn't use them 
> to compare blank nodes

My quick analysis is that every blank node will be identified by exactly 
one forward path, given David Booth's proposed constraint that graphs 
can be represented in Turtle without blank node labels.

Thinking this through:

- Blank nodes only get introduced by [ ] and ( ).

- I think it's clear ( ) is syntactic sugar for [ ] with 
rdf:first/rest/nil.   ( 1 2 ) == [ rdf:first 1; rdf:rest [ rdf:first 2; 
rdf:rest rdf:nil] ] etc.

- With [ ] only allowed to appear in the object position, it's either:
   <s> <p> [ ... ]

in which case the path is (s,p)

- Or it's inside a [ ], like:
   ... [ ...; <p2> [ ... ]; ... ] ...

in which case the path is the path to the outer [ ], with p2 appended.

So, that makes me think there's exactly one path to each blank node in 
this constrained situation.

         -- Sandro

>
>>       -- Sandro
>>
>
Received on Tuesday, 7 October 2014 17:38:36 UTC