Re: deterministic naming of blank nodes from Alan Ruttenberg on 2014-10-08 (semantic-web@w3.org from October 2014)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Wed, 8 Oct 2014 10:23:25 +0200
To: Sandro Hawke <sandro@w3.org>
Cc: Tim Berners-Lee <timbl@w3.org>, David Booth <david@dbooth.org>, SW-forum Web <semantic-web@w3.org>
Message-ID: <CAFKQJ8=AdXo_sq0kscMbh834WC6ntesX5_6W-dTfvNKKj3Xhng@mail.gmail.com>
Having a unique path isn't enough.

s p [].
s p [].

Both paths are (s p). The graph is non-lean, but it is still a valid graph.
Or:

s p [p [p a]; p [p b]]

Here there is one (s,p) path and two (s,p,p) paths.

This shows you need to pay attention to the p,o when the s is [], too.
I think this is the idea of witness in the Hogan, Arenas, Mallea and
Polleres paper.
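
A rough Python sketch of that point, under assumptions of my own: each [] is modelled as a list of (predicate, object) pairs, and the candidate label hashes the incoming path together with the node's outgoing content. The helper names (content_sig, label, labels) are made up for illustration, and this is not the algorithm from the Hogan et al. paper.

```python
import hashlib

def sha1hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def content_sig(node):
    """Hash of what a blank node says: its sorted (predicate, object) pairs."""
    if isinstance(node, str):          # IRI or literal: use the term itself
        return node
    parts = sorted(p + " " + content_sig(o) for p, o in node)
    return sha1hex("[" + ";".join(parts) + "]")

def label(path, node):
    """Candidate label: incoming path plus outgoing content."""
    return sha1hex(" ".join(path) + " | " + content_sig(node))

def labels(triples):
    """Collect a label for every blank node reachable from the top-level triples."""
    out = []
    def walk(path, node):
        if isinstance(node, str):
            return
        out.append(label(path, node))
        for p, o in node:
            walk(path + (p,), o)
    for s, p, o in triples:
        walk((s, p), o)
    return out

# s p [p [p a]; p [p b]]
nested = [("s", "p", [("p", [("p", "a")]), ("p", [("p", "b")])])]
# s p []. s p [].
non_lean = [("s", "p", []), ("s", "p", [])]

nested_labels = labels(nested)
non_lean_labels = labels(non_lean)
print(len(set(nested_labels)), "distinct labels for", len(nested_labels), "nodes")
print(len(set(non_lean_labels)), "distinct labels for", len(non_lean_labels), "nodes")
```

In this sketch the two inner nodes of the nested example get different labels because their p,o differ, while the two empty nodes of the non-lean example remain indistinguishable.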

On Tue, Oct 7, 2014 at 7:38 PM, Sandro Hawke <sandro@w3.org> wrote:

>  On 09/26/2014 06:20 PM, Tim Berners-Lee wrote:
>
>
>  On 2014-09-24, at 00:37, Sandro Hawke <sandro@w3.org> wrote:
>
>  On 09/23/2014 05:34 PM, David Booth wrote:
>
> BTW, I want to draw attention to the fact that the need for defining an
> RDF-specific PATCH operation is *entirely* a consequence of RDF's allowance
> of unrestricted blank nodes.  I do not think that blank nodes should be
> eliminated from RDF, but I am convinced that RDF's current treatment of
> blank nodes is a significant design flaw that has *many* downstream effects
> that are ultimately detrimental to RDF's adoption. The need for RDF PATCH
> is another example.
>
> Unix/linux diff and patch utilities have been used successfully for
> *decades*, with many other information representations.  Imagine how simple
> and easy it would be if we could just generate canonical N-Triples and use
> standard diff and patch against that!  But we can't, because blank nodes
> are unstable across RDF serializations and no canonical way to generate
> them has been standardized.  This, in turn, is because generating a
> canonical form of unrestricted RDF is a hard problem (NP-complete), because
> of blank nodes.  The problem is *much* easier if the use of blank nodes is
> limited to *implicit* blank nodes -- those that are generated implicitly by
> the use of square brackets "[]" or parentheses "()" for lists in Turtle --
> and indeed this is the vast majority of blank node use.  (See "Everything
> You Always Wanted to Know About Blank Nodes", by Hogan, Arenas, Mallea and
> Polleres:
> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387
> )
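
To make the diff idea above concrete, here is a small Python sketch that only covers the easy case: ground triples with no blank nodes. The function name canonical_lines and the example.org data are made up, and "canonical" here just means sorted, de-duplicated lines, which is an assumption rather than any standardized form.

```python
import difflib

def canonical_lines(ntriples_text):
    """Sort and de-duplicate N-Triples lines (ground triples only)."""
    lines = {line.strip() for line in ntriples_text.splitlines() if line.strip()}
    return sorted(lines)

old = """<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
"""
new = """<http://example.org/b> <http://example.org/p> "2" .
<http://example.org/a> <http://example.org/p> "3" .
"""

# Statement order differs between the two inputs, but sorting hides that,
# so only the real change shows up in the diff.
diff = list(difflib.unified_diff(canonical_lines(old), canonical_lines(new),
                                 "old", "new", lineterm=""))
print("\n".join(diff))
```

Once blank node labels appear in the lines, the same graph can serialize to different text, and this line-based approach stops working.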
>
> For this reason the use of "Well Behaved RDF" was proposed, which limits
> the use of blank nodes to implicit blank nodes:
> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
> I don't know if Well Behaved RDF is the best solution to this problem.
> Maybe someone will come along with a better idea.  But I am convinced that
> the current treatment of blank nodes in RDF is a serious problem that we
> should fix in order to make RDF simpler to use, understand and adopt.
>
> I really don't like having to make excuses for RDF when it cannot be used
> in a similar way as nearly every other information representation -- such
> as being able to easily compare two RDF documents for "equality" (which in
> RDF becomes a complex graph isomorphism problem) or generate a simple diff
> and patch -- all because of RDF's unrestricted treatment of blank nodes.
>
> Clearly this is not something that the Linked Data Platform working group
> can fix.  But I think it is important to bring it to people's attention, in
> the hope that we will someday soon have the creativity and gumption to fix
> it.
>
> I should also acknowledge that there are some who do not feel that RDF's
> treatment of blank nodes is a problem.  Fine.  It may not be a problem to
> an elite few who are well steeped in the subtleties of description logic,
> model theory and RDF Semantics, and who don't mind having to use
> RDF-specific tools instead of generic tools.  But having tried for over 10
> years to explain RDF to a wider audience of regular software developers, I
> am convinced that subtleties like RDF's treatment of blank nodes *are* a
> problem to a much wider audience of *potential* RDF users who would be more
> inclined to adopt RDF if it didn't have complexities like this.  As it is
> they are more likely to stick with JSON or XML, whose complexities they
> already know, rather than venturing into the obscure and esoteric world of
> RDF.
>
> RDF tools are not as mature as those for XML or even JSON, which is much
> younger than RDF.  I believe blank nodes are one specific reason they're
> not.  The fact that we still don't even have a simple, standard way to
> compare RDF documents and compute diffs and patches, is a perfect example.
>
> David
>
>
>
> I agree that it makes sense to have good terminology for graphs that can
> be serialized in Turtle without blank node labels, and perhaps to focus
> diff on these nice graphs.    (When I clicked on your Well-Behaved RDF
> link, my PDF viewer remembered I was on page six.  :-)   )
>
> How would you name the blank nodes when serializing this in n-triples or
> n-quads?   I guess given triple S P O, where O is a blank node, you'd name
> O based on the hash of the names of S and P?   Something like
>
> o.asNTriplesTerm = "_:a"+sha1hex(s.asNTriplesTerm+" "+p.asNTriplesTerm)?
>
>
> Or maybe leave off the NTriples baggage, the "<"...">" and "_:".    Maybe:
>
> o.id = sha1hex(s.id+" "+p.id)
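
The second formula could be sketched in Python roughly as follows. The sha1hex helper is spelled out with hashlib, the name name_blank_object is invented for illustration, and plain strings such as "s1" stand in for whatever id a real parser would hold.

```python
import hashlib

def sha1hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def name_blank_object(s_id, p_id):
    """Derive an id for a blank-node object from its subject and predicate ids."""
    return sha1hex(s_id + " " + p_id)

# <s1> <p1> [ <p2> [ <p3> 1 ] ], evaluated outermost first so that the
# inner node can use the id already assigned to the outer node.
outer = name_blank_object("s1", "p1")
inner = name_blank_object(outer, "p2")
print("_:a" + outer)
print("_:a" + inner)
```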
>
> You'd have to evaluate these in the right order, but that would be the
> order you parsed them from Turtle, so it should be fast and
> deterministic.   Hm.   Can we avoid the repeated hashing?   I suppose so,
> with some kind of path expression:
>
> Given:  <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11 12 [ <p4> 444 ] ) ] ) ]
>
> The subject of <p4> 444 would have its id determined by hashing (s1, p1,
> p2, 4, p3, 3).
>
>  (I'm treating integer list positions as if they were properties.)   With
> a little cleverness, caching those intermediate hashes, I think that could
> run almost as fast as computing a hash of the turtle file (ie very fast).
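
One way the caching could look in Python, as a sketch only: hashlib hash objects can be copied, so the state reached after a path prefix can be kept and extended one step at a time. The extend helper, the zero-byte separator and the cache layout are assumptions made for illustration. Note that this hashes the whole path once, rather than hashing a hash as in the o.id formula above.

```python
import hashlib

def extend(state, step):
    """Return a new hash state for the path extended by one step."""
    child = state.copy()              # reuse the work done for the prefix
    child.update(b"\x00" + str(step).encode("utf-8"))
    return child

# Path to the subject of <p4> 444 in the example above; list positions
# are treated like properties.
path = ["s1", "p1", "p2", 4, "p3", 3]

# Cache the state after every prefix, so sibling nodes share prefix work.
cache = {(): hashlib.sha1()}
for i, step in enumerate(path):
    cache[tuple(path[:i + 1])] = extend(cache[tuple(path[:i])], step)

node_id = cache[tuple(path)].hexdigest()

# A sibling (list position 2 instead of 3) reuses the cached prefix state.
sibling_id = extend(cache[("s1", "p1", "p2", 4, "p3")], 2).hexdigest()
print(node_id)
print(sibling_id)
```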
>
> Wow, this might work.    Very good idea, David.    Beyond the fact that it
> only works on this kind of graph, does anyone see any problem with it?
>
>
>  Hmmm, suppose a blank node can be identified by more than one path;
> then the hashed path would not be unique, and so you couldn't use it to
> compare blank nodes.
>
>
> My quick analysis is that every blank node will be identified by exactly
> one forward path, given David Booth's proposed constraint that graphs can
> be represented in Turtle without blank node labels.
>
> Thinking this through:
>
> - Blank nodes only get introduced by [ ] and ( ).
>
> - I think it's clear ( ) is syntactic sugar for [ ] with
> rdf:first/rest/nil.   ( 1 2 ) == [ rdf:first 1; rdf:rest [ rdf:first 2;
> rdf:rest rdf:nil] ] etc.
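
The ( ) expansion described above can be sketched in Python like this. The expand_list function and the fresh label generator are invented names, and list items are plain strings here rather than parsed RDF terms.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def expand_list(items, fresh):
    """Return (head, triples) for a Turtle collection written as ( items... )."""
    if not items:
        return RDF + "nil", []        # the empty list is just rdf:nil
    cells = [fresh() for _ in items]  # one blank node per list cell
    triples = []
    for i, (cell, item) in enumerate(zip(cells, items)):
        rest = cells[i + 1] if i + 1 < len(cells) else RDF + "nil"
        triples.append((cell, RDF + "first", item))
        triples.append((cell, RDF + "rest", rest))
    return cells[0], triples

counter = itertools.count()
fresh = lambda: "_:b%d" % next(counter)

# ( 1 2 )
head, triples = expand_list(["1", "2"], fresh)
for t in triples:
    print(t)
```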
>
> - With [ ] only allowed to appear in the object position, it's either:
>   <s> <p> [ ... ]
>
> in which case the path is (s,p)
>
> - Or it's inside a [ ], like:
>   ... [ ...; <p2> [ ... ]; ... ] ...
>
> in which case the path is the path to the outer [ ], with p2 appended.
>
> So, that makes me think there's exactly one path to each blank node in
> this constrained situation.
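
The two cases above can be written as a short recursive walk. This is a sketch under assumptions: each [ ] is modelled as a Python list of (predicate, object) pairs, and blank_node_paths is a made-up name. It only enumerates the forward paths. It does not by itself show that they are distinct, and the reply at the top of this message gives graphs where two nodes share a path.

```python
def blank_node_paths(prefix, pairs):
    """Yield the forward path to every [ ] found under the given prefix."""
    for predicate, obj in pairs:
        if isinstance(obj, list):             # a [ ... ] in object position
            path = prefix + (predicate,)      # outer path with predicate appended
            yield path
            yield from blank_node_paths(path, obj)

# <s> <p> [ <p2> [ <p3> "x" ]; <p4> "y" ]
doc = [("p", [("p2", [("p3", "x")]), ("p4", "y")])]
paths = list(blank_node_paths(("s",), doc))
print(paths)
```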
>
>         -- Sandro
>
>
>        -- Sandro
>
>
>
>
Received on Wednesday, 8 October 2014 08:24:24 UTC