Re: deterministic naming of blank nodes from Tobias Kuhn on 2014-10-02 (semantic-web@w3.org from October 2014)

From: Tobias Kuhn <kuhntobias@gmail.com>
Date: Thu, 02 Oct 2014 09:23:23 +0200
To: Alan Ruttenberg <alanruttenberg@gmail.com>, David Booth <david@dbooth.org>, Tim Berners-Lee <timbl@w3.org>, Sandro Hawke <sandro@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
CC: Michel Dumontier <michel.dumontier@gmail.com>
Message-ID: <542CFD6B.4050209@gmail.com>
Dear all,

I was wondering whether simply providing a standardized (but not fully 
deterministic) method for skolemization could solve the current 
practical problems. I have the feeling that relying on skolemization 
(involving some kind of hashes) is more elegant and more practical than 
defining new subsets of RDF or adding auxiliary triples to graphs (as 
some other canonicalization approaches do). People could then be 
encouraged to skolemize all their blank nodes before publishing a file, 
and applications dependent on canonicalization could accept only RDF 
documents without blank nodes (but possibly including skolem-URIs). We 
proposed something of that kind at this year's ESWC [1]. I am not 
claiming that our concrete proposal is general enough for the problems 
discussed here, but it might be an interesting direction.

Tobias

[1] Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts 
for Linked Data. http://arxiv.org/pdf/1401.5775.pdf


On 02.10.2014 00:03, Alan Ruttenberg wrote:
>
>
> On Wed, Oct 1, 2014 at 5:09 PM, David Booth <david@dbooth.org
> <mailto:david@dbooth.org>> wrote:
>
>     Hi Alan,
>
>     Yes, in the general case of *unrestricted* blank node use,
>     deterministic naming of blank nodes would be as difficult as the
>     graph isomorphism problem.  That is exactly the motivation behind
>     "Well Behaved RDF"[1]: to modestly restrict the use of blank nodes
>     to avoid that complexity problem, while retaining the most important
>     and common uses of blank nodes.  Doing so would change the
>     complexity from a graph isomorphism problem to a tree problem,
>     enabling deterministic naming of blank nodes to be practical.
>
>
> That may be true, but it is hard for me to see how any benefit this
> could bring would outweigh the absolute pain in the ass it would be for
> everyone to change their RDF stacks. I'll scan back through the message
> to see why you think solving this problem is so important, but I can't
> come up with a reason off the top of my head.
>
> -Alan
>
>
>     David
>
>     1. Well Behaved RDF:
>     http://dbooth.org/2013/well-__behaved-rdf/Booth-well-__behaved-rdf.pdf
>     <http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf>
>
>     On 10/01/2014 11:03 AM, Alan Ruttenberg wrote:
>
>         Wouldn't deterministic naming of blank nodes mean there was an
>         effective
>         solution to the graph isomorphism problem, which is known to at
>         least be
>         outside P?
>
>         http://mathworld.wolfram.com/__GraphIsomorphism.html
>         <http://mathworld.wolfram.com/GraphIsomorphism.html>
>
>         -Alan
>
>
>         On Fri, Sep 26, 2014 at 6:48 PM, David Booth <david@dbooth.org
>         <mailto:david@dbooth.org>
>         <mailto:david@dbooth.org <mailto:david@dbooth.org>>> wrote:
>
>              On 09/26/2014 06:20 PM, Tim Berners-Lee wrote:
>
>
>                  On 2014-09 -24, at 00:37, Sandro Hawke <sandro@w3.org
>         <mailto:sandro@w3.org>
>                  <mailto:sandro@w3.org <mailto:sandro@w3.org>>
>                  <mailto:sandro@w3.org <mailto:sandro@w3.org>
>         <mailto:sandro@w3.org <mailto:sandro@w3.org>>>> wrote:
>
>                      On 09/23/2014 05:34 PM, David Booth wrote:
>
>                          BTW, I want to draw attention to the fact that
>         the need
>                          for defining
>                          an RDF-specific PATCH operation is *entirely* a
>                          consequence of RDF's
>                          allowance of unrestricted blank nodes.  I do
>         not think
>                          that blank
>                          nodes should be eliminated from RDF, but I am
>         convinced
>                          that RDF's
>                          current treatment of blank nodes is a
>         significant design
>                          flaw that
>                          has *many* downstream effects that are ultimately
>                          detrimental to
>                          RDF's adoption. The need for RDF PATCH is
>         another example.
>
>                          Unix/linux diff and patch utilities have been used
>                          successfully for
>                          *decades*, with many other information
>         representations.
>                          Imagine how
>                          simple and easy it would be if we could just
>         generate
>                          canonical
>                          N-Triples and use standard diff and patch
>         against that!
>                          But we can't,
>                          because blank nodes are unstable across RDF
>                          serializations and no
>                          canonical way to generate them has been
>         standardized.
>                          This, in turn
>                          is because generating a canonical form of
>         unrestricted
>                          RDF is a hard
>                          problem (NP-complete), because of blank nodes.  The
>                          problem is *much*
>                          easier if the use of blank nodes is limited to
>                          *implicit* blank nodes
>                          -- those that are generated implicitly by the
>         use of
>                          square brackets
>                          "[]" or parentheses "()" for lists in Turtle -- and
>                          indeed this is
>                          the vast majority of blank node use.  (See
>         "Everything
>                          You Always
>                          Wanted to Know About Blank Nodes", by Hogan,
>         Arenas,
>                          Mallea and
>                          Polleres:
>         http://www.__websemanticsjourn__al.org/index.__php/ps/article/__viewFile/365/__387
>         <http://websemanticsjournal.org/index.__php/ps/article/viewFile/365/__387>
>
>         <http://www.__websemanticsjournal.org/index.__php/ps/article/viewFile/365/__387
>         <http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387>>
>                          )
>
>                          For this reason the use of "Well Behaved RDF" was
>                          proposed, which
>                          limits the use of blank nodes to implicit blank
>         nodes:
>         http://dbooth.org/2013/well-____behaved-rdf/Booth-well-____behaved-rdf.pdf
>         <http://dbooth.org/2013/well-__behaved-rdf/Booth-well-__behaved-rdf.pdf>
>
>
>         <http://dbooth.org/2013/well-__behaved-rdf/Booth-well-__behaved-rdf.pdf
>         <http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf>>
>                          I don't know if Well Behaved RDF is the best
>         solution to
>                          this
>                          problem. Maybe someone will come along with a
>         better
>                          idea.  But I am
>                          convinced that the current treatment of blank
>         nodes in
>                          RDF is a
>                          serious problem that we should fix in order to
>         make RDF
>                          simpler to
>                          use, understand and adopt.
>
>                          I really don't like having to make excuses for
>         RDF when
>                          it cannot be
>                          used in a similar way as nearly every other
>         information
>                          representation -- such as being able to easily
>         compare
>                          two RDF
>                          documents for "equality" (which in RDF becomes
>         a complex
>                          graph
>                          isomorphism problem) or generate a simple diff
>         and patch
>                          -- all
>                          because of RDF's unrestricted treatment of
>         blank nodes.
>
>                          Clearly this is not something that the Linked Data
>                          Platform working
>                          group can fix.  But I think it is important to
>         bring it
>                          to people's
>                          attention, in the hope that we will someday
>         soon have
>                          the creativity
>                          and gumption to fix it.
>
>                          I should also acknowledge that there are some
>         who do not
>                          feel that
>                          RDF's treatment of blank nodes is a problem.
>         Fine.  It
>                          may not be a
>                          problem to an elite few who are well steeped in the
>                          subtleties of
>                          description logic, model theory and RDF
>         Semantics, and
>                          who don't mind
>                          having to use RDF-specific tools instead of generic
>                          tools.  But
>                          having tried for over 10 years to explain RDF
>         to a wider
>                          audience of
>                          regular software developers, I am convinced that
>                          subtleties like
>                          RDF's treatment of blank nodes *are* a problem
>         to a much
>                          wider
>                          audience of *potential* RDF users who would be more
>                          inclined to adopt
>                          RDF if it didn't have complexities like this.
>         As it is
>                          they are more
>                          likely to stick with JSON or XML, whose
>         complexities
>                          they already
>                          know, rather than venturing into the obscure and
>                          esoteric world of RDF.
>
>                          RDF tools are not as mature as those for XML or
>         even
>                          JSON, which is
>                          much younger than RDF.  I believe blank nodes
>         are one
>                          specific reason
>                          they're not.  The fact that we still don't even
>         have a
>                          simple,
>                          standard way to compare RDF documents and
>         compute diffs
>                          and patches,
>                          is a perfect example.
>
>                          David
>
>
>
>                      I agree that it makes sense to have good
>         terminology for
>                      graphs that
>                      can be serialized in Turtle without blank node
>         labels, and
>                      perhaps to
>                      focus diff on these nice graphs.    (When I clicked
>         on your
>                      Well-Behaved RDF link, my PDF viewer remembered I
>         was on
>                      page six. :-)   )
>
>                      How would you name the blank nodes when serializing
>         this in
>                      n-triples
>                      or n-quads?   I guess given triple S P O, where O
>         is a blank
>                      node,
>                      you'd name O based on the hash of the names of S and P?
>                        Something like
>
>                           o.asNTriplesTerm =
>         "_:a"+sha1hex(s.____asNTriplesTerm+"
>                           "+p.asNTriplesTerm)?
>
>
>                      Or maybe leave off the NTriples baggage, the
>         "<"...">" and
>                      "_:".    Maybe:
>
>         o.id <http://o.id> <http://o.id> = sha1hex(s.id <http://s.id>
>         <http://s.id>+" "+p.id <http://p.id>
>                      <http://p.id>)
>
>
>                      You'd have to evaluate these in the right order,
>         but that
>                      would be the
>                      order you parsed them from Turtle, so it should be
>         fast and
>                      deterministic.   Hm.   Can we avoid the repeated
>         hashing?
>                        I suppose
>                      so, with some kind of path expression:
>
>                           Given:  <s1> <p1> [ <p2> ( 1 2 3 [ <p3> ( 11
>         12 [ <p4>
>                      444 ] ) ] ) ]
>
>                           The subject of <p4> 444 would have its id
>         determined by
>                      hashing
>                           (s1, p1, p2, 4, p3, 3).
>
>                      (I'm treating integer list positions as if they were
>                      properties.) With
>                      a little cleverness, caching those intermediate
>         hashes, I
>                      think that
>                      could run almost as fast as computing a hash of the
>         turtle
>                      file (ie
>                      very fast).
>
>                      Wow, this might work.    Very good idea, David.
>         Beyond
>                      the fact
>                      that it only works on this kind of graph, does
>         anyone see
>                      any problem
>                      with it?
>
>
>                  Hmmm , suppose a blank node can be identified by more
>         than one path,
>                  then the hashed path would not be unique and so you
>         couldn't use
>                  them to
>                  compare blank nodes
>
>
>              If the algorithm deterministically chose which path to use in
>              generating the hash -- say, the first alphabetically -- then it
>              would always generate the same hashes for the same (isomorphic)
>              graphs.  Even if the graph changed slightly, such an
>         algorithm could
>              often choose the same path, so it would be fairly robust to
>         minor
>              changes in the graph, which is a very desirable property.
>         And in
>              the cases where it did choose a different path, you would
>         end up
>              with two URIs for what should have been considered the same
>         node,
>              and owl:sameAs could be used to indicate their equivalence,
>         once
>              their equivalence is detected through application-specific
>         means.
>
>              David
>
>
>
Received on Thursday, 2 October 2014 10:43:30 UTC