Re: Well Behaved RDF - Taming Blank Nodes, etc. from David Booth on 2012-12-19 (semantic-web@w3.org from December 2012)

From: David Booth <david@dbooth.org>
Date: Wed, 19 Dec 2012 14:09:32 -0500
To: Pat Hayes <phayes@ihmc.us>
Cc: semantic-web@w3.org
Message-ID: <1355944172.2229.46100.camel@dbooth-laptop>

Hi Pat,

Sorry, I slightly botched that explanation, because I left out a
critical element.  :(  I'll clarify below, because there really is an
important practical point that I was trying to make.

On Tue, 2012-12-18 at 21:27 -0800, Pat Hayes wrote:
> David, this is a silly argument:
> 
> On Dec 18, 2012, at 9:02 PM, David Booth wrote:
> 
> > On Tue, 2012-12-18 at 23:06 +0700, Ivan Shmakov wrote:
> > [ . . . ]
> > But aside from that, there is still a bigger problem.  If you have
> > out-of-band information about the blank nodes (e.g., perhaps you knew
> > how they were generated, and you know that certain properties are
> > inverse functional -- unique keys for them), then you may be able to
> > merge blank nodes as you describe.  But if you don't, then it isn't so
> > easy to determine whether those blank nodes represent the same entity.
> > Do _:b1 and _:b2 denote the same dog in the following RDF?
> > 
> >  _:b1 a :Dog ; :color :black .
> > 
> >  _:b2 a :Dog ; :color :black .
> > 
> > Without having out-of-band information, and without knowing what
> > other statements may have been made about _:b1 and _:b2 in the graph, it
> > is *impossible* to know.  
> 
> Indeed. But if the second _:b2 had been _:b1, then you would know they
> were the same. 
> 
> > And even when you do know what other
> > statements have been made, it is still a difficult graph problem.  It is
> > basically the problem of determining whether the graph is "lean"
> > http://www.w3.org/TR/rdf-mt/#deflean
> > which is an NP-complete problem:
> > http://www.dcc.uchile.cl/~cgutierr/papers/revisedRDF.pdf 
> > 
> > In contrast, if I had:
> > 
> >  :d1 a :Dog ; :color :black .
> > 
> >  :d1 a :Dog ; :color :black .
> > 
> > then it is trivially easy to know that those statements are about the
> > same dog -- it's the same URI! --
> 
> Just like it would be the same bnode if you had used the same bnodeID.
> But if your second :d1 were :d2, just as your second bnodeID was
> different from your first, then (just as with the bnodes) you would
> not know if these two URIs co-referred or not. 
> 
> URIs and bnodes are EXACTLY similar in this regard. There is no
> "contrast" here to be in.

What I neglected to mention in my example above was *why* there was
another line describing _:b2, and *why* there were two lines
describing :d1.  (Sorry!)

Certainly, if those _:b1 and _:b2 lines had been generated at once, in a
single graph, then _:b1 could have been used twice like this:

  _:b1 a :Dog ; :color :black .
  _:b1 a :Dog ; :color :black .

and it would have been just as trivially easy to know that they refer to
the same dog as if URIs had been used like this:

  :d1 a :Dog ; :color :black . 
  :d1 a :Dog ; :color :black .

What I neglected to stipulate was that the *reason* those "extra" lines
occur is that they have been added from different source graphs, or from
the same source graph being loaded twice.  When the statement:

  _:d a :Dog ; :color :black .

is read into a triple store twice (i.e., as two graph loads), it
necessarily becomes, in effect:

  _:b1 a :Dog ; :color :black . 
  _:b2 a :Dog ; :color :black .

because blank node labels are unstable: they are local to each load and
are not preserved across loads.  Whereas when the statement:

  :d1 a :Dog ; :color :black .

is read in twice (as two graph loads), it is effectively still:

  :d1 a :Dog ; :color :black . 
  :d1 a :Dog ; :color :black .
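
To make the difference concrete, here is a minimal Python sketch of a
toy triple store.  It is not any real store's API: `ToyStore` and
`load` are hypothetical names, and the only behavior modeled is the
assumption that blank node labels get fresh names on every load while
URIs pass through unchanged.

```python
import itertools


class ToyStore:
    """Toy in-memory triple store: a set of (s, p, o) string tuples."""

    def __init__(self):
        self.triples = set()
        self._counter = itertools.count(1)

    def load(self, triples):
        """Add one graph; blank node labels are scoped to this load only."""
        renamed = {}  # per-load map: label in the document -> fresh label

        def term(t):
            # Blank nodes get a fresh store-wide label; URIs are kept as-is.
            if t.startswith("_:"):
                if t not in renamed:
                    renamed[t] = "_:b%d" % next(self._counter)
                return renamed[t]
            return t

        for s, p, o in triples:
            self.triples.add((term(s), p, term(o)))


bnode_graph = [("_:d", "rdf:type", ":Dog"), ("_:d", ":color", ":black")]
uri_graph = [(":d1", "rdf:type", ":Dog"), (":d1", ":color", ":black")]

bnode_store = ToyStore()
bnode_store.load(bnode_graph)
bnode_store.load(bnode_graph)  # same graph, second load

uri_store = ToyStore()
uri_store.load(uri_graph)
uri_store.load(uri_graph)  # same graph, second load

print(len(bnode_store.triples))  # 4: two dogs, _:b1 and _:b2
print(len(uri_store.triples))    # 2: still one dog, :d1
```

The blank node version grows with every reload, while the URI version
collapses back to the same two triples because a graph is a set.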

I have run into this problem quite a lot when blank nodes are used in
the common idiom of representing n-ary relations, which is a convenient
and natural use for them.  For example, I might indicate that Alice at a
particular time had a temperature of 101.3F (omitting units for
brevity):

  :alice :temperature [ 
    rdf:value 101.3 ; :time "2005-02-28T09:35:27Z"^^xsd:dateTime ] .

But when this is loaded into a triple store, it is treated as:

  :alice :temperature _:b1 .
  _:b1 rdf:value 101.3 ; :time "2005-02-28T09:35:27Z"^^xsd:dateTime .

And if it is loaded twice, I get "duplicates" that appear to be
different measurements (i.e., a non-lean graph):

  :alice :temperature _:b1 .
  _:b1 rdf:value 101.3 ; :time "2005-02-28T09:35:27Z"^^xsd:dateTime .
  :alice :temperature _:b2 .
  _:b2 rdf:value 101.3 ; :time "2005-02-28T09:35:27Z"^^xsd:dateTime .

which causes real practical difficulties when I query.  In particular,
my queries become more complex because I have to specifically filter out
measurements that have the same :time value.  Not fun, and not
efficient, especially when this same pattern appears in multiple places
in the data.  
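
As a rough illustration of that extra filtering, written in Python
rather than as a real SPARQL query: the rows below are hypothetical
results for :alice from a store that loaded the graph twice, and
`distinct_by_time` is a made-up helper.  It assumes that the :time
value alone identifies a measurement for a given subject, which is
out-of-band knowledge that the data itself does not state.

```python
# Hypothetical query results: (blank node label, rdf:value, :time).
rows = [
    ("_:b1", 101.3, "2005-02-28T09:35:27Z"),
    ("_:b2", 101.3, "2005-02-28T09:35:27Z"),
]


def distinct_by_time(rows):
    """Keep the first row seen for each :time value, dropping the rest."""
    seen = set()
    result = []
    for node, value, time in rows:
        if time not in seen:
            seen.add(time)
            result.append((value, time))
    return result


print(distinct_by_time(rows))
```

Every query that touches this pattern needs an equivalent step, and
the step is only correct as long as the uniqueness assumption holds.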

On the other hand, if I had used a URI instead, such as this:

  :alice :temperature :t-alice-20050228T093527Z .
  :t-alice-20050228T093527Z rdf:value 101.3 ; 
       :time "2005-02-28T09:35:27Z"^^xsd:dateTime .

then I have no such problem.  Note that in this example I have
specifically minted this URI in such a way as to achieve two goals by
encoding the measurement time into the URI: (a) avoid accidentally
minting the same URI for different measurements; and (b) avoid
accidentally minting *different* URIs for the *same* measurement.  This
is less convenient to write than using a blank node, but it
substantially reduces downstream difficulties as the data is merged and
queried.  
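
A minimal sketch of that minting scheme in Python follows.  The
namespace `http://example.org/` and the function name
`mint_measurement_uri` are assumptions for illustration; the local
name follows the t-<subject>-<timestamp> shape used in the example
above.  It also assumes at most one temperature measurement per
subject per second, and that the caller passes a timezone-aware
datetime.

```python
from datetime import datetime, timezone

BASE = "http://example.org/"  # hypothetical namespace for the ":" prefix


def mint_measurement_uri(subject, when):
    """Build a stable URI from the subject's local name and the UTC time.

    The same (subject, when) always yields the same URI, and a different
    subject or a different second yields a different URI.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "%st-%s-%s" % (BASE, subject, stamp)


when = datetime(2005, 2, 28, 9, 35, 27, tzinfo=timezone.utc)
print(mint_measurement_uri("alice", when))
# http://example.org/t-alice-20050228T093527Z
```

Because the URI is a pure function of the subject and the time,
reloading or re-exporting the same measurement reproduces the same
URI rather than a fresh node.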

In other words, by carefully minting the URI this way, the original
author of the temperature measurement went to more trouble than it would
have been to merely use a blank node.  But it results in a significant
downstream payoff when the RDF data is used later.  This was Sandro's
point when he said that "blank nodes are a convenience for the content
provider and a burden on the content consumer":
http://lists.w3.org/Archives/Public/semantic-web/2011Mar/0068.html 



-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.
Received on Wednesday, 19 December 2012 19:10:02 UTC