Re: Well Behaved RDF - Taming Blank Nodes, etc.

On Tue, 2012-12-18 at 23:06 +0700, Ivan Shmakov wrote:
[ . . . ]
>  But perhaps even more compelling reason to use blank nodes is
>  that instead of introducing owl:sameAs arcs, one may just
>  replace two (or more) distinct blank nodes, — found to be
>  representing the same entity, — with a sole node possessing the
>  union of the properties of such blank nodes.  (Provided we check
>  for, and resolve, any semantic conflicts there are, that is.)

You can do the exact same thing with URIs: You can replace :x, :y and :z
with :x, and give :x the union of the properties that the three of them
had.  Of course, you would later only be able to refer to that node
using the name :x -- no longer :y or :z -- but with blank nodes you
cannot refer to the node at all from outside the graph anyway, so you
still have not lost anything more than you'd lose by using blank nodes.
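
To make that concrete, here is a minimal sketch in plain Python (no RDF
library), assuming a graph is simply a set of (subject, predicate, object)
string tuples.  The helper name merge_nodes and the sample properties
(:name, :owner, :owns) are made up for illustration only.

```python
def merge_nodes(graph, keep, aliases):
    """Return a new graph in which every alias is rewritten to `keep`.

    Because the graph is a set, triples that become identical after
    renaming collapse automatically, which yields the union of properties.
    """
    def rename(term):
        return keep if term in aliases else term

    # Rewrite both subject and object positions so incoming arcs follow too.
    return {(rename(s), p, rename(o)) for (s, p, o) in graph}


# Hypothetical sample data: :x, :y and :z are known to be the same entity.
graph = {
    (":x", ":name", '"Rex"'),
    (":y", ":color", ":black"),
    (":z", ":owner", ":alice"),
    (":alice", ":owns", ":y"),
}

merged = merge_nodes(graph, ":x", {":y", ":z"})
for triple in sorted(merged):
    print(triple)
```

The same function works unchanged if the terms happen to be blank node
labels, which is the point: the merge itself does not require blank nodes.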

But aside from that, there is still a bigger problem.  If you have
out-of-band information about the blank nodes (e.g., perhaps you knew
how they were generated, and you know that certain properties are
inverse functional -- unique keys for them), then you may be able to
merge blank nodes as you describe.  But if you don't, then it isn't so
easy to determine whether those blank nodes represent the same entity.
Do _:b1 and _:b2 denote the same dog in the following RDF?

  _:b1 a :Dog ; :color :black .

  _:b2 a :Dog ; :color :black .

Without out-of-band information, and without knowing what
other statements may have been made about _:b1 and _:b2 in the graph, it
is *impossible* to know.  And even when you do know what other
statements have been made, it is still a difficult graph problem.  It is
basically the problem of determining whether the graph is "lean"
http://www.w3.org/TR/rdf-mt/#deflean
which is an NP-complete problem:
http://www.dcc.uchile.cl/~cgutierr/papers/revisedRDF.pdf 
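
For the easier case, where you do have out-of-band knowledge, a rough
sketch of key-based merging might look like the following.  It assumes a
graph is a set of (subject, predicate, object) string tuples and that a
hypothetical property :tagNumber is known to be inverse functional; both
the property and the helper name merge_by_key are invented for this
example.  It also assumes each node has at most one key value, so it does
not handle chains of merges.

```python
def merge_by_key(graph, key_property):
    """Merge subjects that share a value for key_property.

    key_property is ASSUMED to be inverse functional (a unique key);
    that knowledge has to come from outside the graph.
    """
    canonical = {}  # key value -> first subject (in sorted order) having it
    rename = {}     # subject -> canonical subject
    for s, p, o in sorted(graph):
        if p == key_property:
            rename[s] = canonical.setdefault(o, s)
    # Nodes without the key are left untouched: nothing justifies merging them.
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in graph}


# Hypothetical data: _:b1 and _:b2 share a tag number, _:b3 has none.
graph = {
    ("_:b1", ":tagNumber", '"1234"'),
    ("_:b1", ":color", ":black"),
    ("_:b2", ":tagNumber", '"1234"'),
    ("_:b2", ":name", '"Rex"'),
    ("_:b3", ":color", ":black"),
}

merged = merge_by_key(graph, ":tagNumber")
for triple in sorted(merged):
    print(triple)
```

Note that _:b3 stays separate even though it is also black: without a key,
the sketch has no basis for deciding whether it is the same dog.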

In contrast, if I had:

  :d1 a :Dog ; :color :black .

  :d1 a :Dog ; :color :black .

then it is trivially easy to know that those statements are about the
same dog -- it's the same URI! -- and getting rid of the redundant
statements is trivially easy.

I have found this to be a significant problem in practice when the same
RDF data is read into a triple store more than once.  If it contains
blank nodes then *every* time the graph is loaded I get more duplicate
blank nodes!  (I.e., the graph becomes more and more non-lean, like with
the example above involving _:b1 and _:b2 .)  If I'm not very careful,
those duplicate nodes and triples will cause my queries to return
duplicate results -- wrong counts.  In contrast, when URIs are used
instead, and the graph is read into a triple store more than once, I do
not have that problem.
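
The effect can be simulated in a few lines of plain Python.  This is only
a toy model, not any particular triple store's API: the store is a set of
string tuples, and the hypothetical load function gives every blank node
label a fresh identity on each load, which is how blank node labels scoped
to a single document generally behave.

```python
import itertools

_fresh = itertools.count(1)  # source of store-wide unique blank node ids


def load(store, triples):
    """Add triples to the store, renaming blank nodes freshly per load."""
    mapping = {}  # blank node label in this document -> fresh store label

    def term(t):
        if t.startswith("_:"):
            if t not in mapping:
                mapping[t] = f"_:g{next(_fresh)}"
            return mapping[t]
        return t  # URIs keep their identity across loads

    for s, p, o in triples:
        store.add((term(s), p, term(o)))


def count_black_dogs(store):
    """Count distinct subjects that are typed :Dog and colored :black."""
    dogs = {s for (s, p, o) in store if p == "rdf:type" and o == ":Dog"}
    black = {s for (s, p, o) in store if p == ":color" and o == ":black"}
    return len(dogs & black)


with_bnode = [("_:b1", "rdf:type", ":Dog"), ("_:b1", ":color", ":black")]
with_uri = [(":d1", "rdf:type", ":Dog"), (":d1", ":color", ":black")]

bnode_store, uri_store = set(), set()
for _ in range(2):  # load the same data twice
    load(bnode_store, with_bnode)
    load(uri_store, with_uri)

print(count_black_dogs(bnode_store))  # blank nodes: one more dog per load
print(count_black_dogs(uri_store))    # URIs: still a single dog
```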

You might wonder why the same RDF graph would be loaded into a triple
store more than once, but when you are merging lots of data from various
sources, it is very easy for that to happen.  For example, if file X
contains a merge of files A, B and C, and file Y contains a merge of
files C, D and E -- potentially created by some other person or process
or at some other time -- then you'll get C twice when you merge X and Y.


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.

Received on Wednesday, 19 December 2012 05:02:41 UTC