- From: David Booth <david@dbooth.org>
- Date: Wed, 19 Dec 2012 00:02:12 -0500
- To: Ivan Shmakov <oneingray@gmail.com>
- Cc: semantic-web@w3.org
On Tue, 2012-12-18 at 23:06 +0700, Ivan Shmakov wrote:
[ . . . ]
> But perhaps even more compelling reason to use blank nodes is
> that instead of introducing owl:sameAs arcs, one may just
> replace two (or more) distinct blank nodes, found to be
> representing the same entity, with a sole node possessing the
> union of the properties of such blank nodes.  (Provided we check
> for, and resolve, any semantic conflicts there are, that is.)

You can do the exact same thing with URIs: you can replace :x, :y and
:z with :x, and give :x the union of the properties that the three of
them had.  Of course, you would later only be able to refer to that
node using the name :x -- no longer :y or :z -- but with blank nodes
you cannot refer to the node at all from outside the graph anyway, so
you have not lost anything more than you would lose by using blank
nodes.

But aside from that, there is still a bigger problem.  If you have
out-of-band information about the blank nodes (e.g., perhaps you know
how they were generated, and you know that certain properties are
inverse functional -- unique keys for them), then you may be able to
merge blank nodes as you describe.  But if you don't, then it isn't so
easy to determine whether those blank nodes represent the same entity.
Do _:b1 and _:b2 denote the same dog in the following RDF?

  _:b1 a :Dog ; :color :black .
  _:b2 a :Dog ; :color :black .

Without out-of-band information, and without knowing what other
statements may have been made about _:b1 and _:b2 in the graph, it is
*impossible* to know.  And even when you do know what other statements
have been made, it is still a difficult graph problem.  It is basically
the problem of determining whether the graph is "lean"
http://www.w3.org/TR/rdf-mt/#deflean
which is an NP-complete problem:
http://www.dcc.uchile.cl/~cgutierr/papers/revisedRDF.pdf

In contrast, if I had:

  :d1 a :Dog ; :color :black .
  :d1 a :Dog ; :color :black .

then it is trivially easy to know that those statements are about the
same dog -- it's the same URI! -- and getting rid of the redundant
statements is just as easy.

I have found this to be a significant problem in practice when the same
RDF data is read into a triple store more than once.  If it contains
blank nodes, then *every* time the graph is loaded I get more duplicate
blank nodes!  (I.e., the graph becomes more and more non-lean, as in
the example above involving _:b1 and _:b2.)  If I'm not very careful,
those duplicate nodes and triples will cause my queries to return
duplicate results -- wrong counts.  In contrast, when URIs are used
instead, and the graph is read into a triple store more than once, I do
not have that problem.

You might wonder why the same RDF graph would be loaded into a triple
store more than once, but when you are merging lots of data from
various sources, it is very easy for that to happen.  For example, if
file X contains a merge of files A, B and C, and file Y contains a
merge of files C, D and E -- potentially created by some other person
or process or at some other time -- then you'll get C twice when you
merge X and Y.

-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not
necessarily reflect those of his employer.
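
For anyone who wants to see the reload problem concretely, here is a
minimal Python sketch.  It is only an illustration under stated
assumptions, not how any particular triple store works: the "store" is
a plain Python set of (subject, predicate, object) tuples, and the
helper names load() and count_dogs() are made up for this example.
The one behaviour it models is that blank node labels are scoped to
the document being loaded, so each load gives them fresh identifiers,
while URIs keep their identity.

```python
import itertools

# Assumption: a toy triple store is just a set of (s, p, o) string tuples.
_fresh_ids = itertools.count()

BLANK_DATA = [("_:b1", "rdf:type", ":Dog"), ("_:b1", ":color", ":black")]
URI_DATA = [(":d1", "rdf:type", ":Dog"), (":d1", ":color", ":black")]


def load(store, triples):
    """Add triples to the store, renaming blank nodes freshly per load.

    Hypothetical helper: blank node labels are local to the input, so
    every call maps them to new store-wide labels.  URIs pass through.
    """
    mapping = {}

    def rename(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = "_:g%d" % next(_fresh_ids)
            return mapping[term]
        return term

    for s, p, o in triples:
        store.add((rename(s), p, rename(o)))


def count_dogs(store):
    """Count distinct subjects typed as :Dog (hypothetical query)."""
    return len({s for s, p, o in store if p == "rdf:type" and o == ":Dog"})


blank_store = set()
load(blank_store, BLANK_DATA)
load(blank_store, BLANK_DATA)   # same data loaded a second time

uri_store = set()
load(uri_store, URI_DATA)
load(uri_store, URI_DATA)       # same data loaded a second time

print(count_dogs(blank_store))  # 2: one dog per load
print(count_dogs(uri_store))    # 1: set semantics absorb the reload
```

With URIs the second load is a no-op because the triples are already
in the set.  With blank nodes the store cannot tell that the second
load describes the same dog, so the count grows with every reload.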
Received on Wednesday, 19 December 2012 05:02:41 UTC