Re: Skolemization and RDF Semantics from Richard Cyganiak on 2011-04-20 (public-rdf-wg@w3.org from April 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Wed, 20 Apr 2011 13:10:42 +0100
To: Pat Hayes <phayes@ihmc.us>
Cc: Steve Harris <steve.harris@garlik.com>, Dan Brickley <danbri@danbri.org>, David Wood <dpw@talis.com>, "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
Message-Id: <1B328BC0-499E-47C2-8A8D-01CDF62B4F81@cyganiak.de>

Hi Pat,

On 17 Apr 2011, at 15:15, Pat Hayes wrote:
>> First of all, it is *sometimes* but not *always* bad to use blank nodes. The documents I linked to gave specific advice, informed by implementation experience, for when to use, and when to avoid, blank nodes.
> 
> True, but it does say that the fewer bnodes the better, as a general rule about all data.

Well, you gotta simplify when talking to the man on the street. A more accurate phrasing: Substituting a blank node with a URI never makes data less useful. (Assuming the blank node is actually used as a local name and not an existential variable, which is the case for all data published on the Web that I've ever seen, with the exception of blank nodes in rdf:Lists.)

Doing this substitution might be costly for the publisher of the data, especially if they'd like their URIs to be stable and persistent, but the claim was about usefulness. The increase in usefulness for consumers of the data may or may not make it cost-effective.


Regarding your questions below:

The typical scenario on the Web is that party A publishes some data as RDF on the Web. Now party B wants to use that data. For example, B might have some local data and they want to enrich this with the data from A, perhaps by loading graphs from A and from B into a single store. This requires that both *graphs* actually “join up” when loaded into the store, in all the places where common entities are described. This joining up of graphs is necessary because SPARQL and RDF APIs work on the graph level, not on the logic level. Things join up trivially if B uses URIs from A's data -- the graphs connect when merged. Inference (sameAs, IFP etc) can of course help here. But where that's not an option, if A uses blank nodes to make statements about entities of interest to B, then there is simply no way of making any of this work. A has made its data less useful to B.
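
To make the "joining up" concrete, here is a minimal sketch in Python. It assumes triples are modelled as plain tuples, that any string starting with `_:` is a blank node, and that a merge renames blank nodes apart per source graph. The helper names (`merge`, `subjects_with_both`) and the example URIs and properties are hypothetical and do not come from any particular RDF library.

```python
from itertools import count

_fresh = count()  # source of fresh blank node labels (illustrative only)


def is_bnode(term):
    """Assumption: blank nodes are strings with the '_:' prefix."""
    return isinstance(term, str) and term.startswith("_:")


def rename_bnodes(graph):
    """Copy a graph, giving each of its blank nodes a fresh label."""
    mapping = {}

    def rename(term):
        if is_bnode(term):
            if term not in mapping:
                mapping[term] = f"_:g{next(_fresh)}"
            return mapping[term]
        return term

    return {(rename(s), p, rename(o)) for (s, p, o) in graph}


def merge(*graphs):
    """Merge graphs; blank nodes from different graphs are kept apart."""
    merged = set()
    for g in graphs:
        merged |= rename_bnodes(g)
    return merged


def subjects_with_both(graph, p1, p2):
    """A two-pattern join on the subject: nodes having both properties."""
    with_p1 = {s for (s, p, o) in graph if p == p1}
    with_p2 = {s for (s, p, o) in graph if p == p2}
    return with_p1 & with_p2


# A names the person with a URI; B reuses that URI.
a_uri = {("http://a.example/alice", "foaf:name", "Alice")}
b_uri = {("http://a.example/alice", "ex:rating", 5)}

# A uses a blank node; B can at best repeat the label, which is graph-local.
a_bnode = {("_:a", "foaf:name", "Alice")}
b_bnode = {("_:a", "ex:rating", 5)}

joined_uri = subjects_with_both(merge(a_uri, b_uri), "foaf:name", "ex:rating")
joined_bnode = subjects_with_both(merge(a_bnode, b_bnode), "foaf:name", "ex:rating")
```

Under these assumptions `joined_uri` contains A's URI for Alice, while `joined_bnode` is empty: the two `_:a` labels are renamed to different nodes on merge, so the graph-level join finds nothing.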

The early FOAF practice of using blank nodes for people is an interesting case study here. The idea was to use IFP inference plus rdfs:seeAlso links to form a web and join the blank nodes, but this didn't work very well, because in practice this multi-step join process, including IFP inference, is much more brittle than directly using someone else's resolvable URI.

And try making an owl:sameAs link from your URI to someone else's blank node ...

This all is complicated further in cases where A's data is dynamic, and B wants to keep their local store updated as A's data changes.

It's complicated further in cases where A wants to allow other parties to make updates to its data. The update request needs to be serialized as text and sent over the wire at some point. It's reasonable to ask for a way of characterizing updates that is very close to the RDF data model. In ground RDF graphs, this is easy -- set operations between graphs. In the presence of blank nodes, this gets much more difficult. SPARQL Update helps a bit here, because you can SELECT the blank node to be UPDATEd, but this requires a lot of knowledge of the constraints of the data; otherwise one can accidentally write an underconstrained SELECT that matches too many resources, and update the wrong data. It is *very* hard to build a generic graph browser and editor based on SPARQL + SPARQL UPDATE that works correctly with blank-node-rich data, and it requires expensive queries to pinpoint one's position in the graph.
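
A rough sketch of the difference, again in plain Python with triples as tuples. This is set logic standing in for an update protocol, not SPARQL Update itself; `apply_update`, `match_subjects`, `replace_value` and all the data are hypothetical.

```python
def apply_update(graph, delete, insert):
    """Ground update: remove one set of triples, add another."""
    return (graph - delete) | insert


ground = {
    ("http://a.example/alice", "foaf:mbox", "mailto:alice@old.example"),
    ("http://a.example/bob", "foaf:mbox", "mailto:bob@a.example"),
}

# With URIs the update is two exact triple sets; nothing needs to be matched.
updated = apply_update(
    ground,
    delete={("http://a.example/alice", "foaf:mbox", "mailto:alice@old.example")},
    insert={("http://a.example/alice", "foaf:mbox", "mailto:alice@new.example")},
)

# With blank nodes the client cannot name the node; it has to describe it.
bnode_graph = {
    ("_:b1", "foaf:name", "Alice"),
    ("_:b1", "foaf:mbox", "mailto:alice@old.example"),
    ("_:b2", "foaf:name", "Alice"),
    ("_:b2", "foaf:mbox", "mailto:other.alice@a.example"),
}


def match_subjects(graph, predicate, obj):
    """Select nodes by a single pattern (possibly underconstrained)."""
    return {s for (s, p, o) in graph if p == predicate and o == obj}


def replace_value(graph, subjects, predicate, new_value):
    """Overwrite `predicate` on every selected subject."""
    kept = {(s, p, o) for (s, p, o) in graph
            if not (s in subjects and p == predicate)}
    return kept | {(s, predicate, new_value) for s in subjects}


# "The person named Alice" matches two nodes, so both mailboxes get overwritten.
targets = match_subjects(bnode_graph, "foaf:name", "Alice")
clobbered = replace_value(bnode_graph, targets, "foaf:mbox",
                          "mailto:alice@new.example")
```

In the ground case the request is fully determined by the triples it carries. In the blank node case the outcome depends on how tightly the selecting pattern happens to constrain the data, and here the second Alice loses her mailbox.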

Part of the problem is that real RDF systems often are built by plugging together different components -- stores, parsers, APIs, serializers, servers -- and any of them might rename blank nodes that pass through it at any time, since the specs say that this is ok. To add a triple to a blank node, I may have to hold on to its blank node identifier for some time before I can complete the operation. But in a system of multiple parts, it's really hard to be sure about the behaviour of these identifiers, and to understand the guarantees that different systems have with regard to their stability. Will a concurrent update change the identifier of the node I am holding? Will a version upgrade change all the blank node identifiers?
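
The identifier problem can be sketched the same way. Assume a hypothetical component `relabel` that renames blank nodes as data passes through it (which the specs permit), and a client that kept a label from an earlier read. None of these names refer to a real store or API.

```python
def relabel(graph, prefix):
    """Stand-in for any component that renames blank nodes in transit."""
    mapping = {}

    def rename(term):
        if isinstance(term, str) and term.startswith("_:"):
            # First sighting of a label allocates the next fresh one.
            return mapping.setdefault(term, f"_:{prefix}{len(mapping)}")
        return term

    return {(rename(s), p, rename(o)) for (s, p, o) in graph}


store = {("_:b1", "foaf:name", "Alice")}
held = "_:b1"  # label the client remembered from an earlier read

# The data passes through another component, e.g. a re-serialization.
reloaded = relabel(store, "n")

# The client now adds a triple using the stale label.
after = reloaded | {(held, "foaf:mbox", "mailto:alice@a.example")}
subjects = {s for (s, p, o) in after}
```

After the relabelling, `subjects` holds two distinct nodes: the name hangs off the renamed node and the mailbox off the stale label, so the intended "add a triple to this node" has silently created a second, disconnected node.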

These problems all wouldn't exist if A hadn't decided to use a blank node ...

(And yeah I know this is all just because uneducated engineers misunderstand how to correctly use blank nodes ... But unfortunately it *is* their problems that RDF has to solve to be relevant.)

This is why A *really* should think twice for every blank node they put into their data. And this is why I *really* want to see some guidance on this topic in the place where people are likely to read about blank nodes.

Best,
Richard



>> Given a triple _:a :bbb :ccc, it is not possible to author another triple _:a :xxx :yyy in another graph, the intention being that _:a is the same thing in both graphs. Given that the blank node label is arbitrary and cannot be assumed to be persistent, it is not possible to refer to the graph node from outside of the system where the graph originated.
> 
> I think you mean not possible to refer to the entity denoted by the blank node from outside, etc. To do that you have to give it a name, indeed. You can do this, if it is absolutely necessary, by adding
> _:a owl:sameAs <URI> .
> to the first graph and then using the URI outside. So it is possible when it needs to be done.
> 
>> Such outside reference to certain nodes is a requirement in a distributed system.
> 
> ...why? Surely it all depends on the node in question. Some things need to be publicly referable to, and these obviously should be given a URI. Others don't. The inner lists in an RDF collection used to encode some OWL syntax should never need to be referred to elsewhere, for example.
> 
>> 
>>> *Why* is data using them worse than data which does not?
>> 
>> Because it is difficult to augment data that uses blank nodes with further data. Because it requires stepping outside of the RDF data model in order to remotely modify or otherwise work with an RDF graph that uses blank nodes.
> 
> For the first point, see above. I don't follow the second point. **Of course** it is possible to modify RDF containing blank nodes, just as one can with ground RDF. An RDF graph is just a large data object, you can do whatever you want to it.  Can you be more precise about what exactly the problems are here?
> 
>>> Worse in what sense, exactly?
>> 
>> Worse in the sense that it imposes large, and often prohibitive, additional costs on users of the data, which usually is not in the best interest of the publishers of the data.
> 
> You have not yet convinced me why or how this is so. 
> 
>>> Which processes are made more difficult when blank nodes are present?
>> 
>> Referring to nodes in the graph from other data; storing persistent references to a graph node for later recall;
> 
> You can't refer to nodes in RDF at all. I think what you mean is, URIs allow one to refer to the same entity in different graphs, whereas bnodeIDs are local to the graph and so have no meaning outside the graph. True; but again, I don't see why this is a practical problem. What plausible processes would ever need to access a locally scoped ID? Can you give an example? 
> 
>> integrating RDF graphs from different sources
> 
> What bnode problem is encountered here? 
> 
>> ; hyperlinking between RDF graphs
> 
> Again, why do bnodes cause a problem with such linking? 
> 
>> ; updating and modifying RDF graphs;
> 
> And again, I do not see any reason why the presence of bnodes makes updating and modifying more difficult. 
> 
>> merging RDF graphs;
> 
> Well, yes, there is a cost here, but it is surely not high enough to warrant such a draconian rule. How often do such merges happen? And in such a case, what the spec should do, at most, is point out the cost, not recommend courses of action based on the presumed need to avoid it.
> 
>>> And so forth. If answers to such questions are available, then let us discuss them and publish them if we all agree, but even then only in an informative note, not as part of the spec. 
>> 
>> The purpose of a specification is to promote interoperability between implementations. Implementation advice and usage notes are an important part of that. What are you trying to achieve by objecting to the inclusion of such material into the specification?
> 
> I just want to make sure that this material is based in fact and not just a kind of folk rumor. Specifications have to last for years and be usable in a wider range of circumstances than their writers (us) can imagine. They have to pass a very high barrier of accuracy and precision, therefore.
Received on Wednesday, 20 April 2011 12:11:11 UTC