Re: Skolemization and RDF Semantics from Pat Hayes on 2011-04-20 (public-rdf-wg@w3.org from April 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Wed, 20 Apr 2011 17:07:34 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Steve Harris <steve.harris@garlik.com>, Dan Brickley <danbri@danbri.org>, David Wood <dpw@talis.com>, "RDF-WG public-rdf-wg@w3.org" <public-rdf-wg@w3.org>, Margaret Warren <info@margaretwarren.us>
Message-Id: <A491D755-50DA-4E39-8641-225B00BF1911@ihmc.us>
On Apr 20, 2011, at 7:10 AM, Richard Cyganiak wrote:

> Hi Pat,
> 
> On 17 Apr 2011, at 15:15, Pat Hayes wrote:
>>> First of all, it is *sometimes* but not *always* bad to use blank nodes. The documents I linked to gave specific advice, informed by implementation experience, for when to use, and when to avoid, blank nodes.
>> 
>> True, but it does say that the fewer bnodes the better, as a general rule about all data.
> 
> Well, you gotta simplify when talking to the man on the street. A more accurate phrasing: Substituting a blank node with a URI never makes data less useful. (Assuming the blank node is actually used as a local name and not an existential variable, which is the case for all data published on the Web that I've ever seen, with the exception of blank nodes in rdf:Lists.)

I don't think you need to add the qualification, actually. Just say, it never makes it less useful *when you can do it*.

> 
> Doing this substitution might be costly for the publisher of the data, especially if they'd like their URIs to be stable and persistent, but the claim was about usefulness. The increase in usefulness for consumers of the data may or may not make it cost-effective.
> 
> 
> Regarding your questions below:
> 
> The typical scenario on the Web is that party A publishes some data as RDF on the Web. Now party B wants to use that data. For example, B might have some local data and they want to enrich this with the data from A, perhaps by loading graphs from A and from B into a single store. This requires that both *graphs* actually “join up” when loaded into the store, in all the places where common entities are described. This joining up of graphs is necessary because SPARQL and RDF APIs work on the graph level, not on the logic level. Things join up trivially if B uses URIs from A's data -- the graphs connect when merged. Inference (sameAs, IFP etc) can of course help here. But where that's not an option, if A uses blank nodes to make statements about entities of interest to B, then there is simply no way of making any of this work. A has made its data less useful to B.
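
The joining-up behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are modelled as plain tuples of strings, terms starting with `_:` stand for blank nodes, and the URIs, prefixes and the `merge` and `describe` helpers are hypothetical, not part of any RDF library. The merge renames blank nodes per source graph, which is how an RDF merge keeps blank nodes from different graphs apart.

```python
# Triples are (subject, predicate, object) tuples of strings.
# Terms starting with "_:" are treated as blank nodes (an assumption of this sketch).

def is_bnode(term):
    return term.startswith("_:")

def merge(*graphs):
    """Merge graphs, standardizing blank nodes apart per source graph."""
    merged = set()
    for index, graph in enumerate(graphs):
        for triple in graph:
            merged.add(tuple(
                f"{term}_g{index}" if is_bnode(term) else term
                for term in triple
            ))
    return merged

def describe(graph, subject):
    """All (predicate, object) pairs recorded for one subject."""
    return {(p, o) for s, p, o in graph if s == subject}

# Hypothetical data: A names the person with a URI, B reuses that URI.
a_uri = {("http://a.example/alice", "foaf:name", '"Alice"')}
b_uri = {("http://a.example/alice", "ex:memberOf", "http://b.example/club")}
uri_facts = describe(merge(a_uri, b_uri), "http://a.example/alice")

# Same data, but both parties use a blank node labelled "_:p".
a_bnode = {("_:p", "foaf:name", '"Alice"')}
b_bnode = {("_:p", "ex:memberOf", "http://b.example/club")}
bnode_subjects = {s for s, _, _ in merge(a_bnode, b_bnode)}
```

With URIs, both statements end up attached to one node in the merged graph. With blank nodes, the merge yields two unrelated nodes even though the labels happened to match, so joining them would need extra inference such as IFPs or owl:sameAs.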

Of course. Nobody is suggesting *deliberately* using blank nodes when you could use a URI, just in order to make things difficult. But there are many cases where it makes perfect sense to use a blank node simply because there is no natural identifier available for the entity in question. For example, I am helping a company design a system to mark up art images using RDFa. I have a drawing of a reclining nude. How do I say this? I want to say the drawing depicts **a person** who is female, nude and in a reclining position. I have absolutely no idea who this person actually is, or even if there ever was a model for this drawing. It seems absolutely natural and correct for me to use a blank node here: the drawing depicts _:xx who has rdf:type :human and rdf:type :female, and so on. Obviously, I could coin a URI to denote this hypothetical model, but that URI would not convey anything that is not conveyed by the bnode, and it might well be interpreted to mean that I have information about the model (since I apparently have a name that 'identifies' her, which URIs are said to do); but of course I don't have any information about her, which is why the bnode is useful.
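
As a rough sketch of the drawing example, the description can be written as tuples and queried existentially in Python. Everything named here is an assumption made for illustration: the drawing URI, the `ex:` prefix, the `ex:pose` property and the `depicts_some` helper are hypothetical, and the tuple encoding is not a real RDF API.

```python
# Hypothetical identifiers; "_:xx" is the blank node for the unknown sitter.
drawing = "http://example.org/art/drawing-17"

graph = {
    (drawing, "ex:depicts", "_:xx"),
    ("_:xx", "rdf:type", "ex:Human"),
    ("_:xx", "rdf:type", "ex:Female"),
    ("_:xx", "ex:pose", "ex:Reclining"),
}

def depicts_some(graph, work, required_types):
    """True if the work depicts at least one node that has every required type."""
    depicted = {o for s, p, o in graph if s == work and p == "ex:depicts"}
    return any(
        all((node, "rdf:type", t) in graph for t in required_types)
        for node in depicted
    )

has_female_human = depicts_some(graph, drawing, {"ex:Human", "ex:Female"})
has_landscape = depicts_some(graph, drawing, {"ex:Landscape"})
```

The query succeeds without ever mentioning the label `_:xx`, which is the sense in which the blank node says "there is someone depicted" and nothing more.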

This kind of construction comes up in our project all the time; almost every artistic classification has a hidden existential in it somewhere (a landscape drawing is one that depicts **a** landscape, etc.). Are we to be obliged to invent URIs to refer to all these things? Every artwork will have a cloud of URIs surrounding it to refer to the people and places it might depict, the particular piece of paint that was used to make a particular mark, the particular composition line that it alone has, etc. What purpose is served by this proliferation of unresolvable URIs?

> The early FOAF practice of using blank nodes for people is an interesting case study here. The idea was to use IFP inference plus rdfs:seeAlso links to form a web and join the blank nodes, but this didn't work very well because this multi-step join process including IFP inference in practice is much more brittle than directly using someone else's resolvable URI.

Sure, when said URIs are available. But what about when they are not, and any invented ones will never be resolvable? (I am guessing that this early practice was adopted because at the time, most people did not have URIs. BTW, most people still don't have *resolvable* URIs. And the cases I have in mind are those where there will almost certainly never be a resolvable URI.)

> And try making an owl:sameAs link from your URI to someone else's blank node ...

Right, that does not fly :-)

> This all is complicated further in cases where A's data is dynamic, and B wants to keep their local store updated as A's data changes.
> 
> It's complicated further in cases where A wants to allow other parties to make updates to its data.

Well, OK, but really this has not been addressed by any of the RDF specs so far. We really ought to be talking about these updating issues in more depth, I think, if we are going to write anything about them. g-boxes are a good start.

> The update request needs to be serialized as a text and sent over the wire at some point. It's reasonable to ask for a way of characterizing updates that is very close to the RDF data model. In ground RDF graphs, this is easy -- set operations between graphs. In the presence of blank nodes, this gets much more difficult. SPARQL Update helps a bit here, because you can SELECT the blank node to be UPDATEd, but this requires a lot of knowledge of the constraints of the data, otherwise one can accidentally write an underconstrained SELECT that matches too many resources, and update the wrong data. It is *very* hard to build a generic graph browser and editor based on SPARQL + SPARQL UPDATE that works correctly with blank node rich data, and it requires expensive queries to pinpoint one's position in the graph.
> 
> Part of the problem is that real RDF systems often are built by plugging together different components -- stores, parsers, APIs, serializers, servers -- and any of them might rename blank nodes that pass through it at any time, since the specs say that this is ok.

No, they say that if you do rename them, the resulting graph is equivalent to the first. But that is not a licence to rename. There might be all sorts of pragmatic reasons to keep bnodeIDs stable, and even to transfer them between software components. The specs don't talk about behaviors of processing systems *at all*. 
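
A minimal Python sketch of what "equivalent under renaming" means, assuming the same toy encoding of triples as string tuples with `_:`-prefixed blank nodes. The helper names are hypothetical, and the check below only tests that a renaming is one-to-one on the graph's blank nodes; it is not a general graph isomorphism test.

```python
def is_bnode(term):
    return term.startswith("_:")

def rename_bnodes(graph, mapping):
    """Apply a blank node renaming to every triple of a graph."""
    return {
        tuple(mapping.get(t, t) if is_bnode(t) else t for t in triple)
        for triple in graph
    }

def is_consistent_renaming(graph, mapping):
    """A renaming keeps the graph equivalent only if it is one-to-one on its bnodes."""
    bnodes = {t for triple in graph for t in triple if is_bnode(t)}
    images = {mapping.get(b, b) for b in bnodes}
    return len(images) == len(bnodes) and all(is_bnode(i) for i in images)

graph = {
    ("ex:drawing", "ex:depicts", "_:xx"),
    ("_:xx", "rdf:type", "ex:Human"),
    ("ex:drawing", "ex:madeWith", "_:paint"),
}

good = {"_:xx": "_:b0", "_:paint": "_:b1"}
bad = {"_:xx": "_:b0", "_:paint": "_:b0"}  # collapses two distinct nodes

renamed = rename_bnodes(graph, good)
restored = rename_bnodes(renamed, {v: k for k, v in good.items()})
```

A one-to-one renaming can always be undone, so nothing is lost. A renaming that maps two blank nodes to the same label changes what the graph says. Neither observation obliges a component to rename anything, which is the distinction being drawn here.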

> To add a triple to a blank node, I may have to hold on to its blank node identifier for some time before I can complete the operation. But in a system of multiple parts, it's really hard to be sure about the behaviour of these identifiers, and to understand the guarantees that different systems have with regard to their stability. Will a concurrent update change the identifier of the node I am holding? Will a version upgrade change all the blank node identifiers?

This seems to me to just be an issue of being confused about the scopes of local identifiers. 

> 
> These problems all wouldn't exist if A hadn't decided to use a blank node ...
> 
> (And yeah I know this is all just because uneducated engineers misunderstand how to correctly use blank nodes ... But unfortunately it *is* their problems that RDF has to solve to be relevant.)
> 
> This is why A *really* should think twice for every blank node they put into their data. And this is why I *really* want to see some guidance on this topic in the place where people are likely to read about blank nodes.

I think we have to keep the guidance separate from the normative specs. But I'm all for providing guidance, as long as it really is good advice that is going to stay good for at least a decade.

Pat

> 
> Best,
> Richard
> 
> 
> 
>>> Given a triple _:a :bbb :ccc, it is not possible to author another triple _:a :xxx :yyy in another graph, the intention being that _:a is the same thing in both graphs. Given that the blank node label is arbitrary and cannot be assumed to be persistent, it is not possible to refer to the graph node from outside of the system where the graph originated.
>> 
>> I think you mean not possible to refer to the entity denoted by the blank node from outside, etc. To do that you have to give it a name, indeed. You can do this, if it is absolutely necessary,  by adding 
>> _:a owl:sameAs <URI> .
>> to the first graph and then using the URI outside. So it is possible when it needs to be done.
>> 
>>> Such outside reference to certain nodes is a requirement in a distributed system.
>> 
>> ...why? Surely it all depends on the node in question. Some things need to be publicly referable to, and these obviously should be given a URI. Others don't. The inner lists in an RDF collection used to encode some OWL syntax should never need to be referred to elsewhere, for example.
>> 
>>> 
>>>> *Why* is data using them worse than data which does not?
>>> 
>>> Because it is difficult to augment data that uses blank nodes with further data. Because it requires stepping outside of the RDF data model in order to remotely modify or otherwise work with an RDF graph that uses blank nodes.
>> 
>> For the first point, see above. I don't follow the second point. **Of course** it is possible to modify RDF containing blank nodes, just as one can with ground RDF. An RDF graph is just a large data object, you can do whatever you want to it.  Can you be more precise about what exactly the problems are here?
>> 
>>>> Worse in what sense, exactly?
>>> 
>>> Worse in the sense that it imposes large, and often prohibitive, additional costs on users of the data, which usually is not in the best interest of the publishers of the data.
>> 
>> You have not yet convinced me why or how this is so. 
>> 
>>>> Which processes are made more difficult when blank nodes are present?
>>> 
>>> Referring to nodes in the graph from other data; storing persistent references to a graph node for later recall;
>> 
>> You can't refer to nodes in RDF at all. I think what you mean is, URIs allow one to refer to the same entity in different graphs, whereas bnodeIDs are local to the graph and so have no meaning outside the graph. True; but again, I don't see why this is a practical problem. What plausible processes would ever need to access a locally scoped ID? Can you give an example? 
>> 
>>> integrating RDF graphs from different sources
>> 
>> What bnode problem is encountered here? 
>> 
>>> ; hyperlinking between RDF graphs
>> 
>> Again, why do bnodes cause a problem with such linking? 
>> 
>>> ; updating and modifying RDF graphs;
>> 
>> And again, I do not see any reason why the presence of bnodes makes updating and modifying more difficult. 
>> 
>>> merging RDF graphs;
>> 
>> Well, yes, there is a cost here, but it is surely not high enough to warrant such a draconian rule. How often do such merges happen? And in such a case, what the spec should do, at most, is point out the cost, not recommend courses of action based on the presumed need to avoid it.
>> 
>>>> And so forth. If answers to such questions are available, then let us discuss them and publish them if we all agree, but even then only in an informative note, not as part of the spec. 
>>> 
>>> The purpose of a specification is to promote interoperability between implementations. Implementation advice and usage notes are an important part of that. What are you trying to achieve by objecting to the inclusion of such material into the specification?
>> 
>> I just want to make sure that this material is based in fact and not just a kind of folk rumor. Specifications have to last for years and be usable in a wider range of circumstances than their writers (us) can imagine. They have to pass a very high barrier of accuracy and precision, therefore. 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 20 April 2011 22:08:13 UTC