Re: Skolemization and RDF Semantics from Ivan Herman on 2011-04-27 (public-rdf-wg@w3.org from April 2011)

From: Ivan Herman <ivan@w3.org>
Date: Wed, 27 Apr 2011 18:28:06 +0200
To: Richard Cyganiak <richard@cyganiak.de>
Cc: "RDF-WG public-rdf-wg@w3.org" <public-rdf-wg@w3.org>, Pat Hayes <phayes@ihmc.us>, Steve Harris <steve.harris@garlik.com>, Dan Brickley <danbri@danbri.org>, David Wood <dpw@talis.com>, Margaret Warren <info@margaretwarren.us>
Message-Id: <DFCF3D98-B658-49BF-894C-6FA237C1EB6C@w3.org>
On Apr 27, 2011, at 17:53 , Richard Cyganiak wrote:

> Hi Ivan,
> 
> On 27 Apr 2011, at 09:12, Ivan Herman wrote:
>> On the one hand, it is not the goal of this WG to remove blank nodes from RDF, or to change its semantics.
> 
> Just to be very clear: This option was never proposed or discussed anywhere in this thread. It is not an option that is on the table. We all understand that.
> 
>> On the other hand, it is also clear that many in the community have problems with blank node usage and, in many cases, would like to replace them with explicit URI-s, while still having the possibility to emphasize and get other tools to recognize the somewhat 'anonymous' nature of those nodes. I believe it is our job in the WG to give some guidance on how to achieve that in a proper manner. The formulation put forward by Sandro provided just that, and I think it would be a service to our community to get that in writing. Yes, it is informative, but very helpful nevertheless.
>> 
>> It has been raised that this should be part of, say, the Primer or the concept document. Personally, I have a preference for the latter, but that is only me.
> 
> +1
> 
>> However, we have to recognize that the .well-known 'scheme' of Sandro, as well as a urn:genid type URI require formal steps by the IETF to get those registered. In some sense, maybe the biggest service the WG could do is to get through these registrations so that the rest of the community could use those skolemization approaches (afaik, there is no similar registration for the tag: scheme if that is what we choose to use). I presume a proper document is necessary for those registrations, which is one more reason why the skolemization scheme should be part of our documents...
> 
> As you know, we have experience with registering .well-known prefixes (we did it for VoID), and it's fairly straightforward.
> 
> After last week's call, some of us spent some time on IRC trying to find a set of words that worked for everyone. The result is captured in slightly dissociated form in [1], and I'll give it here for reference:
> 
> | (Some intro text explaining what a skolem URI is)
> |
> | Systems wishing to skolemise bNodes, and expose those skolem
> | constants to external systems (e.g. in query results) SHOULD
> | mint a "fresh" (globally unique) URI for each bNode. Systems
> | performing skolemisation may wish to do so in a way that they
> | can recognise the constants once skolemised, and map back to
> | the source blank nodes where possible.
> |
> | Systems which want their skolem constants to be identifiable
> | by other systems SHOULD use the .well-known URI prefix. (W3C
> | will register an appropriate .well-known URI prefix with IANA
> | as per RFC 5785. Details TBD.)
> 
> Perhaps we can hear any objections that remain to this phrasing?

FWIW, I am happy with this phrasing. I see some minor open issues that we have to solve:

- do we stick to .well-known/genid, or do we want to use other terms instead of genid (there was some discussion about that)?
- do we also want to have non-http versions of the URI-s that are likely to be more readable? The tag: scheme came up (though that is not necessarily more readable), as did the urn:genid: version.

Both of these are minor, and we should be able to solve them quickly.
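
To make the quoted phrasing concrete, here is a minimal sketch of what such skolemisation might look like. It is an illustration under stated assumptions, not an agreed design: the path segment "genid" is exactly the term still under discussion above, the authority "example.org" and the names mint_skolem_uri, is_skolem_uri and Skolemizer are hypothetical, and terms are represented as plain strings with blank nodes written as "_:label".

```python
import uuid
from urllib.parse import urlsplit

# Assumption: "genid" is the segment under .well-known; the final name is undecided.
GENID_PATH = "/.well-known/genid/"


def mint_skolem_uri(authority: str) -> str:
    """Mint a fresh, globally unique skolem URI under the given authority."""
    return f"http://{authority}{GENID_PATH}{uuid.uuid4().hex}"


def is_skolem_uri(uri: str) -> bool:
    """Let other systems recognise a skolem constant by its well-known prefix."""
    return urlsplit(uri).path.startswith(GENID_PATH)


class Skolemizer:
    """Replaces blank node labels with skolem URIs and remembers the mapping."""

    def __init__(self, authority: str) -> None:
        self.authority = authority
        self._to_uri = {}    # blank node label -> skolem URI
        self._to_bnode = {}  # skolem URI -> blank node label

    def skolemize(self, term: str) -> str:
        # Only blank nodes ("_:label") are replaced; other terms pass through.
        if not term.startswith("_:"):
            return term
        if term not in self._to_uri:
            uri = mint_skolem_uri(self.authority)
            self._to_uri[term] = uri
            self._to_bnode[uri] = term
        return self._to_uri[term]

    def deskolemize(self, term: str) -> str:
        # Map back to the source blank node where this system minted the URI.
        return self._to_bnode.get(term, term)


if __name__ == "__main__":
    sk = Skolemizer("example.org")  # hypothetical authority
    triple = ("_:a", "http://example.org/bbb", "http://example.org/ccc")
    print(tuple(sk.skolemize(t) for t in triple))
```

The two-way mapping covers the "map back to the source blank nodes where possible" part of the proposed text. A system that only exposes skolem constants and never needs to reverse them could drop it.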

> 
> (I also note that we don't appear to have an issue on the tracker for skolemization. I suspect that there was some reason for this that I'm missing; if that's not the case, then can someone please raise an issue?)


Done. Tracker, this is relevant for ISSUE-40

Thanks

Ivan



> 
> Best,
> Richard
> 
> [1] http://www.w3.org/2011/rdf-wg/wiki/Skolemization
> 
> 
>> 
>> Ivan
>> 
>> P.S. As an example, the RDB2RDF is also struggling with the issue of generating bnode-replacement-uri-s and similar schemes like Sandro's came to the fore. A good example where part of the community would have been very pleased to have a ready made solution at their disposal...
>> 
>> 
>> 
>> On Apr 27, 2011, at 01:03 , Richard Cyganiak wrote:
>> 
>>> Hi Pat,
>>> 
>>> On 20 Apr 2011, at 23:07, Pat Hayes wrote:
>>>> But there are many cases where it makes perfect sense to use a blank node simply because there is no natural  identifier available for the entity in question.
>>> 
>>> I don't think there are “natural identifiers” for anything -- someone always has to mint the URI, or define the URI scheme, or whatever.
>>> 
>>>> For example, I am helping a company design a system to mark up art images using RDFa. I have a drawing of a reclining nude. How do I say this? I want to say the drawing depicts **a person** who is female, nude and in a reclining position. I have absolutely no idea who this person actually is, or even if there ever was a model for this drawing. It seems absolutely natural and correct for me to use a blank node here: the drawing depicts _:xx who has rdf:type :human and rdf:type :female and .. etc.. Obviously, I could coin a URI to denote this hypothetical model, but that URI would not convey anything that is not conveyed by the bnode, and it might well be interpreted to mean that I have information about the model (since I apparently have a name that 'identifies' her, which URIs are said to do); but of course I don't have any information about her, which is why the bnode is useful.
>>> 
>>> Well, but later on I might look at the image and recognize the model, and I'd like to make that explicit with an owl:sameAs statement. If there's a blank node in the data, it's impossible. If there's a URI, I can use that to make my statement. By modeling the depicted person as a blank node, the publisher of the data has prevented a certain use case, and thus made the data (marginally, perhaps, but still) less useful.
>>> 
>>>> This kind of construction comes up in our project all the time; almost every artistic classification has a hidden existential in it somewhere (a landscape drawing is one that depicts **a** landscape, etc.). Are we to be obliged to invent URIs to refer to all these things?
>>> 
>>> No one is *obliged* to invent URIs for all these things. Nevertheless, doing the work and naming those things is a service to users of the data.
>>> 
>>>> Every artwork will have a cloud of URIs surrounding it to refer to the people and places it might depict, the particular piece of paint that was used to make a particular mark, the particular composition line that it alone has, etc.. What purpose is served by this proliferation of unresolvable URIs?
>>> 
>>> Resolvability is an orthogonal issue. (And at least in RDFa not a hard one.)
>>> 
>>> The purpose of this URI proliferation should be obvious: Others can link to the URIs. They can provide additional annotations. They can provide additional data as it becomes available in further analysis. And so on. As soon as you start crossing system boundaries (i.e., the Web), blank nodes become showstoppers for certain use cases.
>>> 
>>> (As I said earlier, minting a URI has a cost, and it has a benefit, and whether the total is positive or negative depends on the situation. But there *is* a benefit to naming things explicitly, always. It may just not be worth the effort.)
>>> 
>>>>> Part of the problem is that real RDF systems often are built by plugging together different components -- stores, parsers, APIs, serializers, servers -- and any of them might rename blank nodes that pass through it at any time, since the specs say that this is ok.
>>>> 
>>>> No, they say that if you do rename them, the resulting graph is equivalent to the first. But that is not a licence to rename.
>>> 
>>> Renaming blank nodes does not change the meaning of a graph. It still says the same thing. It is equivalent. The specs say all these things. How is that not a licence to rename?
>>> 
>>>> There might be all sorts of pragmatic reasons to keep bnodeIDs stable, and even to transfer them between software components.
>>> 
>>> This may have been a theoretical option when RDF was new, but the way the ecosystem has evolved, it is now no longer possible. That train has left the station long ago. There are very few stores (in-memory or persistent) that keep blank node labels stable when reading data. Try loading some RDF, adding a triple, and writing it out again, in any API. Chances are extremely high that you'll end up with different labels (or without labels, where the syntax allows it).
>>> 
>>>>> To add a triple to a blank node, I may have to hold on to its blank node identifier for some time before I can complete the operation. But in a system of multiple parts, it's really hard to be sure about the behaviour of these identifiers, and to understand the guarantees that different systems have with regard to their stability. Will a concurrent update change the identifier of the node I am holding? Will a version upgrade change all the blank node identifiers?
>>>> 
>>>> This seems to me to just be an issue of being confused about the scopes of local identifiers. 
>>> 
>>> No, the issue is that implementations generally treat blank node labels as not meaningful, and if a user assumes that any given system will keep blank nodes stable, then this assumption is very likely to bite them at some point (unless the system documentation makes explicit guarantees).
>>> 
>>>> I think we have to keep the guidance separate from the normative specs.
>>> 
>>> How about the Primer?
>>> 
>>> How about informative sections in Concepts and Abstract Syntax?
>>> 
>>>> But I'm all for providing guidance, as long as it really is good advice that is going to stay good for at least a decade.
>>> 
>>> I'll get busy building that time machine right away so that I can prove the quality of the advice to your satisfaction.
>>> 
>>> Best,
>>> Richard
>>> 
>>> 
>>>> 
>>>> Pat
>>>> 
>>>>> 
>>>>> Best,
>>>>> Richard
>>>>> 
>>>>> 
>>>>> 
>>>>>>> Given a triple _:a :bbb :ccc, it is not possible to author another triple _:a :xxx :yyy in another graph, the intention being that _:a is the same thing in both graphs. Given that the blank node label is arbitrary and cannot be assumed to be persistent, it is not possible to refer to the graph node from outside of the system where the graph originated.
>>>>>> 
>>>>>> I think you mean not possible to refer to the entity denoted by the blank node from outside, etc. To do that you have to give it a name, indeed. You can do this, if it is absolutely necessary,  by adding 
>>>>>> _:a owl:sameAs <URI> .
>>>>>> to the first graph and then using the URI outside. So it is possible when it needs to be done.
>>>>>> 
>>>>>>> Such outside reference to certain nodes is a requirement in a distributed system.
>>>>>> 
>>>>>> ...why? Surely it all depends on the node in question. Some things need to be publicly referable to, and these obviously should be given a URI. Others don't. The inner lists in an RDF collection used to encode some OWL syntax should never need to be referred to elsewhere, for example.
>>>>>> 
>>>>>>> 
>>>>>>>> *Why* is data using them worse than data which does not?
>>>>>>> 
>>>>>>> Because it is difficult to augment data that uses blank nodes with further data. Because it requires stepping outside of the RDF data model in order to remotely modify or otherwise work with an RDF graph that uses blank nodes.
>>>>>> 
>>>>>> For the first point, see above. I don't follow the second point. **Of course** it is possible to modify RDF containing blank nodes, just as one can with ground RDF. An RDF graph is just a large data object, you can do whatever you want to it.  Can you be more precise about what exactly the problems are here?
>>>>>> 
>>>>>>>> Worse in what sense, exactly?
>>>>>>> 
>>>>>>> Worse in the sense that it imposes large, and often prohibitive, additional costs on users of the data, which usually is not in the best interest of the publishers of the data.
>>>>>> 
>>>>>> You have not yet convinced me why or how this is so. 
>>>>>> 
>>>>>>>> Which processes are made more difficult when blank nodes are present?
>>>>>>> 
>>>>>>> Referring to nodes in the graph from other data; storing persistent references to a graph node for later recall;
>>>>>> 
>>>>>> You can't refer to nodes in RDF at all. I think what you mean is, URIs allow one to refer to the same entity in different graphs, whereas bnodeIDs are local to the graph and so have no meaning outside the graph. True; but again, I don't see why this is a practical problem. What plausible processes would ever need to access a locally scoped ID? Can you give an example? 
>>>>>> 
>>>>>>> integrating RDF graphs from different sources
>>>>>> 
>>>>>> What bnode problem is encountered here? 
>>>>>> 
>>>>>>> ; hyperlinking between RDF graphs
>>>>>> 
>>>>>> Again, why do bnodes cause a problem with such linking? 
>>>>>> 
>>>>>>> ; updating and modifying RDF graphs;
>>>>>> 
>>>>>> And again, I do not see any reason why the presence of bnodes makes updating and modifying more difficult. 
>>>>>> 
>>>>>>> merging RDF graphs;
>>>>>> 
>>>>>> Well, yes, there is a cost here, but it is surely not high enough to warrant such a draconian rule. How often do such merges happen? And in such a case, what the spec should do, at most, is point out the cost, not recommend courses of action based on the presumed need to avoid it.
>>>>>> 
>>>>>>>> And so forth. If answers to such questions are available, then let us discuss them and publish them if we all agree, but even then only in an informative note, not as part of the spec. 
>>>>>>> 
>>>>>>> The purpose of a specification is to promote interoperability between implementations. Implementation advice and usage notes are an important part of that. What are you trying to achieve by objecting to the inclusion of such material into the specification?
>>>>>> 
>>>>>> I just want to make sure that this material is based in fact and not just a kind of folk rumor. Specifications have to last for years and be usable in a wider range of circumstances than their writers (us) can imagine. They have to pass a very high barrier of accuracy and precision, therefore. 
>>>>> 
>>>> 
>>>> ------------------------------------------------------------
>>>> IHMC                                     (850)434 8903 or (650)494 3973   
>>>> 40 South Alcaniz St.           (850)202 4416   office
>>>> Pensacola                            (850)202 4440   fax
>>>> FL 32502                              (850)291 0667   mobile
>>>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>>>> 
>>> 
>>> 
>> 
>> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>> FOAF: http://www.ivan-herman.net/foaf.rdf
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Wednesday, 27 April 2011 16:27:13 UTC