Re: owl:sameAs - Harmful to provenance? from Pat Hayes on 2013-04-09 (public-semweb-lifesci@w3.org from April 2013)

From: Pat Hayes <phayes@ihmc.us>
Date: Mon, 8 Apr 2013 23:48:46 -0700
To: David Booth <david@dbooth.org>
Cc: Peter Ansell <ansell.peter@gmail.com>, Alan Ruttenberg <alanruttenberg@gmail.com>, public-semweb-lifesci <public-semweb-lifesci@w3.org>
Message-Id: <9135314B-B2B9-4E73-8C76-3576EB1C1E79@ihmc.us>
On Apr 8, 2013, at 7:07 AM, David Booth wrote:

> Hi Pat,
> 
> On 04/04/2013 02:03 AM, Pat Hayes wrote:
>> 
>> On Apr 3, 2013, at 9:00 PM, Peter Ansell wrote:
>> 
>>> On 4 April 2013 11:58, David Booth <david@dbooth.org> wrote:
>>> On 04/02/2013 05:02 PM, Alan Ruttenberg wrote:
>>> On Tuesday, April 2, 2013, David Booth wrote:
>>> On 03/27/2013 10:56 PM, Pat Hayes wrote:
>>> On Mar 27, 2013, at 7:32 PM, Jim McCusker wrote:
>>> 
>>> If only owl:sameAs were used correctly...
>>> 
>>> Well, I agree that is a problem, but don't draw the conclusion
>>> that there is something wrong with sameAs, just because people keep
>>> using it wrong.
>>> 
>>> Agreed.  And furthermore, don't draw the conclusion that someone
>>> has used owl:sameAs wrong just because you get garbage when you
>>> merge two graphs that individually worked just fine.  Those two
>>> graphs may have been written assuming different sets of
>>> interpretations.
>>> 
>>> In that case I would certainly conclude that they have used it
>>> wrong. Have you not been reading what Pat and I have been writing?
>>> 
>>> I've read lots of what you and Pat have written.  And I've learned
>>> a lot from it -- particularly in learning about ambiguity from Pat.
>>> And I'm in full agreement that owl:sameAs is *often* misused.
>>> 
>>> But I don't believe that getting garbage when merging two graphs
>>> that individually worked fine *necessarily* indicates that
>>> owl:sameAs was misused -- even when it appears on the surface to be
>>> causing the problem.
>> 
>> I agree, but not with your example and your analysis of it.
>> 
>>> Here's a simple example to illustrate.
>>> 
>>> Using the following prefixes throughout, for brevity:
>>> 
>>> @prefix :    <http://example/owen/> .
>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>> 
>>> Suppose that Owen is the URI owner of :x, :y and :z, and Owen
>>> defines them as follows:
>>> 
>>> # Owen's URI definition for :x, :y and :z
>>> :x a :Something .
>>> :y a :Something .
>>> :z a :Something .
>>> 
>>> That's all.  That's Owen's entire definition of those URIs.
>>> Obviously this definition is "ambiguous" in some sense.  But as we
>>> know, ambiguity is ultimately inescapable anyway, so I have merely
>>> chosen an example that makes the ambiguity obvious. As the RDF
>>> Semantics spec puts it: "It is usually impossible to assert enough
>>> in any language to completely constrain the interpretations to a
>>> single possible world".
>> 
>> Yes, but by making the ambiguity this "obvious", you have rendered
>> the example pointless. There is *no* content here *at all*, so Owen
>> has not really published anything. This is not typical of published
>> content, even in RDF. Typically, in fact, there is, as well as some
>> nontrivial actual RDF content, some kind of explanation, perhaps in
>> natural language, of what the *intended* content of the formal RDF is
>> supposed to be. While an RDF engine cannot of course make use of such
>> intuitive explanations, other authors of RDF can, and should, make
>> use of it to try to ensure that they do not make assertions which
>> would be counter to the referential intentions of the original
>> authors. For example, the Dublin Core URIs were published with almost
>> no formal RDF axioms, but quite elaborate natural language glosses
>> which enable them to be used in formal RDF with considerable success.
>> The fact that formal (and even informal) data is inherently ambiguous
>> does not mean that it is inherently, or even typically, vacuous.
> 
> This seems to suggest that natural language can somehow eliminate ambiguity, where formal languages cannot.  I don't buy that.  

I don't buy that either, and I didn't say that. As we apparently agree that this is false, there isn't much point in discussing it further. 

> Presumably whatever definition one expressed in natural language could be expressed in a formal language -- in principle at least.

No. What can be expressed in formal languages is a tiny fraction of what can be expressed in natural language. Virtually all of 20th-century formal logic captures the meaning, more or less, of the English words "some", "all", "and", "or" and "not". That is a vanishingly small fraction of the total vocabulary of English or any other natural language. And most everyday words are natural kind terms that have no exact definition.

>  And certainly the goal of the semantic web is to have such information expressed in a formal language that is amenable to machine processing.

To have some of it so expressed, yes. 
> 
> More precisely, the basic assumption I am making is that for (almost) any definition there exists a property such that neither that property nor its negation is entailed by the definition.  I.e., there is always more that can be said about the thing whose identity is defined.

Fine, I agree with that. That is a pretty good way to capture the idea of a natural kind term, in fact. 
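
A minimal sketch of that assumption, applied to the toy definition quoted earlier in the thread. It assumes a hypothetical three-element universe and treats Owen's typing triples as placing no constraint on which element each name denotes, so every assignment of :x, :y and :z counts as a model. The names `interpretations`, `entailed` and `same_xy` are illustrative only and come from no RDF library.

```python
from itertools import product

NAMES = ("x", "y", "z")
DOMAIN = (0, 1, 2)  # hypothetical three-element universe (an assumption)


def interpretations():
    """Yield every assignment of the names to elements of the domain."""
    for values in product(DOMAIN, repeat=len(NAMES)):
        yield dict(zip(NAMES, values))


def same_xy(interp):
    """The property ':x owl:sameAs :y' under one interpretation."""
    return interp["x"] == interp["y"]


def entailed(prop, models):
    """A property is entailed when it holds in every model."""
    return all(prop(m) for m in models)


# Assumption: Owen's definition only types the names, so in this
# simplified setting every assignment is a model of it.
owen_models = list(interpretations())

sameas_entailed = entailed(same_xy, owen_models)
different_entailed = entailed(lambda i: not same_xy(i), owen_models)

print(sameas_entailed, different_entailed)
```

Under these assumptions neither the property nor its negation holds in every model, so the definition leaves the question open.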

>  Maybe that assumption is wrong; I don't know.  If you think it's wrong, I'd be interested in hearing why.
> 
> The example may not be "realistic", but it is *not* pointless.  The whole point of choosing such a simple example is to expose the fundamental issues outright, rather than obscuring them in complexity that we cannot fully understand.  If there is some fundamental reason why you think this problem cannot happen in a more "realistic" example, then please explain what mechanism would come into play to prevent it.

What "problem" do you think there is here? I really have no idea what point you are making with this artificial toy example.

>> 
>>> Arthur, an RDF author, publishes the following graph, G1, making
>>> certain assumptions about the interpretations that will be applied
>>> to it:
>>> 
>>> # G1
>>> :x owl:sameAs :y .
>> 
>> On what basis does Arthur make this assertion? The URIs were coined
>> by Owen, and Owen says nothing that would sanction this assumption.
> 
> Why Arthur or anyone else chooses to assert whatever they choose to assert is their business.  It is irrelevant to this analysis.

No, it is absolutely central. Of course anyone can assert anything they please about anything, but when we read this stuff we want to have some justification for why we should believe it. 

>>> Aster, another RDF author, publishes the following graph, G2,
>>> making certain other assumptions about the interpretations that
>>> will be applied to it:
>>> 
>>> # G2
>>> :x owl:differentFrom :z .
>>> 
>>> Alfred, a third RDF author, publishes the following graph, G3,
>>> making still other assumptions about the interpretations that will
>>> be applied to it:
>>> 
>>> # G3
>>> :y owl:differentFrom :z .
>> 
>> Similarly for the other two. They are making assertions using names
>> that belong to, and were coined by, another author without having any
>> possible source of justification for these nontrivial claims. This
>> should not be regarded as good practice, to put it mildly.
> 
> Ditto.  If you are claiming that an RDF author needs some sort of "justification" to make assertions, then please explain exactly what you mean -- preferably in formal terms -- by "justification".

This is not a formal discussion we are having; it is a (kind of child's drawing of a) philosophical discussion. People (agents, perhaps) make assertions. Other people read those assertions. Should the readers believe or accept the assertions? That depends on the confidence they have about whether the agent doing the asserting is sincere and knows what they are talking about. There is a deep well of potential discussion about what exactly this means, but the general idea is surely understood by every adult human being.

>  E.g., does "justification" mean that Arthur may only make assertions that are entailed by Owen's definition?

No, it means that Arthur (a) is talking about the same things as Owen was and (b) is trustworthy as a source of information about that topic.

>  I already discussed that possibility below.
> 
>> 
>>> Note that G1, G2 and G3 are all individually consistent with Owen's
>>> URI definition.  Furthermore, G1, G2 and G3 are all pair-wise
>>> consistent: there exists at least one satisfying interpretation for
>>> the merge of each pair.  But the merge of G1, G2 and G3 is not
>>> consistent:
>> 
>> This kind of behavior is of course quite typical in any assertional
>> language.
> 
> Yes.
> 
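
A rough way to check consistency claims of this kind mechanically, assuming only owl:sameAs and owl:differentFrom matter and ignoring all other OWL semantics: merge sameAs pairs with a union-find and look for a differentFrom pair that ends up in one class. The function names and the variant graph G3' are hypothetical. Taken literally as quoted here, G1, G2 and G3 are jointly satisfiable (let :x and :y denote one thing and :z another); the pattern of pairwise consistency with an inconsistent three-way merge appears if G3 instead asserted `:y owl:sameAs :z`, which is the variant checked below.

```python
from itertools import combinations


def consistent(*graphs):
    """Return True if the sameAs/differentFrom triples have a model."""
    triples = [t for g in graphs for t in g]
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # First merge everything asserted to be the same.
    for s, p, o in triples:
        if p == "sameAs":
            root_s, root_o = find(s), find(o)
            parent[root_s] = root_o
    # Then check that no differentFrom pair ended up merged.
    return all(find(s) != find(o)
               for s, p, o in triples if p == "differentFrom")


def report(graphs):
    """Return (all pairs consistent, full merge consistent)."""
    pairwise = all(consistent(a, b) for a, b in combinations(graphs, 2))
    return pairwise, consistent(*graphs)


G1 = [("x", "sameAs", "y")]
G2 = [("x", "differentFrom", "z")]
G3 = [("y", "differentFrom", "z")]          # as quoted in the thread
G3_VARIANT = [("y", "sameAs", "z")]         # hypothetical variant

as_quoted = report([G1, G2, G3])
variant = report([G1, G2, G3_VARIANT])
print(as_quoted, variant)
```

With the graphs as quoted, both the pairwise and the three-way checks succeed; with the hypothetical variant, every pair is consistent but the full merge is not.
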
>> 
>>> Arthur, Aster and Alfred made different assumptions about the set
>>> of interpretations that would be applied to their graphs, and the
>>> intersection of those sets was empty.
>>> 
>>> Did Arthur misuse owl:sameAs?   What if Aster never published G2?
>>> How could Aster's graph possibly affect the question of whether
>>> *Arthur* misused owl:sameAs?  It would be nonsensical to assume
>>> that it could.
>> 
>> Why? Surely if Aster had a more reliable access to the primary source
>> of information about these enigmatic thingies than Arthur did, then
>> it might well be the case that Aster's publication could reveal
>> errors in Arthur's, by contradicting him.
> 
> What do you mean by "more reliable"?  Both Arthur and Aster had access to the exact same URI definition from Owen.  

But in your example, Owen didn't supply any definition. Which is why this example is so silly and fails to be a realistic example of anything. There is *no way* that Arthur and Aster can *possibly* have reliable knowledge about what Owen was talking about, because Owen wasn't talking about anything. 

> Are you suggesting that Arthur and/or Aster should have used a *different* URI definition?  If so, what definition and why?
> 
>> 
>>> What if Owen later said that Arthur was correct, that :x == :y ?
>>> What if he later said the opposite?  Again, it would seem rather
>>> bizarre to say that the determination of whether Arthur had
>>> misused owl:sameAs could be changed -- long after Arthur had
>>> written G1 -- by Owen's later statements.
>> 
>> Again, I don't find this bizarre in the least. It might be, if there
>> was no truth of the matter concerning all this stuff, so that all
>> these assertions were made independently with equal (or equal lack
>> of) authority as to their actual truth. But that is so implausible
>> and artificial an assumption that I don't see why we need to even
>> discuss it.
> 
> The RDF Semantics is explicitly agnostic about interpretations and "actual truth".

The formal model theory is about how truth (and falsity) arise from interpretation mappings. That is "actual" truth we are talking about, but the model theory does not mention a host of other things that are relevant to truth, as the document tries to explain in its early introductory sections. 

> Owen published a URI definition

No, he didn't. A vacuous assertion is not a definition. 

> , and Arthur, Aster and Alfred published a bunch of assertions.  Whether anyone "believes" any of those assertions, whether those assertions have any bearing on the "real world", and whether they are at all useful to anyone's applications, are entirely different questions.  AFAICT those questions are irrelevant to the technical question of whether Arthur "misused" owl:sameAs.
> 
>> 
>>> One might claim that Arthur misused owl:sameAs because Owen had not
>>> specified whether :x was the same or different from :y or :z, and
>>> therefore Arthur had improperly *guessed* about the value of :x's
>>> owl:sameAs property.
>>> 
>>> But by that logic, Arthur would not be able to assert *anything*
>>> new about :x.  I.e., Arthur would not be allowed to assert any
>>> property whose value was not already entailed by Owen's
>>> definition!
>> 
>> Arthur may add information, of course. But Arthur is responsible for
>> the truth of what he asserts, and part of that responsibility, in
>> practice, is to take care to ascertain what the intended referents
>> are of any URIs published by others, that Arthur then uses in his
>> assertions.
> 
> But Arthur, Aster and Alfred were each fully diligent in ensuring that their assertions were consistent with all information that Owen provided.  What more could they do?

Consistency is not centrally important here. What they should do is to try to determine what Owen had been talking about, perhaps in the ultimate case by actually asking him, and then they could, if they are good citizens of course, publish more information about that topic if they have it and believe it to be true. What they should not do is make random assertions using Owen's URIs when they have no idea what Owen intended those URIs to denote. 

> 
>> For example, if I (as I recently did) wish to assert that
>> something was red in color, I might use the URI
>> 
>> http://linkedopencolors.moreways.net/color/rgb/ff0000.html
>> 
>> rather than, say,
>> 
>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>> 
>> because I know, using my color vision (not available to RDF engines)
>> that the first one refers to red and the second one to green, which
>> (I also know) is not red. I *could* use the second URI and insist
>> that I intended it to denote the color red, but that would be stupid,
>> since readers of my RDF will (and indeed should) misunderstand me. If
>> I were to assert that
>> 
>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>> owl:sameAs http://linkedopencolors.moreways.net/color/css/red.html
>> .
>> 
>> then I would be saying something false. And yes, in that case, it
>> *is* my error, even if what I have said is formally consistent (which
>> it in fact is) with the published RDF "definition" of these URIs
>> (which is in fact empty.)
> 
> In that example there were additional constraints that were not expressed formally

Yes, exactly my point. 

> -- such as the fact that red and green are different colors, and what wavelengths correspond to which colors, etc.  But unless you are claiming that assertions expressed in natural language can somehow avoid ambiguity where formal assertions cannot, then for the sake of analysis we can assume that all assertions have been expressed formally.

No. I am not claiming that ambiguity can be entirely eliminated, but it does not follow that all of natural language can be expressed formally. (Color is actually a good example: color perception is not expressible in terms of wavelengths. AFAIK, there is no way to formally axiomatize perception of color.)

> 
> I am also assuming that in the vast majority of cases, a URI's resource identity will be defined by a description, rather than by ostension
> http://plato.stanford.edu/entries/identity/
> so I am focusing on that case.
> 
>> 
>>> And that would render RDF rather pointless.
>> 
>> Why would it render it pointless? The point of RDF is not to make
>> completely unjustified statements about nothing in particular.
> 
> RDF is designed to allow anyone to say anything about anything.  If someone chooses to make completely unjustified statements about nothing in particular, that is their business.  AFAICT that is completely irrelevant to the technical question of whether owl:sameAs was used incorrectly.
> 
>> 
>>> Maybe someone can see a way to avoid this dilemma.  Maybe someone
>>> can figure out a way to distinguish between the "essential"
>>> properties that serve to identify a resource, and other
>>> "inessential" properties that the resource might have. If so, and
>>> the number of "essential" properties is finite, then indeed this
>>> problem could be avoided by requiring every URI owner to define all
>>> of the "essential" properties of the URI's denoted resource, or by
>>> prohibiting anyone but the URI owner from asserting any new
>>> "essential" properties of the resource (beyond those the URI owner
>>> had defined).  Or maybe there is another way around this dilemma.
>> 
>> What do you see the "dilemma" here as being, exactly? It seems to me
>> that this is not about RDF as such at all. It is about data, however
>> that data is recorded. People can publish data about things. They do
>> so by making assertions. In an ideal world, everyone is responsible
>> for the assertions they make. Other people can put together
>> information from various sources, but the reliability of the result
>> is hostage to the reliability of all the sources that are used. All
>> this is kind of obvious, but what else is being said in this thread?
> 
> The dilemma is that we would like each URI to always denote the same thing in all RDF datasets, so that when we merge RDF datasets, the merge will make sense: the merge will be consistent and an application that worked properly on an individual RDF dataset will also work properly on the merge of that dataset with other datasets.  But because URI definitions are inherently ambiguous, different RDF authors will interpret them differently,

OK so far, but...

> and this leads to inconsistencies when datasets are merged

...Wrong. At least, this does not follow. Take some RDF, call it A, and some other, call it B. I believe A, and you believe B. Both A and B contain some URIs, whose referents are ambiguous: there are many things that they could denote, given the truth of both A and B. There are interpretations of A which do not satisfy B, and vice versa: so my beliefs allow for interpretations of those ambiguous URIs which would be ruled out by your beliefs, and vice versa. All this can be true, and still A and B may be mutually consistent. In fact, this is the normal case. When we merge A and B, the new, larger, piece of RDF (A+B) represents what might be called our joint beliefs about the things that the URIs are being taken to denote. We each learn something from the merge, and each know more about these things that the URIs denote. The ambiguous URIs are now slightly less ambiguous.

What is the problem that you see with this? This is a sketch of the basic process of taking information from multiple sources and combining it to draw new conclusions. We all do this all the time we are awake, without effort or being caught up in some kind of logical problem. True, it can happen that inconsistencies emerge when information is combined. When they do, we have a variety of strategies for dealing with that situation. We might push back on one of our information sources to check it ("Did you say he was supposed to be here by 2 o'clock?") or we might take a second look, or simply think harder to try to resolve the problem ("I thought I had left it on the table, but maybe my memory is faulty.")
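
The merge of A and B described above can be sketched by brute force, under stated assumptions: a hypothetical three-element universe, three names, and two made-up single-triple graphs standing in for A and B. The models of the merge are exactly the interpretations that satisfy both graphs, so merging can only shrink the set of candidate referents.

```python
from itertools import product

NAMES = ("x", "y", "z")
DOMAIN = (0, 1, 2)  # hypothetical three-element universe (an assumption)


def models(constraints):
    """Return the set of assignments satisfying every constraint."""
    result = set()
    for values in product(DOMAIN, repeat=len(NAMES)):
        interp = dict(zip(NAMES, values))
        if all(check(interp) for check in constraints):
            result.add(values)
    return result


# Hypothetical stand-ins for A and B, one triple each.
A = [lambda i: i["x"] == i["y"]]   # :x owl:sameAs :y
B = [lambda i: i["y"] != i["z"]]   # :y owl:differentFrom :z

models_a = models(A)
models_b = models(B)
models_merged = models(A + B)      # the merge asserts both graphs

print(len(models_a), len(models_b), len(models_merged))
```

In this toy setting some models of A fail B and some models of B fail A, yet the merge still has models, and fewer of them than either graph alone: the two graphs are mutually consistent and the names are less ambiguous after merging.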

> -- even when all parties have acted in good faith and have done all that they could reasonably have been expected to do to avoid such conflicts.

Actually, I think this is relatively rare when indeed good practices have been followed. Your artificial example is hardly one of people following good practice, note. 

> 
> Key assumptions:
> 
> 1. Owen's URI definition will always be ambiguous.

True, but the extent can be minimized. And in some cases it can be eliminated completely, by ostension. (My color example is arguably such a case, in fact.)

>  There will always exist a property p such that neither p nor its negation is entailed by the URI definition.

Of course, but that is not a definition of ambiguity. 

> 
> 2. Owen cannot be expected to forever refine his URI definition by adding disambiguation at the request of every RDF author who uses his URIs.  At some point, Owen will reach the point of saying "that's all the disambiguation you get".  (This is the point at which the example that I gave begins.)

I think that in practice, it is relatively straightforward, when composing actual data, to ensure that there is enough information available to disambiguate sufficiently to enable most relevant entailments to be established. Take, for example (one of my favorites), data about Everest. The name, "Everest", is what one might call ontologically ambiguous: it refers to a mountain, but there are probably as many ways to individuate the term "mountain" as there are ontologists on the planet. So any data about Everest is going to be deeply ambiguous. Still, this ambiguity in a sense does not matter because it is orthogonal to the information that is being expressed, such as the height, the weather history, the dates it was climbed by various people, the names of the frozen corpses, etc. The fact that two people might have different ontologies of mountainhood is irrelevant to data like this: each will interpret the data using their own notion of mountain, and they will be referring to the same thing in the actual world, even if their conceptualizations of this one thing are different, and even incompatible. There can be cases where the conceptualization does matter, of course: in the case of Everest, where one draws its geographical boundary is a matter of some dispute, and merging RDF from Chinese and Nepalese sources might well get you an OWL inconsistency. But this is exactly what one would expect: in this case, the various sources genuinely disagree about something. Yes, indeed, that can produce inconsistencies if one tries to believe them both. But again, I see no semantic *problem* here. This is exactly what one would expect to happen, and the logical semantics is operating correctly when it detects inconsistencies such as this.

>>> 
>>> Unless some way around this dilemma is found, it seems unreasonably
>>> judgemental to accuse Arthur of misusing owl:sameAs in this case,
>> 
>> Possibly, yes, but not because...
>> 
>>> since he didn't assert anything that was inconsistent with Owen's
>>> URI definition
>> 
>> Consistency is not the point. If I make completely unfounded
>> assertions about a topic that you have introduced, then the fact they
>> might be logically consistent with what you have said is neither here
>> nor there. What matters is whether I have the authority to make the
>> assertions I do, or whether I am lying, fabricating or simply
>> fantasizing using Owen's vocabulary.
> 
> Can you translate that into more objective technical terms?

No. I was (and am) writing in English, about publishing data on the Web. That is (I hope) what we are talking about here. 

>  What exactly does "unfounded" mean?  

In the above, it means that the author of the information has no basis for making the assertions, and does not know what the terms used in those assertions were intended to mean. 

> And what do you mean by "authority"? What objective technical criteria are you suggesting?  And why is it relevant to the question of whether Arthur misused

What do you mean by "misused"? If anyone can make any assertions on any, or no, basis of information, what can possibly count as "misuse"?

> owl:sameAs, given that the RDF Semantics is explicitly agnostic about interpretations?

We are not talking about the RDF semantics, but about the business of combining data from multiple, diverse, sources. The RDF semantics does not deal with this larger matter, although of course it is relevant to it when the information is encoded in RDF.

Pat


> 
> David Booth
> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 9 April 2013 08:20:28 UTC