Re: owl:sameAs - Harmful to provenance? from David Booth on 2013-04-08 (public-semweb-lifesci@w3.org from April 2013)

From: David Booth <david@dbooth.org>
Date: Mon, 08 Apr 2013 10:07:28 -0400
To: Pat Hayes <phayes@ihmc.us>
CC: Peter Ansell <ansell.peter@gmail.com>, Alan Ruttenberg <alanruttenberg@gmail.com>, public-semweb-lifesci <public-semweb-lifesci@w3.org>
Message-ID: <5162CF20.6070107@dbooth.org>

Hi Pat,

On 04/04/2013 02:03 AM, Pat Hayes wrote:
>
> On Apr 3, 2013, at 9:00 PM, Peter Ansell wrote:
>
>> On 4 April 2013 11:58, David Booth <david@dbooth.org> wrote:
>> On 04/02/2013 05:02 PM, Alan Ruttenberg wrote:
>> On Tuesday, April 2, 2013, David Booth wrote:
>> On 03/27/2013 10:56 PM, Pat Hayes wrote:
>> On Mar 27, 2013, at 7:32 PM, Jim McCusker wrote:
>>
>> If only owl:sameAs were used correctly...
>>
>> Well, I agree that is a problem, but don't draw the conclusion
>> that there is something wrong with sameAs, just because people keep
>> using it wrong.
>>
>> Agreed.  And furthermore, don't draw the conclusion that someone
>> has used owl:sameAs wrong just because you get garbage when you
>> merge two graphs that individually worked just fine.  Those two
>> graphs may have been written assuming different sets of
>> interpretations.
>>
>> In that case I would certainly conclude that they have used it
>> wrong. Have you not been reading what Pat and I have been writing?
>>
>> I've read lots of what you and Pat have written.  And I've learned
>> a lot from it -- particularly in learning about ambiguity from Pat.
>> And I'm in full agreement that owl:sameAs is *often* misused.
>>
>> But I don't believe that getting garbage when merging two graphs
>> that individually worked fine *necessarily* indicates that
>> owl:sameAs was misused -- even when it appears on the surface to be
>> causing the problem.
>
> I agree, but not with your example and your analysis of it.
>
>> Here's a simple example to illustrate.
>>
>> Using the following prefixes throughout, for brevity:
>>
>> @prefix :    <http://example/owen/> .
>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>
>> Suppose that Owen is the URI owner of :x, :y and :z, and Owen
>> defines them as follows:
>>
>> # Owen's URI definition for :x, :y and :z
>> :x a :Something .
>> :y a :Something .
>> :z a :Something .
>>
>> That's all.  That's Owen's entire definition of those URIs.
>> Obviously this definition is "ambiguous" in some sense.  But as we
>> know, ambiguity is ultimately inescapable anyway, so I have merely
>> chosen an example that makes the ambiguity obvious. As the RDF
>> Semantics spec puts it: "It is usually impossible to assert enough
>> in any language to completely constrain the interpretations to a
>> single possible world".
>
> Yes, but by making the ambiguity this "obvious", you have rendered
> the example pointless. There is *no* content here *at all*, so Owen
> has not really published anything. This is not typical of published
> content, even in RDF. Typically, in fact, there is, as well as some
> nontrivial actual RDF content, some kind of explanation, perhaps in
> natural language, of what the *intended* content of the formal RDF is
> supposed to be. While an RDF engine cannot of course make use of such
> intuitive explanations, other authors of RDF can, and should, make
> use of it to try to ensure that they do not make assertions which
> would be counter to the referential intentions of the original
> authors. For example, the Dublin Core URIs were published with almost
> no formal RDF axioms, but quite elaborate natural language glosses
> which enable them to be used in formal RDF with considerable success.
> The fact that formal (and even informal) data is inherently ambiguous
> does not mean that it is inherently, or even typically, vacuous.

This seems to suggest that natural language can somehow eliminate 
ambiguity, where formal languages cannot.  I don't buy that.  Presumably 
whatever definition one expressed in natural language could be expressed 
in a formal language -- in principle at least.  And certainly the goal 
of the semantic web is to have such information expressed in a formal 
language that is amenable to machine processing.

More precisely, the basic assumption I am making is that for (almost)
any definition there exists a property such that neither that property
nor its negation is entailed by the definition.  I.e., there is always
more that can be said about the thing whose identity is defined.  Maybe
that assumption is wrong; I don't know.  If you think it's wrong, I'd be
interested in hearing why.

The example may not be "realistic", but it is *not* pointless.  The 
whole point of choosing such a simple example is to expose the 
fundamental issues outright, rather than obscuring them in complexity 
that we cannot fully understand.  If there is some fundamental reason 
why you think this problem cannot happen in a more "realistic" example, 
then please explain what mechanism would come into play to prevent it.

>
>> Arthur, an RDF author, publishes the following graph, G1, making
>> certain assumptions about the interpretations that will be applied
>> to it:
>>
>> # G1
>> :x owl:sameAs :y .
>
> On what basis does Arthur make this assertion? The URIs were coined
> by Owen, and Owen says nothing that would sanction this assumption.

Why Arthur or anyone else chooses to assert whatever they choose to 
assert is their business.  It is irrelevant to this analysis.

>
>> Aster, another RDF author, publishes the following graph, G2,
>> making certain other assumptions about the interpretations that
>> will be applied to it:
>>
>> # G2
>> :x owl:differentFrom :z .
>>
>> Alfred, a third RDF author, publishes the following graph, G3,
>> making still other assumptions about the interpretations that will
>> be applied to it:
>>
>> # G3
>> :y owl:differentFrom :z .
>
> Similarly for the other two. They are making assertions using names
> that belong to, and were coined by, another author without having any
> possible source of justification for these nontrivial claims. This
> should not be regarded as good practice, to put it mildly.

Ditto.  If you are claiming that an RDF author needs some sort of 
"justification" to make assertions, then please explain exactly what you 
mean -- preferably in formal terms -- by "justification".  E.g., does 
"justification" mean that Arthur may only make assertions that are 
entailed by Owen's definition?  I already discussed that possibility below.

>
>> Note that G1, G2 and G3 are all individually consistent with Owen's
>> URI definition.  Furthermore, G1, G2 and G3 are all pair-wise
>> consistent: there exists at least one satisfying interpretation for
>> the merge of each pair.  But the merge of G1, G2 and G3 is not
>> consistent:
>
> This kind of behavior is of course quite typical in any assertional
> language.

Yes.
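
The pairwise-versus-merged behavior can be checked mechanically.  Below
is a minimal Python sketch, under stated assumptions: each graph is
reduced to a list of owl:sameAs / owl:differentFrom triples, every other
OWL construct is ignored, and the helper names (is_consistent,
pairwise_consistent, G3_ALT) are hypothetical.  One caveat: with the
three triples exactly as quoted above, the full merge still appears to
be satisfiable (let :x and :y denote one thing and :z another).  The
"pairwise consistent, jointly inconsistent" pattern shows up if the
third graph asserts ":y owl:sameAs :z" instead; that variant is an
assumption made here for illustration, not something stated in the
thread.

```python
from itertools import combinations

SAME = "owl:sameAs"
DIFF = "owl:differentFrom"


def is_consistent(triples):
    """True if the sameAs/differentFrom triples have a satisfying interpretation."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge every pair of names asserted to denote the same thing.
    for s, p, o in triples:
        if p == SAME:
            parent[find(s)] = find(o)

    # A differentFrom triple is violated if both names ended up merged.
    return all(find(s) != find(o) for s, p, o in triples if p == DIFF)


def pairwise_consistent(graphs):
    """True if the merge of every pair of graphs is consistent."""
    return all(is_consistent(a + b) for a, b in combinations(graphs, 2))


G1 = [(":x", SAME, ":y")]
G2 = [(":x", DIFF, ":z")]
G3 = [(":y", DIFF, ":z")]      # as quoted above
G3_ALT = [(":y", SAME, ":z")]  # assumed variant, for illustration only

if __name__ == "__main__":
    print("as quoted, full merge consistent:", is_consistent(G1 + G2 + G3))
    print("variant, pairwise consistent:", pairwise_consistent([G1, G2, G3_ALT]))
    print("variant, full merge consistent:", is_consistent(G1 + G2 + G3_ALT))
```

Union-find is enough here because, in this fragment, the only way to
reach a contradiction is for two names to be merged by sameAs and also
declared differentFrom each other.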

>
>> Arthur, Aster and Alfred made different assumptions about the set
>> of interpretations that would be applied to their graphs, and the
>> intersection of those sets was empty.
>>
>> Did Arthur misuse owl:sameAs?   What if Aster never published G2?
>> How could Aster's graph possibly affect the question of whether
>> *Arthur* misused owl:sameAs?  It would be nonsensical to assume
>> that it could.
>
> Why? Surely if Aster had a more reliable access to the primary source
> of information about these enigmatic thingies than Arthur did, then
> it might well be the case that Aster's publication could reveal
> errors in Arthur's, by contradicting him.

What do you mean by "more reliable"?  Both Arthur and Aster had access 
to the exact same URI definition from Owen.  Are you suggesting that 
Arthur and/or Aster should have used a *different* URI definition?  If 
so, what definition and why?

>
>> What if Owen later said that Arthur was correct, that :x == :y ?
>> What if he later said the opposite?  Again, it would seem rather
>> bizarre to say that the determination of whether Arthur had
>> misused owl:sameAs could be changed -- long after Arthur had
>> written G1 -- by Owen's later statements.
>
> Again, I don't find this bizarre in the least. It might be, if there
> was no truth of the matter concerning all this stuff, so that all
> these assertions were made independently with equal (or equal lack
> of) authority as to their actual truth. But that is so implausible
> and artificial an assumption that I don't see why we need to even
> discuss it.

The RDF Semantics is explicitly agnostic about interpretations and 
"actual truth".

Owen published a URI definition, and Arthur, Aster and Alfred published 
a bunch of assertions.  Whether anyone "believes" any of those 
assertions, whether those assertions have any bearing on the "real 
world", and whether they are at all useful to anyone's applications, are 
entirely different questions.  AFAICT those questions are irrelevant to 
the technical question of whether Arthur "misused" owl:sameAs.

>
>> One might claim that Arthur misused owl:sameAs because Owen had not
>> specified whether :x was the same or different from :y or :z, and
>> therefore Arthur had improperly *guessed* about the value of :x's
>> owl:sameAs property.
>>
>> But by that logic, Arthur would not be able to assert *anything*
>> new about :x.  I.e., Arthur would not be allowed to assert any
>> property whose value was not already entailed by Owen's
>> definition!
>
> Arthur may add information, of course. But Arthur is responsible for
> the truth of what he asserts, and part of that responsibility, in
> practice, is to take care to ascertain what the intended referents
> are of any URIs published by others, that Arthur then uses in his
> assertions.

But Arthur, Aster and Alfred were each fully diligent in ensuring that 
their assertions were consistent with all information that Owen 
provided.  What more could they do?

> For example, if I (as I recently did) wish to assert that
> something was red in color, I might use the URI
>
> http://linkedopencolors.moreways.net/color/rgb/ff0000.html
>
> rather than, say,
>
> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>
> because I know, using my color vision (not available to RDF engines)
> that the first one refers to red and the second one to green, which
> (I also know) is not red. I *could* use the second URI and insist
> that I intended it to denote the color red, but that would be stupid,
> since readers of my RDF will (and indeed should) misunderstand me. If
> I were to assert that
>
> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
> owl:sameAs http://linkedopencolors.moreways.net/color/css/red.html .
>
> then I would be saying something false. And yes, in that case, it
> *is* my error, even if what I have said is formally consistent (which
> it in fact is) with the published RDF "definition" of these URIs
> (which is in fact empty.)

In that example there were additional constraints that were not 
expressed formally -- such as the fact that red and green are different 
colors, and what wavelengths correspond to which colors, etc.  But 
unless you are claiming that assertions expressed in natural language 
can somehow avoid ambiguity where formal assertions cannot, then for the 
sake of analysis we can assume that all assertions have been expressed 
formally.

I am also assuming that in the vast majority of cases, a URI's resource
identity will be defined by a description, rather than by ostension
(see http://plato.stanford.edu/entries/identity/), so I am focusing on
that case.

>
>> And that would render RDF rather pointless.
>
> Why would it render it pointless? The point of RDF is not to make
> completely unjustified statements about nothing in particular.

RDF is designed to allow anyone to say anything about anything.  If 
someone chooses to make completely unjustified statements about nothing 
in particular, that is their business.  AFAICT that is completely 
irrelevant to the technical question of whether owl:sameAs was used 
incorrectly.

>
>> Maybe someone can see a way to avoid this dilemma.  Maybe someone
>> can figure out a way to distinguish between the "essential"
>> properties that serve to identify a resource, and other
>> "inessential" properties that the resource might have. If so, and
>> the number of "essential" properties is finite, then indeed this
>> problem could be avoided by requiring every URI owner to define all
>> of the "essential" properties of the URI's denoted resource, or by
>> prohibiting anyone but the URI owner from asserting any new
>> "essential" properties of the resource (beyond those the URI owner
>> had defined).  Or maybe there is another way around this dilemma.
>
> What do you see the "dilemma" here as being, exactly? It seems to me
> that this is not about RDF as such at all. It is about data, however
> that data is recorded. People can publish data about things. They do
> so by making assertions. In an ideal world, everyone is responsible
> for the assertions they make. Other people can put together
> information from various sources, but the reliability of the result
> is hostage to the reliability of all the sources that are used. All
> this is kind of obvious, but what else is being said in this thread?

The dilemma is that we would like each URI to always denote the same 
thing in all RDF datasets, so that when we merge RDF datasets, the merge 
will make sense: the merge will be consistent and an application that 
worked properly on an individual RDF dataset will also work properly on 
the merge of that dataset with other datasets.  But because URI 
definitions are inherently ambiguous, different RDF authors will 
interpret them differently, and this leads to inconsistencies when 
datasets are merged -- even when all parties have acted in good faith 
and have done all that they could reasonably have been expected to do to 
avoid such conflicts.

Key assumptions:

  1. Owen's URI definition will always be ambiguous.  There will always
exist a property p such that neither p nor its negation is entailed by
the URI definition.

  2. Owen cannot be expected to forever refine his URI definition by
adding disambiguation at the request of every RDF author who uses his
URIs.  Eventually Owen will say "that's all the disambiguation you
get".  (This is the point at which the example that I gave begins.)
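
Assumption 1 can be illustrated with a brute-force check.  This is only
a sketch under simplifying assumptions: it looks solely at which element
of a three-element toy domain each of :x, :y and :z denotes, it treats
every element as being in the extension of :Something (so Owen's
definition holds in every interpretation), and it is not an
implementation of the RDF or OWL semantics.  The names DOMAIN,
satisfies_owen and entailed are hypothetical.

```python
from itertools import product

NAMES = (":x", ":y", ":z")
DOMAIN = (0, 1, 2)  # assumed toy universe; three elements suffice for three names


def interpretations():
    """Yield every assignment of the three URIs to elements of the toy domain."""
    for values in product(DOMAIN, repeat=len(NAMES)):
        yield dict(zip(NAMES, values))


def satisfies_owen(interp):
    # Owen only says that each URI denotes a :Something.  Assuming every
    # domain element is a :Something, no interpretation is ruled out.
    return True


def entailed(statement):
    """A statement is entailed if it holds in every interpretation that satisfies Owen."""
    return all(statement(i) for i in interpretations() if satisfies_owen(i))


def x_same_y(interp):
    return interp[":x"] == interp[":y"]


def x_diff_y(interp):
    return interp[":x"] != interp[":y"]


if __name__ == "__main__":
    print("x sameAs y entailed:", entailed(x_same_y))
    print("x differentFrom y entailed:", entailed(x_diff_y))
```

In this toy setting neither statement is entailed, since some
interpretations make :x and :y denote the same element and others do
not.  That matches the situation Arthur faces when he decides whether to
assert owl:sameAs.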

>
>>
>> Unless some way around this dilemma is found, it seems unreasonably
>> judgemental to accuse Arthur of misusing owl:sameAs in this case,
>
> Possibly, yes, but not because...
>
>> since he didn't assert anything that was inconsistent with Owen's
>> URI definition
>
> Consistency is not the point. If I make completely unfounded
> assertions about a topic that you have introduced, then the fact they
> might be logically consistent with what you have said is neither here
> nor there. What matters is whether I have the authority to make the
> assertions I do, or whether I am lying, fabricating or simply
> fantasizing using Owen's vocabulary.

Can you translate that into more objective technical terms?  What 
exactly does "unfounded" mean?  And what do you mean by "authority"? 
What objective technical criteria are you suggesting?  And why is it 
relevant to the question of whether Arthur misused owl:sameAs, given 
that the RDF Semantics is explicitly agnostic about interpretations?

David Booth
Received on Monday, 8 April 2013 14:08:00 UTC