Re: owl:sameAs - Harmful to provenance? from Phillip Lord on 2013-04-08 (public-semweb-lifesci@w3.org from April 2013)

From: Phillip Lord <phillip.lord@newcastle.ac.uk>
Date: Mon, 08 Apr 2013 17:53:26 +0100
To: Oliver Ruebenacker <curoli@gmail.com>
Cc: David Booth <david@dbooth.org>, Pat Hayes <phayes@ihmc.us>, Peter Ansell <ansell.peter@gmail.com>, "Alan Ruttenberg" <alanruttenberg@gmail.com>, public-semweb-lifesci <public-semweb-lifesci@w3.org>
Message-ID: <871ualq7zt.fsf@newcastle.ac.uk>
And it is this bit -- "before we can do anything useful" -- that is
utterly wrong.

Recently I have spent a lot of time looking at Dublin Core creator
fields. You would not believe how many different ways they are used:
plain string literals ("Phillip Lord"), last-first ("Lord, Phillip"),
with abbreviations ("P. Lord"), multi-author ("Phillip Lord; Lindsay
Marshall"), with titles ("Dr Phillip Lord") and so on.
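
To make the variation concrete, here is a rough Python sketch of the kind
of heuristic clean-up a consumer of these fields might write. The function
name, the title list and the splitting rules are assumptions made up for
illustration; they are not part of Dublin Core or of any library.

```python
import re

# Hypothetical list of honorifics to strip; real data will contain more.
TITLES = re.compile(r"^(?:dr|prof|mr|mrs|ms)\.?\s+", re.IGNORECASE)


def normalise_creators(value):
    """Split a dc:creator string into a list of 'First Last' names.

    Heuristic only: assumes ';' separates authors and ',' marks
    'Last, First' order.
    """
    names = []
    for part in value.split(";"):
        name = part.strip()
        if not name:
            continue
        # Drop a leading title such as "Dr" or "Prof."
        name = TITLES.sub("", name)
        # Turn "Lord, Phillip" into "Phillip Lord"
        if "," in name:
            last, first = (piece.strip() for piece in name.split(",", 1))
            name = f"{first} {last}"
        names.append(name)
    return names


variants = [
    "Phillip Lord",
    "Lord, Phillip",
    "P. Lord",
    "Phillip Lord; Lindsay Marshall",
    "Dr Phillip Lord",
]
cleaned = [normalise_creators(v) for v in variants]
```

Note that "P. Lord" comes out unchanged: no string rule can tell you
whose initial that is, and that is the sort of thing that still needs
work after the data is brought together.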

So, is everyone using Dublin Core wrong? Is it useless until everyone
uses it the same way? Emphatically no, it is not useless.

Would it be better if everybody did use it the same way? Probably not.
Names are incredibly complex, and representing them is, in turn, hard.
Any specification which did full justice to all the different name forms
in existence would be incredibly long-winded. Many people using the
specification would get it wrong; or you could have a mechanism for
ensuring people always used it correctly. Then I am sure that both
people who ended up using this form of spec would have great fun
integrating their tiny datasets.

In the example, we have a number of sets of assertions which
individually fulfil their creators' use cases. Then, when they are
brought together, the assertions become inconsistent, telling you up
front that there is work to be done. And you ask in what way is this
useful?
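
As a minimal sketch of what "telling you up front" can look like, the
Python below checks a set of owl:sameAs / owl:differentFrom triples for
a clash using a union-find. The graphs H1, H2 and H3 and the names :a,
:b, :c are illustrative assumptions chosen so that every pair is
satisfiable while the full merge is not; they are not copied from the
quoted example. The check handles only these two properties and is in
no sense a general OWL reasoner.

```python
from itertools import combinations

SAME, DIFF = "owl:sameAs", "owl:differentFrom"


def is_consistent(triples):
    """True if the sameAs/differentFrom assertions can all hold at once."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # First pass: merge every pair linked by owl:sameAs.
    for s, p, o in triples:
        if p == SAME:
            parent[find(s)] = find(o)

    # Second pass: a differentFrom pair inside one class is a clash.
    return all(find(s) != find(o) for s, p, o in triples if p == DIFF)


# Hypothetical graphs from three independent authors.
graphs = {
    "H1": [(":a", SAME, ":b")],
    "H2": [(":b", SAME, ":c")],
    "H3": [(":a", DIFF, ":c")],
}

# Each pair of graphs merges without a clash ...
pairwise = {
    (g, h): is_consistent(graphs[g] + graphs[h])
    for g, h in combinations(sorted(graphs), 2)
}

# ... but the merge of all three does not.
merged_ok = is_consistent([t for g in graphs.values() for t in g])
```

Here every entry in `pairwise` is True and `merged_ok` is False. Nothing
has to be agreed in advance for each graph to be useful on its own; the
clash only appears, and only needs resolving, at the point of merging.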

Perfection is the enemy of Good.



Oliver Ruebenacker <curoli@gmail.com> writes:
>   So what most people here are saying is that before we can do anything
> useful, we need to make sure that if two assertions use the same reference,
> they mean the same thing.
>
>   To which you respond that you will accept assertions without assuming
> that same references mean same things. You will just keep them separate.
> There is no rule against that.
>
>   But in what way is this useful?
>
>      Take care
>      Oliver
>
> On Mon, Apr 8, 2013 at 10:07 AM, David Booth <david@dbooth.org> wrote:
>
>> Hi Pat,
>>
>>
>> On 04/04/2013 02:03 AM, Pat Hayes wrote:
>>
>>>
>>> On Apr 3, 2013, at 9:00 PM, Peter Ansell wrote:
>>>
>>>  On 4 April 2013 11:58, David Booth <david@dbooth.org> wrote: On
>>>> 04/02/2013 05:02 PM, Alan Ruttenberg wrote: On Tuesday, April 2,
>>>> 2013, David Booth wrote: On 03/27/2013 10:56 PM, Pat Hayes wrote:
>>>> On Mar 27, 2013, at 7:32 PM, Jim McCusker wrote:
>>>>
>>>> If only owl:sameAs were used correctly...
>>>>
>>>> Well, I agree that is a problem, but don't draw the conclusion
>>>> that there is something wrong with sameAs, just because people keep
>>>> using it wrong.
>>>>
>>>> Agreed.  And furthermore, don't draw the conclusion that someone
>>>> has used owl:sameAs wrong just because you get garbage when you
>>>> merge two graphs that individually worked just fine.  Those two
>>>> graphs may have been written assuming different sets of
>>>> interpretations.
>>>>
>>>> In that case I would certainly conclude that they have used it
>>>> wrong. Have you not been reading what Pat and I have been writing?
>>>>
>>>> I've read lots of what you and Pat have written.  And I've learned
>>>> a lot from it -- particularly in learning about ambiguity from Pat.
>>>> And I'm in full agreement that owl:sameAs is *often* misused.
>>>>
>>>> But I don't believe that getting garbage when merging two graphs
>>>> that individually worked fine *necessarily* indicates that
>>>> owl:sameAs was misused -- even when it appears on the surface to be
>>>> causing the problem.
>>>>
>>>
>>> I agree, but not with your example and your analysis of it.
>>>
>>>  Here's a simple example to illustrate.
>>>>
>>>> Using the following prefixes throughout, for brevity:
>>>>
>>>> @prefix :    <http://example/owen/> .
>>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>>>
>>>> Suppose that Owen is the URI owner of :x, :y and :z, and Owen
>>>> defines them as follows:
>>>>
>>>> # Owen's URI definition for :x, :y and :z
>>>> :x a :Something .
>>>> :y a :Something .
>>>> :z a :Something .
>>>>
>>>> That's all.  That's Owen's entire definition of those URIs.
>>>> Obviously this definition is "ambiguous" in some sense.  But as we
>>>> know, ambiguity is ultimately inescapable anyway, so I have merely
>>>> chosen an example that makes the ambiguity obvious. As the RDF
>>>> Semantics spec puts it: "It is usually impossible to assert enough
>>>> in any language to completely constrain the interpretations to a
>>>> single possible world".
>>>>
>>>
>>> Yes, but by making the ambiguity this "obvious", you have rendered
>>> the example pointless. There is *no* content here *at all*, so Owen
>>> has not really published anything. This is not typical of published
>>> content, even in RDF. Typically, in fact, there is, as well as some
>>> nontrivial actual RDF content, some kind of explanation, perhaps in
>>> natural language, of what the *intended* content of the formal RDF is
>>> supposed to be. While an RDF engine cannot of course make use of such
>>> intuitive explanations, other authors of RDF can, and should, make
>>> use of it to try to ensure that they do not make assertions which
>>> would be counter to the referential intentions of the original
>>> authors. For example, the Dublin Core URIs were published with almost
>>> no formal RDF axioms, but quite elaborate natural language glosses
>>> which enable them to be used in formal RDF with considerable success.
>>> The fact that formal (and even informal) data is inherently ambiguous
>>> does not mean that it is inherently, or even typically, vacuous.
>>>
>>
>> This seems to suggest that natural language can somehow eliminate
>> ambiguity, where formal languages cannot.  I don't buy that.  Presumably
>> whatever definition one expressed in natural language could be expressed in
>> a formal language -- in principle at least.  And certainly the goal of the
>> semantic web is to have such information expressed in a formal language
>> that is amenable to machine processing.
>>
>> More precisely, the basic assumption I am making is that for (almost) any
>> definition there exists a property such that neither that property nor its
>> negation are entailed by the definition.  I.e., there is always more that
>> can be said about the thing whose identity is defined.  Maybe that
>> assumption is wrong; I don't know.  If you think it's wrong, I'd be
>> interested in hearing why.
>>
>> The example may not be "realistic", but it is *not* pointless.  The whole
>> point of choosing such a simple example is to expose the fundamental issues
>> outright, rather than obscuring them in complexity that we cannot fully
>> understand.  If there is some fundamental reason why you think this problem
>> cannot happen in a more "realistic" example, then please explain what
>> mechanism would come into play to prevent it.
>>
>>
>>
>>>  Arthur, an RDF author, publishes the following graph, G1, making
>>>> certain assumptions about the interpretations that will be applied
>>>> to it:
>>>>
>>>> # G1
>>>> :x owl:sameAs :y .
>>>>
>>>
>>> On what basis does Arthur make this assertion? The URIs were coined
>>> by Owen, and Owen says nothing that would sanction this assumption.
>>>
>>
>> Why Arthur or anyone else chooses to assert whatever they choose to assert
>> is their business.  It is irrelevant to this analysis.
>>
>>
>>
>>>  Aster, another RDF author, publishes the following graph, G2,
>>>> making certain other assumptions about the interpretations that
>>>> will be applied to it:
>>>>
>>>> # G2
>>>> :x owl:differentFrom :z .
>>>>
>>>> Alfred, a third RDF author, publishes the following graph, G3,
>>>> making still other assumptions about the interpretations that will
>>>> be applied to it:
>>>>
>>>> # G3
>>>> :y owl:differentFrom :z .
>>>>
>>>
>>> Similarly for the other two. They are making assertions using names
>>> that belong to, and were coined by, another author without having any
>>> possible source of justification for these nontrivial claims. This
>>> should not be regarded as good practice, to put it mildly.
>>>
>>
>> Ditto.  If you are claiming that an RDF author needs some sort of
>> "justification" to make assertions, then please explain exactly what you
>> mean -- preferably in formal terms -- by "justification".  E.g., does
>> "justification" mean that Arthur may only make assertions that are entailed
>> by Owen's definition?  I already discussed that possibility below.
>>
>>
>>
>>>  Note that G1, G2 and G3 are all individually consistent with Owen's
>>>> URI definition.  Furthermore, G1, G2 and G3 are all pair-wise
>>>> consistent: there exists at least one satisfying interpretation for
>>>> the merge of each pair.  But the merge of G1, G2 and G3 is not
>>>> consistent:
>>>>
>>>
>>> This kind of behavior is of course quite typical in any assertional
>>> language.
>>>
>>
>> Yes.
>>
>>
>>
>>>  Arthur, Aster and Alfred made different assumptions about the set
>>>> of interpretations that would be applied to their graphs, and the
>>>> intersection of those sets was empty.
>>>>
>>>> Did Arthur misuse owl:sameAs?   What if Aster never published G2?
>>>> How could Aster's graph possibly affect the question of whether
>>>> *Arthur* misused owl:sameAs?  It would be nonsensical to assume
>>>> that it could.
>>>>
>>>
>>> Why? Surely if Aster had a more reliable access to the primary source
>>> of information about these enigmatic thingies than Arthur did, then
>>> it might well be the case that Aster's publication could reveal
>>> errors in Arthur's, by contradicting him.
>>>
>>
>> What do you mean by "more reliable"?  Both Arthur and Aster had access to
>> the exact same URI definition from Owen.  Are you suggesting that Arthur
>> and/or Aster should have used a *different* URI definition?  If so, what
>> definition and why?
>>
>>
>>>  What if Owen later said that Arthur was correct, that :x == :y ?
>>>> What if he later said the opposite?  Again, it would seem rather
>>>> bizarre to say that the determination of whether Arthur had
>>>> misused owl:sameAs could be changed -- long after Arthur had
>>>> written G1 -- by Owen's later statements.
>>>>
>>>
>>> Again, I don't find this bizarre in the least. It might be, if there
>>> was no truth of the matter concerning all this stuff, so that all
>>> these assertions were made independently with equal (or equal lack
>>> of) authority as to their actual truth. But that is so implausible
>>> and artificial an assumption that I don't see why we need to even
>>> discuss it.
>>>
>>
>> The RDF Semantics is explicitly agnostic about interpretations and "actual
>> truth".
>>
>> Owen published a URI definition, and Arthur, Aster and Alfred published a
>> bunch of assertions.  Whether anyone "believes" any of those assertions,
>> whether those assertions have any bearing on the "real world", and whether
>> they are at all useful to anyone's applications, are entirely different
>> questions.  AFAICT those questions are irrelevant to the technical question
>> of whether Arthur "misused" owl:sameAs.
>>
>>
>>
>>>  One might claim that Arthur misused owl:sameAs because Owen had not
>>>> specified whether :x was the same or different from :y or :z, and
>>>> therefore Arthur had improperly *guessed* about the value of :x's
>>>> owl:sameAs property.
>>>>
>>>> But by that logic, Arthur would not be able to assert *anything*
>>>> new about :x.  I.e., Arthur would not be allowed to assert any
>>>> property whose value was not already entailed by Owen's
>>>> definition!
>>>>
>>>
>>> Arthur may add information, of course. But Arthur is responsible for
>>> the truth of what he asserts, and part of that responsibility, in
>>> practice, is to take care to ascertain what the intended referents
>>> are of any URIs published by others, that Arthur then uses in his
>>> assertions.
>>>
>>
>> But Arthur, Aster and Alfred were each fully diligent in ensuring that
>> their assertions were consistent with all information that Owen provided.
>>  What more could they do?
>>
>>
>>  For example, if I (as I recently did) wish to assert that
>>> something was red in color, I might use the URI
>>>
>>> http://linkedopencolors.moreways.net/color/rgb/ff0000.html
>>>
>>> rather than, say,
>>>
>>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>>>
>>> because I know, using my color vision (not available to RDF engines)
>>> that the first one refers to red and the second one to green, which
>>> (I also know) is not red. I *could* use the second URI and insist
>>> that I intended it to denote the color red, but that would be stupid,
>>> since readers of my RDF will (and indeed should) misunderstand me. If
>>> I were to assert that
>>>
>>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>>> owl:sameAs
>>> http://linkedopencolors.moreways.net/color/css/red.html
>>> .
>>>
>>> then I would be saying something false. And yes, in that case, it
>>> *is* my error, even if what I have said is formally consistent (which
>>> it in fact is) with the published RDF "definition" of these URIs
>>> (which is in fact empty.)
>>>
>>
>> In that example there were additional constraints that were not expressed
>> formally -- such as the fact that red and green are different colors, and
>> what wavelengths correspond to which colors, etc.  But unless you are
>> claiming that assertions expressed in natural language can somehow avoid
>> ambiguity where formal assertions cannot, then for the sake of analysis we
>> can assume that all assertions have been expressed formally.
>>
>> I am also assuming that in the vast majority of cases, a URI's resource
>> identity will be defined by a description, rather than by ostension
>> http://plato.stanford.edu/entries/identity/
>> so I am focusing on that case.
>>
>>
>>
>>>  And that would render RDF rather pointless.
>>>>
>>>
>>> Why would it render it pointless? The point of RDF is not to make
>>> completely unjustified statements about nothing in particular.
>>>
>>
>> RDF is designed to allow anyone to say anything about anything.  If
>> someone chooses to make completely unjustified statements about nothing in
>> particular, that is their business.  AFAICT that is completely irrelevant
>> to the technical question of whether owl:sameAs was used incorrectly.
>>
>>
>>
>>>  Maybe someone can see a way to avoid this dilemma.  Maybe someone
>>>> can figure out a way to distinguish between the "essential"
>>>> properties that serve to identify a resource, and other
>>>> "inessential" properties that the resource might have. If so, and
>>>> the number of "essential" properties is finite, then indeed this
>>>> problem could be avoided by requiring every URI owner to define all
>>>> of the "essential" properties of the URI's denoted resource, or by
>>>> prohibiting anyone but the URI owner from asserting any new
>>>> "essential" properties of the resource (beyond those the URI owner
>>>> had defined).  Or maybe there is another way around this dilemma.
>>>>
>>>
>>> What do you see the "dilemma" here as being, exactly? It seems to me
>>> that this is not about RDF as such at all. It is about data, however
>>> that data is recorded. People can publish data about things. They do
>>> so by making assertions. In an ideal world, everyone is responsible
>>> for the assertions they make. Other people can put together
>>> information from various sources, but the reliability of the result
>>> is hostage to the reliability of all the sources that are used. All
>>> this is kind of obvious, but what else is being said in this thread?
>>>
>>
>> The dilemma is that we would like each URI to always denote the same thing
>> in all RDF datasets, so that when we merge RDF datasets, the merge will
>> make sense: the merge will be consistent and an application that worked
>> properly on an individual RDF dataset will also work properly on the merge
>> of that dataset with other datasets.  But because URI definitions are
>> inherently ambiguous, different RDF authors will interpret them
>> differently, and this leads to inconsistencies when datasets are merged --
>> even when all parties have acted in good faith and have done all that they
>> could reasonably have been expected to do to avoid such conflicts.
>>
>> Key assumptions:
>>
>>  1. Owen's URI definition will always be ambiguous.  There will always
>> exist a property p such that neither p nor its negation are entailed by the
>> URI definition.
>>
>>  2. Owen cannot be expected to forever refine his URI definition by adding
>> disambiguation at the request of every RDF author who uses his URIs.  At
>> some point, Owen will reach the point of saying "that's all the
>> disambiguation you get".  (This is the point at which the example that I
>> gave begins.)
>>
>>
>>
>>>
>>>> Unless some way around this dilemma is found, it seems unreasonably
>>>> judgemental to accuse Arthur of misusing owl:sameAs in this case,
>>>>
>>>
>>> Possibly, yes, but not because...
>>>
>>>  since he didn't assert anything that was inconsistent with Owen's
>>>> URI definition
>>>>
>>>
>>> Consistency is not the point. If I make completely unfounded
>>> assertions about a topic that you have introduced, then the fact they
>>> might be logically consistent with what you have said is neither here
>>> nor there. What matters is whether I have the authority to make the
>>> assertions I do, or whether I am lying, fabricating or simply
>>> fantasizing using Owen's vocabulary.
>>>
>>
>> Can you translate that into more objective technical terms?  What exactly
>> does "unfounded" mean?  And what do you mean by "authority"? What objective
>> technical criteria are you suggesting?  And why is it relevant to the
>> question of whether Arthur misused owl:sameAs, given that the RDF Semantics
>> is explicitly agnostic about interpretations?
>>
>> David Booth
>>
>>

-- 
Phillip Lord,                           Phone: +44 (0) 191 222 7827
Lecturer in Bioinformatics,             Email: phillip.lord@newcastle.ac.uk
School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
Room 914 Claremont Tower,               skype: russet_apples
Newcastle University,                   twitter: phillord
NE1 7RU
Received on Monday, 8 April 2013 16:54:01 UTC