Re: owl:sameAs - Harmful to provenance? from Phillip Lord on 2013-04-09 (public-semweb-lifesci@w3.org from April 2013)

From: Phillip Lord <phillip.lord@newcastle.ac.uk>
Date: Tue, 09 Apr 2013 16:31:42 +0100
To: Alan Ruttenberg <alanruttenberg@gmail.com>
Cc: "Bhat\, Talapady N." <talapady.bhat@nist.gov>, Oliver Ruebenacker <curoli@gmail.com>, David Booth <david@dbooth.org>, Pat Hayes <phayes@ihmc.us>, Peter Ansell <ansell.peter@gmail.com>, public-semweb-lifesci <public-semweb-lifesci@w3.org>
Message-ID: <87ppy3lnz5.fsf@newcastle.ac.uk>
Compare all you like. RDF is just another technology; it's not going to
let me do anything that I cannot do in another way. I'm interested in
using it because it is there, not for any other reason. 

The surface syntax problem; yeah, it is and remains a pain, more so in
some areas than others. 

Phil


Alan Ruttenberg <alanruttenberg@gmail.com> writes:
> Thinking about "metadata" as some other category of data is usually a bad
> sign. I've often found it to mean, in practice, "data I care less about".
>
> Phil, to make the case that RDF helps here, we would want to compare how
> easy it is to do significant work using the ill-represented examples you
> find versus raw text, versus xml, versus tab-delimited files. While there
> is some limited benefit to getting rid of the surface syntax problem, it's
> not clear how much of a problem that ever was.
>
> -Alan
>
>
> On Mon, Apr 8, 2013 at 1:16 PM, Bhat, Talapady N. <talapady.bhat@nist.gov> wrote:
>
>> Hi,
>> -----
>>
>> Introduction -Dublin Core:
>> The Dublin Core Metadata Element Set is a vocabulary of fifteen properties
>> for use in resource description. The name "Dublin" is due to its origin at
>> a 1995 invitational workshop in Dublin, Ohio; "core" because its elements
>> are broad and generic, usable for describing a wide range of resources.
>>
>> The fifteen element "Dublin Core" described in this standard is part of a
>> larger set of metadata vocabularies
>> --------------------------------------
>> As per the introduction (given above) of the Dublin Core specification
>> (http://dublincore.org/documents/dces/), its focus is primarily metadata,
>> whereas the actual author names mentioned below probably need to be considered
>> 'data'. I do not think Dublin Core has really focused on building a
>> standard re-usable vocabulary for 'data'. That is the real problem. That is
>> why we have been focusing on re-usable terms for 'data':
>>
>> http://www.biomedcentral.com/1471-2105/12/487  and
>> http://xpdb.nist.gov/chemblast/pdb.pl  and
>> http://www.nature.com/nmeth/journal/v9/n7/abs/nmeth.2084.html
>>
>> T N Bhat
>>
>> -----Original Message-----
>> From: Phillip Lord [mailto:phillip.lord@newcastle.ac.uk]
>> Sent: Monday, April 08, 2013 12:53 PM
>> To: Oliver Ruebenacker
>> Cc: David Booth; Pat Hayes; Peter Ansell; Alan Ruttenberg;
>> public-semweb-lifesci
>> Subject: Re: owl:sameAs - Harmful to provenance?
>>
>>
>> And it is this bit -- "before we can do anything useful" that is utterly
>> wrong.
>>
>> Recently I have spent a lot of time looking at Dublin Core creator fields.
>> You could not believe how many different ways they are used. String
>> literals ("Phillip Lord"), last-first ("Lord, Phillip"), with abbrevs ("P.
>> Lord"), multi-author ("Phillip Lord; Lindsay Marshall"), with titles ("Dr
>> Phillip Lord") and so on.
>>
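The creator variants listed above can still be processed with modest effort. What follows is only a minimal sketch in Python, assuming a hypothetical helper `split_creators` and a hypothetical, deliberately tiny `TITLES` set; neither comes from Dublin Core or from this thread, and real names will defeat these heuristics in many cases.

```python
# Hypothetical list of honorifics to drop; far from complete.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}


def split_creators(value):
    """Guess (given, family) pairs from one dc:creator string literal."""
    people = []
    # Multi-author literals such as "A B; C D" are split on semicolons.
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "," in chunk:
            # "Lord, Phillip" style: family name comes first.
            family, given = [part.strip() for part in chunk.split(",", 1)]
        else:
            # "Dr Phillip Lord" style: drop titles, last word is the family name.
            words = [w for w in chunk.split()
                     if w.rstrip(".").lower() not in TITLES]
            if not words:
                continue
            given, family = " ".join(words[:-1]), words[-1]
        people.append((given, family))
    return people


print(split_creators("Phillip Lord; Lindsay Marshall"))
print(split_creators("Lord, Phillip"))
print(split_creators("Dr Phillip Lord"))
```

The sketch handles the five forms quoted above and nothing more, which is the point: each dataset stays useful as published, and the integration work is done where the data are brought together.
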
>> So, is everyone using Dublin Core wrong? It is useless till everyone uses
>> it the same way? Emphatically no, it is not useless.
>>
>> Would it be better if everybody did use it the same way? The answer is
>> probably not. Names are incredibly complex, and representing them is, in
>> turn, difficult. Any specification which did full justice to all
>> the different name forms in existence would be incredibly long-winded. Many
>> people using the specification would get it wrong; or you could have a
>> mechanism for ensuring people always used it correctly.
>> Then I am sure that both people who ended up using this form of spec would
>> have great fun integrating their tiny datasets.
>>
>> In the example, we have a number of sets of assertions which individually
>> fulfil their creators' use-cases. Then, when they are brought together, the
>> assertions become inconsistent, telling you up front that there is work to
>> be done. And you ask in what way is this useful?
>>
>> Perfection is the enemy of Good.
>>
>>
>>
>> Oliver Ruebenacker <curoli@gmail.com> writes:
>> >   So what most people here are saying is that before we can do
>> > anything useful, we need to make sure that if two assertions use the
>> > same reference, they mean the same thing.
>> >
>> >   To which you respond that you will accept assertions without
>> > assuming that same references mean same things. You will just keep them
>> > separate.
>> > There is no rule against that.
>> >
>> >   But in what way is this useful?
>> >
>> >      Take care
>> >      Oliver
>> >
>> > On Mon, Apr 8, 2013 at 10:07 AM, David Booth <david@dbooth.org> wrote:
>> >
>> >> Hi Pat,
>> >>
>> >>
>> >> On 04/04/2013 02:03 AM, Pat Hayes wrote:
>> >>
>> >>>
>> >>> On Apr 3, 2013, at 9:00 PM, Peter Ansell wrote:
>> >>>
>> >>>  On 4 April 2013 11:58, David Booth <david@dbooth.org> wrote: On
>> >>>> 04/02/2013 05:02 PM, Alan Ruttenberg wrote: On Tuesday, April 2,
>> >>>> 2013, David Booth wrote: On 03/27/2013 10:56 PM, Pat Hayes wrote:
>> >>>> On Mar 27, 2013, at 7:32 PM, Jim McCusker wrote:
>> >>>>
>> >>>> If only owl:sameAs were used correctly...
>> >>>>
>> >>>> Well, I agree that is a problem, but don't draw the conclusion that
>> >>>> there is something wrong with sameAs, just because people keep
>> >>>> using it wrong.
>> >>>>
>> >>>> Agreed.  And furthermore, don't draw the conclusion that someone
>> >>>> has used owl:sameAs wrong just because you get garbage when you
>> >>>> merge two graphs that individually worked just fine.  Those two
>> >>>> graphs may have been written assuming different sets of
>> >>>> interpretations.
>> >>>>
>> >>>> In that case I would certainly conclude that they have used it
>> >>>> wrong. Have you not been reading what Pat and I have been writing?
>> >>>>
>> >>>> I've read lots of what you and Pat have written.  And I've learned
>> >>>> a lot from it -- particularly in learning about ambiguity from Pat.
>> >>>> And I'm in full agreement that owl:sameAs is *often* misused.
>> >>>>
>> >>>> But I don't believe that getting garbage when merging two graphs
>> >>>> that individually worked fine *necessarily* indicates that
>> >>>> owl:sameAs was misused -- even when it appears on the surface to be
>> >>>> causing the problem.
>> >>>>
>> >>>
>> >>> I agree, but not with your example and your analysis of it.
>> >>>
>> >>>  Here's a simple example to illustrate.
>> >>>>
>> >>>> Using the following prefixes throughout, for brevity:
>> >>>>
>> >>>> @prefix :    <http://example/owen/> .
>> >>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>> >>>>
>> >>>> Suppose that Owen is the URI owner of :x, :y and :z, and Owen
>> >>>> defines them as follows:
>> >>>>
>> >>>> # Owen's URI definition for :x, :y and :z
>> >>>> :x a :Something .
>> >>>> :y a :Something .
>> >>>> :z a :Something .
>> >>>>
>> >>>> That's all.  That's Owen's entire definition of those URIs.
>> >>>> Obviously this definition is "ambiguous" in some sense.  But as we
>> >>>> know, ambiguity is ultimately inescapable anyway, so I have merely
>> >>>> chosen an example that makes the ambiguity obvious. As the RDF
>> >>>> Semantics spec puts it: "It is usually impossible to assert enough
>> >>>> in any language to completely constrain the interpretations to a
>> >>>> single possible world".
>> >>>>
>> >>>
>> >>> Yes, but by making the ambiguity this "obvious", you have rendered
>> >>> the example pointless. There is *no* content here *at all*, so Owen
>> >>> has not really published anything. This is not typical of published
>> >>> content, even in RDF. Typically, in fact, there is, as well as some
>> >>> nontrivial actual RDF content, some kind of explanation, perhaps in
>> >>> natural language, of what the *intended* content of the formal RDF
>> >>> is supposed to be. While an RDF engine cannot of course make use of
>> >>> such intuitive explanations, other authors of RDF can, and should,
>> >>> make use of it to try to ensure that they do not make assertions
>> >>> which would be counter to the referential intentions of the original
>> >>> authors. For example, the Dublin Core URIs were published with
>> >>> almost no formal RDF axioms, but quite elaborate natural language
>> >>> glosses which enable them to be used in formal RDF with considerable
>> >>> success.
>> >>> The fact that formal (and even informal) data is inherently
>> >>> ambiguous does not mean that it is inherently, or even typically,
>> >>> vacuous.
>> >>>
>> >>
>> >> This seems to suggest that natural language can somehow eliminate
>> >> ambiguity, where formal languages cannot.  I don't buy that.
>> >> Presumably whatever definition one expressed in natural language
>> >> could be expressed in a formal language -- in principle at least.
>> >> And certainly the goal of the semantic web is to have such
>> >> information expressed in a formal language that is amenable to machine
>> >> processing.
>> >>
>> >> More precisely, the basic assumption I am making is that for (almost)
>> >> any definition there exists a property such that neither that
>> >> property nor its negation are entailed by the definition.  I.e.,
>> >> there is always more that can be said about the thing whose identity
>> >> is defined.  Maybe that assumption is wrong; I don't know.  If you
>> >> think it's wrong, I'd be interested in hearing why.
>> >>
>> >> The example may not be "realistic", but it is *not* pointless.  The
>> >> whole point of choosing such a simple example is to expose the
>> >> fundamental issues outright, rather than obscuring them in complexity
>> >> that we cannot fully understand.  If there is some fundamental reason
>> >> why you think this problem cannot happen in a more "realistic"
>> >> example, then please explain what mechanism would come into play to
>> >> prevent it.
>> >>
>> >>
>> >>
>> >>>  Arthur, an RDF author, publishes the following graph, G1, making
>> >>>> certain assumptions about the interpretations that will be applied
>> >>>> to it:
>> >>>>
>> >>>> # G1
>> >>>> :x owl:sameAs :y .
>> >>>>
>> >>>
>> >>> On what basis does Arthur make this assertion? The URIs were coined
>> >>> by Owen, and Owen says nothing that would sanction this assumption.
>> >>>
>> >>
>> >> Why Arthur or anyone else chooses to assert whatever they choose to
>> >> assert is their business.  It is irrelevant to this analysis.
>> >>
>> >>
>> >>
>> >>>  Aster, another RDF author, publishes the following graph, G2,
>> >>>> making certain other assumptions about the interpretations that
>> >>>> will be applied to it:
>> >>>>
>> >>>> # G2
>> >>>> :x owl:differentFrom :z .
>> >>>>
>> >>>> Alfred, a third RDF author, publishes the following graph, G3,
>> >>>> making still other assumptions about the interpretations that will
>> >>>> be applied to it:
>> >>>>
>> >>>> # G3
>> >>>> :y owl:differentFrom :z .
>> >>>>
>> >>>
>> >>> Similarly for the other two. They are making assertions using names
>> >>> that belong to, and were coined by, another author without having
>> >>> any possible source of justification for these nontrivial claims.
>> >>> This should not be regarded as good practice, to put it mildly.
>> >>>
>> >>
>> >> Ditto.  If you are claiming that an RDF author needs some sort of
>> >> "justification" to make assertions, then please explain exactly what
>> >> you mean -- preferably in formal terms -- by "justification".  E.g.,
>> >> does "justification" mean that Arthur may only make assertions that
>> >> are entailed by Owen's definition?  I already discussed that
>> >> possibility below.
>> >>
>> >>
>> >>
>> >>>  Note that G1, G2 and G3 are all individually consistent with Owen's
>> >>>> URI definition.  Furthermore, G1, G2 and G3 are all pair-wise
>> >>>> consistent: there exists at least one satisfying interpretation for
>> >>>> the merge of each pair.  But the merge of G1, G2 and G3 is not
>> >>>> consistent:
>> >>>>
>> >>>
>> >>> This kind of behavior is of course quite typical in any assertional
>> >>> language.
>> >>>
>> >>
>> >> Yes.
>> >>
>> >>
>> >>
>> >>>  Arthur, Aster and Alfred made different assumptions about the set
>> >>>> of interpretations that would be applied to their graphs, and the
>> >>>> intersection of those sets was empty.
>> >>>>
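The general pattern of graphs that are pairwise consistent yet jointly inconsistent can be checked mechanically. The following is a minimal sketch in Python, not a full OWL reasoner: it treats `owl:sameAs` as an equivalence relation (via union-find) and reports an inconsistency when an `owl:differentFrom` pair ends up in the same equivalence class. The function `is_consistent` and the graphs `H1`, `H2`, `H3` over `:a`, `:b`, `:c` are hypothetical, chosen only to exhibit the pattern; they are not the graphs from this thread.

```python
def is_consistent(assertions):
    """Return True if the sameAs/differentFrom triples can all hold at once."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge every pair asserted to be owl:sameAs.
    for subj, pred, obj in assertions:
        if pred == "owl:sameAs":
            parent[find(subj)] = find(obj)

    # An owl:differentFrom pair inside one equivalence class is a contradiction.
    return all(
        find(subj) != find(obj)
        for subj, pred, obj in assertions
        if pred == "owl:differentFrom"
    )


# Hypothetical graphs, each published by a different author.
H1 = [(":a", "owl:sameAs", ":b")]
H2 = [(":b", "owl:sameAs", ":c")]
H3 = [(":a", "owl:differentFrom", ":c")]

print(is_consistent(H1 + H2))       # True
print(is_consistent(H1 + H3))       # True
print(is_consistent(H2 + H3))       # True
print(is_consistent(H1 + H2 + H3))  # False
```

No single graph, and no pair of graphs, is at fault on its own; the contradiction only appears in the three-way merge.
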
>> >>>> Did Arthur misuse owl:sameAs?   What if Aster never published G2?
>> >>>> How could Aster's graph possibly affect the question of whether
>> >>>> *Arthur* misused owl:sameAs?  It would be nonsensical to assume
>> >>>> that it could.
>> >>>>
>> >>>
>> >>> Why? Surely if Aster had a more reliable access to the primary
>> >>> source of information about these enigmatic thingies than Arthur
>> >>> did, then it might well be the case that Aster's publication could
>> >>> reveal errors in Arthur's, by contradicting him.
>> >>>
>> >>
>> >> What do you mean by "more reliable"?  Both Arthur and Aster had
>> >> access to the exact same URI definition from Owen.  Are you
>> >> suggesting that Arthur and/or Aster should have used a *different*
>> >> URI definition?  If so, what definition and why?
>> >>
>> >>
>> >>>  What if Owen later said that Arthur was correct, that :x == :y ?
>> >>>> What if he later said the opposite?  Again, it would seem rather
>> >>>> bizarre to say that the determination of whether Arthur had misused
>> >>>> owl:sameAs could be changed -- long after Arthur had written G1 --
>> >>>> by Owen's later statements.
>> >>>>
>> >>>
>> >>> Again, I don't find this bizarre in the least. It might be, if there
>> >>> was no truth of the matter concerning all this stuff, so that all
>> >>> these assertions were made independently with equal (or equal lack
>> >>> of) authority as to their actual truth. But that is so implausible
>> >>> and artificial an assumption that I don't see why we need to even
>> >>> discuss it.
>> >>>
>> >>
>> >> The RDF Semantics is explicitly agnostic about interpretations and
>> >> "actual truth".
>> >>
>> >> Owen published a URI definition, and Arthur, Aster and Alfred
>> >> published a bunch of assertions.  Whether anyone "believes" any of
>> >> those assertions, whether those assertions have any bearing on the
>> >> "real world", and whether they are at all useful to anyone's
>> >> applications, are entirely different questions.  AFAICT those
>> >> questions are irrelevant to the technical question of whether Arthur
>> >> "misused" owl:sameAs.
>> >>
>> >>
>> >>
>> >>>  One might claim that Arthur misused owl:sameAs because Owen had not
>> >>>> specified whether :x was the same or different from :y or :z, and
>> >>>> therefore Arthur had improperly *guessed* about the value of :x's
>> >>>> owl:sameAs property.
>> >>>>
>> >>>> But by that logic, Arthur would not be able to assert *anything*
>> >>>> new about :x.  I.e., Arthur would not be allowed to assert any
>> >>>> property whose value was not already entailed by Owen's definition!
>> >>>>
>> >>>
>> >>> Arthur may add information, of course. But Arthur is responsible for
>> >>> the truth of what he asserts, and part of that responsibility, in
>> >>> practice, is to take care to ascertain what the intended referents
>> >>> are of any URIs published by others, that Arthur then uses in his
>> >>> assertions.
>> >>>
>> >>
>> >> But Arthur, Aster and Alfred were each fully diligent in ensuring
>> >> that their assertions were consistent with all information that Owen
>> >> provided.
>> >>  What more could they do?
>> >>
>> >>
>> >>> For example, if I (as I recently did) wish to assert that
>> >>> something was red in color, I might use the URI
>> >>>
>> >>> http://linkedopencolors.moreways.net/color/rgb/ff0000.html
>> >>>
>> >>> rather than, say,
>> >>>
>> >>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>> >>>
>> >>> because I know, using my color vision (not available to RDF engines)
>> >>> that the first one refers to red and the second one to green, which
>> >>> (I also know) is not red. I *could* use the second URI and insist
>> >>> that I intended it to denote the color red, but that would be
>> >>> stupid, since readers of my RDF will (and indeed should)
>> >>> misunderstand me. If I were to assert that
>> >>>
>> >>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
>> >>> owl:sameAs
>> >>> http://linkedopencolors.moreways.net/color/css/red.html
>> >>> .
>> >>>
>> >>> then I would be saying something false. And yes, in that case, it
>> >>> *is* my error, even if what I have said is formally consistent
>> >>> (which it in fact is) with the published RDF "definition" of these
>> >>> URIs (which is in fact empty.)
>> >>>
>> >>
>> >> In that example there were additional constraints that were not
>> >> expressed formally -- such as the fact that red and green are
>> >> different colors, and what wavelengths correspond to which colors,
>> >> etc.  But unless you are claiming that assertions expressed in
>> >> natural language can somehow avoid ambiguity where formal assertions
>> >> cannot, then for the sake of analysis we can assume that all assertions
>> >> have been expressed formally.
>> >>
>> >> I am also assuming that in the vast majority of cases, a URI's
>> >> resource identity will be defined by a description, rather than by
>> >> ostension
>> >> http://plato.stanford.edu/entries/identity/
>> >> so I am focusing on that case.
>> >>
>> >>
>> >>
>> >>>  And that would render RDF rather pointless.
>> >>>>
>> >>>
>> >>> Why would it render it pointless? The point of RDF is not to make
>> >>> completely unjustified statements about nothing in particular.
>> >>>
>> >>
>> >> RDF is designed to allow anyone to say anything about anything.  If
>> >> someone chooses to make completely unjustified statements about
>> >> nothing in particular, that is their business.  AFAICT that is
>> >> completely irrelevant to the technical question of whether owl:sameAs
>> >> was used incorrectly.
>> >>
>> >>
>> >>
>> >>>  Maybe someone can see a way to avoid this dilemma.  Maybe someone
>> >>>> can figure out a way to distinguish between the "essential"
>> >>>> properties that serve to identify a resource, and other
>> >>>> "inessential" properties that the resource might have. If so, and
>> >>>> the number of "essential" properties is finite, then indeed this
>> >>>> problem could be avoided by requiring every URI owner to define all
>> >>>> of the "essential" properties of the URI's denoted resource, or by
>> >>>> prohibiting anyone but the URI owner from asserting any new
>> >>>> "essential" properties of the resource (beyond those the URI owner
>> >>>> had defined).  Or maybe there is another way around this dilemma.
>> >>>>
>> >>>
>> >>> What do you see the "dilemma" here as being, exactly? It seems to me
>> >>> that this is not about RDF as such at all. It is about data, however
>> >>> that data is recorded. People can publish data about things. They do
>> >>> so by making assertions. In an ideal world, everyone is responsible
>> >>> for the assertions they make. Other people can put together
>> >>> information from various sources, but the reliability of the result
>> >>> is hostage to the reliability of all the sources that are used. All
>> >>> this is kind of obvious, but what else is being said in this thread?
>> >>>
>> >>
>> >> The dilemma is that we would like each URI to always denote the same
>> >> thing in all RDF datasets, so that when we merge RDF datasets, the
>> >> merge will make sense: the merge will be consistent and an
>> >> application that worked properly on an individual RDF dataset will
>> >> also work properly on the merge of that dataset with other datasets.
>> >> But because URI definitions are inherently ambiguous, different RDF
>> >> authors will interpret them differently, and this leads to
>> >> inconsistencies when datasets are merged -- even when all parties
>> >> have acted in good faith and have done all that they could reasonably
>> >> have been expected to do to avoid such conflicts.
>> >>
>> >> Key assumptions:
>> >>
>> >>  1. Owen's URI definition will always be ambiguous.  There will
>> >> always exist a property p such that neither p nor its negation are
>> >> entailed by the URI definition.
>> >>
>> >>  2. Owen cannot be expected to forever refine his URI definition by
>> >> adding disambiguation at the request of every RDF author who uses his
>> >> URIs.  At some point, Owen will reach the point of saying "that's all
>> >> the disambiguation you get".  (This is the point at which the example
>> >> that I gave begins.)
>> >>
>> >>
>> >>
>> >>>
>> >>>> Unless some way around this dilemma is found, it seems unreasonably
>> >>>> judgemental to accuse Arthur of misusing owl:sameAs in this case,
>> >>>>
>> >>>
>> >>> Possibly, yes, but not because...
>> >>>
>> >>>  since he didn't assert anything that was inconsistent with Owen's
>> >>>> URI definition
>> >>>>
>> >>>
>> >>> Consistency is not the point. If I make completely unfounded
>> >>> assertions about a topic that you have introduced, then the fact
>> >>> they might be logically consistent with what you have said is
>> >>> neither here nor there. What matters is whether I have the authority
>> >>> to make the assertions I do, or whether I am lying, fabricating or
>> >>> simply fantasizing using Owen's vocabulary.
>> >>>
>> >>
>> >> Can you translate that into more objective technical terms?  What
>> >> exactly does "unfounded" mean?  And what do you mean by "authority"?
>> >> What objective technical criteria are you suggesting?  And why is it
>> >> relevant to the question of whether Arthur misused owl:sameAs, given
>> >> that the RDF Semantics is explicitly agnostic about interpretations?
>> >>
>> >> David Booth
>> >>
>> >>
>>
>> --
>> Phillip Lord,                           Phone: +44 (0) 191 222 7827
>> Lecturer in Bioinformatics,             Email:
>> phillip.lord@newcastle.ac.uk
>> School of Computing Science,
>> http://homepages.cs.ncl.ac.uk/phillip.lord
>> Room 914 Claremont Tower,               skype: russet_apples
>> Newcastle University,                   twitter: phillord
>> NE1 7RU
>>
>>

-- 
Phillip Lord,                           Phone: +44 (0) 191 222 7827
Lecturer in Bioinformatics,             Email: phillip.lord@newcastle.ac.uk
School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
Room 914 Claremont Tower,               skype: russet_apples
Newcastle University,                   twitter: phillord
NE1 7RU
Received on Tuesday, 9 April 2013 15:32:13 UTC