Re: owl:sameAs - Harmful to provenance? from Alan Ruttenberg on 2013-04-08 (public-semweb-lifesci@w3.org from April 2013)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Mon, 8 Apr 2013 14:25:20 -0400
To: "Bhat, Talapady N." <talapady.bhat@nist.gov>
Cc: Phillip Lord <phillip.lord@newcastle.ac.uk>, Oliver Ruebenacker <curoli@gmail.com>, David Booth <david@dbooth.org>, Pat Hayes <phayes@ihmc.us>, Peter Ansell <ansell.peter@gmail.com>, public-semweb-lifesci <public-semweb-lifesci@w3.org>
Message-ID: <CAFKQJ8m5Mwa00FDZYoF8-OqyGbjp+nLxhdHWKeX2dX_-d+Jo0w@mail.gmail.com>
On Mon, Apr 8, 2013 at 2:20 PM, Bhat, Talapady N. <talapady.bhat@nist.gov> wrote:

> Alan,
>
> Thanks for your comments.
>
> I want to mention a quote from an RDA meeting in Gothenburg two weeks ago.
>
> Peter Fox (Director Information Technology, Rensselaer Polytechnic
> Institute): The term ‘metadata’ is used when one does not know what one is
> talking about. The term ‘metadata’ is a thing of the past. Instead, now,
> one needs to talk about data and documenting data using use-cases with
> re-usable terms.
>
> You are right. RDF instead focuses on data and on linking the data you
> care about. This may not be perfect, as you said. But one needs to remember:
> what were the other possibilities? Free text with zero hierarchies or zero
> relationships, or chaotic ad hoc XML?
>

If the only alternative is chaotic ad hoc RDF, then I don't see the point.
Time is better spent on addressing unsolved problems.

-Alan


> T N Bhat
>
> From: Alan Ruttenberg [mailto:alanruttenberg@gmail.com]
> Sent: Monday, April 08, 2013 1:24 PM
> To: Bhat, Talapady N.
> Cc: Phillip Lord; Oliver Ruebenacker; David Booth; Pat Hayes; Peter
> Ansell; public-semweb-lifesci
> Subject: Re: owl:sameAs - Harmful to provenance?
>
> Nicely pointed out, TN.
>
> Thinking about "metadata" as some other category of data is usually a bad
> sign. I've often found it to mean, in practice, "data I care less about".
>
> Phil, to make the case that RDF helps here, we would want to compare how
> easy it is to do significant work using the ill-represented examples you
> find versus raw text, versus XML, versus tab-delimited files. While there
> is some limited benefit to getting rid of the surface syntax problem, it's
> not clear how much of a problem that ever was.
>
> -Alan
>
> On Mon, Apr 8, 2013 at 1:16 PM, Bhat, Talapady N. <talapady.bhat@nist.gov>
> wrote:
>
> Hi,
> -----
>
> Introduction -Dublin Core:
> The Dublin Core Metadata Element Set is a vocabulary of fifteen properties
> for use in resource description. The name "Dublin" is due to its origin at
> a 1995 invitational workshop in Dublin, Ohio; "core" because its elements
> are broad and generic, usable for describing a wide range of resources.
>
> The fifteen element "Dublin Core" described in this standard is part of a
> larger set of metadata vocabularies
> --------------------------------------
> As per the introduction section of Dublin Core (given above;
> http://dublincore.org/documents/dces/), its focus is primarily metadata,
> whereas the actual author names mentioned below probably need to be
> considered 'data'. I do not think Dublin Core has really focused on building
> a standard re-usable vocabulary for 'data'. That is the real problem. That is
> why we have been focusing on re-usable terms for 'data':
>
> http://www.biomedcentral.com/1471-2105/12/487  and
> http://xpdb.nist.gov/chemblast/pdb.pl  and
> http://www.nature.com/nmeth/journal/v9/n7/abs/nmeth.2084.html
>
> T N Bhat
>
>
> -----Original Message-----
> From: Phillip Lord [mailto:phillip.lord@newcastle.ac.uk]
> Sent: Monday, April 08, 2013 12:53 PM
> To: Oliver Ruebenacker
> Cc: David Booth; Pat Hayes; Peter Ansell; Alan Ruttenberg;
> public-semweb-lifesci
> Subject: Re: owl:sameAs - Harmful to provenance?
>
>
> And it is this bit -- "before we can do anything useful" that is utterly
> wrong.
>
> Recently I have spent a lot of time looking at Dublin Core creator fields.
> You could not believe how many different ways they are used. String
> literals ("Phillip Lord"), last-first ("Lord, Phillip"), with abbrevs ("P.
> Lord"), multi-author ("Phillip Lord; Lindsay Marshall"), with titles ("Dr
> Phillip Lord") and so on.
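
To make the variation concrete, here is a minimal Python sketch of the kind of clean-up a consumer of such creator fields might attempt. It covers only the forms quoted in the message; the function names and the TITLES set are hypothetical and chosen for illustration, and real name data would need far more care than this.

```python
# Hypothetical set of honorifics to strip; an assumption for this sketch only.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}


def split_creators(value):
    """Split a multi-author creator string on semicolons."""
    return [part.strip() for part in value.split(";") if part.strip()]


def normalise_creator(name):
    """Return a (given, family) pair for a single creator string.

    Handles "First Last", "Last, First", "F. Last" and a leading title.
    """
    name = name.strip()
    if "," in name:
        # "Lord, Phillip" style: family name comes first.
        family, given = [p.strip() for p in name.split(",", 1)]
        return given, family
    tokens = name.split()
    # Drop a leading title such as "Dr" or "Dr.".
    if tokens and tokens[0].rstrip(".").lower() in TITLES:
        tokens = tokens[1:]
    if not tokens:
        return "", ""
    # Treat the last token as the family name, the rest as given names.
    return " ".join(tokens[:-1]), tokens[-1]


creators = [normalise_creator(c)
            for c in split_creators("Phillip Lord; Lindsay Marshall")]
```

Even this small sketch has to guess (for instance, that the last token is the family name), which is in line with the point that a specification covering every name form would be long-winded.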
>
> So, is everyone using Dublin Core wrong? Is it useless until everyone uses
> it the same way? Emphatically no, it is not useless.
>
> Would it be better if everybody did use it the same way? The answer is
> probably not. Names are incredibly complex, and representing them is, in
> turn, hard. Any specification which did full justice to all
> the different name forms in existence would be incredibly long-winded. Many
> people using the specification would get it wrong; or you could have a
> mechanism for ensuring people always used it correctly.
> Then I am sure that both people who ended up using this form of spec would
> have great fun integrating their tiny datasets.
>
> In the example, we have a number of sets of assertions which individually
> fulfil their creators' use-cases. Then, when they are brought together, the
> assertions become inconsistent, telling you up front that there is work to
> be done. And you ask in what way is this useful?
>
> Perfection is the enemy of Good.
>
>
>
> Oliver Ruebenacker <curoli@gmail.com> writes:
> >   So what most people here are saying is that before we can do
> > anything useful, we need to make sure that if two assertions use the
> > same reference, they mean the same thing.
> >
> >   To which you respond that you will accept assertions without
> > assuming that same references mean same things. You will just keep them
> > separate. There is no rule against that.
> >
> >   But in what way is this useful?
> >
> >      Take care
> >      Oliver
> >
> > On Mon, Apr 8, 2013 at 10:07 AM, David Booth <david@dbooth.org> wrote:
> >
> >> Hi Pat,
> >>
> >>
> >> On 04/04/2013 02:03 AM, Pat Hayes wrote:
> >>
> >>>
> >>> On Apr 3, 2013, at 9:00 PM, Peter Ansell wrote:
> >>>
> >>>> On 4 April 2013 11:58, David Booth <david@dbooth.org> wrote:
> >>>> On 04/02/2013 05:02 PM, Alan Ruttenberg wrote:
> >>>> On Tuesday, April 2, 2013, David Booth wrote:
> >>>> On 03/27/2013 10:56 PM, Pat Hayes wrote:
> >>>> On Mar 27, 2013, at 7:32 PM, Jim McCusker wrote:
> >>>>
> >>>> If only owl:sameAs were used correctly...
> >>>>
> >>>> Well, I agree that is a problem, but don't draw the conclusion that
> >>>> there is something wrong with sameAs, just because people keep
> >>>> using it wrong.
> >>>>
> >>>> Agreed.  And furthermore, don't draw the conclusion that someone
> >>>> has used owl:sameAs wrong just because you get garbage when you
> >>>> merge two graphs that individually worked just fine.  Those two
> >>>> graphs may have been written assuming different sets of
> >>>> interpretations.
> >>>>
> >>>> In that case I would certainly conclude that they have used it
> >>>> wrong. Have you not been reading what Pat and I have been writing?
> >>>>
> >>>> I've read lots of what you and Pat have written.  And I've learned
> >>>> a lot from it -- particularly in learning about ambiguity from Pat.
> >>>> And I'm in full agreement that owl:sameAs is *often* misused.
> >>>>
> >>>> But I don't believe that getting garbage when merging two graphs
> >>>> that individually worked fine *necessarily* indicates that
> >>>> owl:sameAs was misused -- even when it appears on the surface to be
> >>>> causing the problem.
> >>>>
> >>>
> >>> I agree, but not with your example and your analysis of it.
> >>>
> >>>> Here's a simple example to illustrate.
> >>>>
> >>>> Using the following prefixes throughout, for brevity:
> >>>>
> >>>> @prefix :    <http://example/owen/> .
> >>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
> >>>>
> >>>> Suppose that Owen is the URI owner of :x, :y and :z, and Owen
> >>>> defines them as follows:
> >>>>
> >>>> # Owen's URI definition for :x, :y and :z
> >>>> :x a :Something .
> >>>> :y a :Something .
> >>>> :z a :Something .
> >>>>
> >>>> That's all.  That's Owen's entire definition of those URIs.
> >>>> Obviously this definition is "ambiguous" in some sense.  But as we
> >>>> know, ambiguity is ultimately inescapable anyway, so I have merely
> >>>> chosen an example that makes the ambiguity obvious. As the RDF
> >>>> Semantics spec puts it: "It is usually impossible to assert enough
> >>>> in any language to completely constrain the interpretations to a
> >>>> single possible world".
> >>>>
> >>>
> >>> Yes, but by making the ambiguity this "obvious", you have rendered
> >>> the example pointless. There is *no* content here *at all*, so Owen
> >>> has not really published anything. This is not typical of published
> >>> content, even in RDF. Typically, in fact, there is, as well as some
> >>> nontrivial actual RDF content, some kind of explanation, perhaps in
> >>> natural language, of what the *intended* content of the formal RDF
> >>> is supposed to be. While an RDF engine cannot of course make use of
> >>> such intuitive explanations, other authors of RDF can, and should,
> >>> make use of it to try to ensure that they do not make assertions
> >>> which would be counter to the referential intentions of the original
> >>> authors. For example, the Dublin Core URIs were published with
> >>> almost no formal RDF axioms, but quite elaborate natural language
> >>> glosses which enable them to be used in formal RDF with considerable
> >>> success. The fact that formal (and even informal) data is inherently
> >>> ambiguous does not mean that it is inherently, or even typically,
> >>> vacuous.
> >>>
> >>
> >> This seems to suggest that natural language can somehow eliminate
> >> ambiguity, where formal languages cannot.  I don't buy that.
> >> Presumably whatever definition one expressed in natural language
> >> could be expressed in a formal language -- in principle at least.
> >> And certainly the goal of the semantic web is to have such
> >> information expressed in a formal language that is amenable to machine
> >> processing.
> >>
> >> More precisely, the basic assumption I am making is that for (almost)
> >> any definition there exists a property such that neither that
> >> property nor its negation is entailed by the definition.  I.e.,
> >> there is always more that can be said about the thing whose identity
> >> is defined.  Maybe that assumption is wrong; I don't know.  If you
> >> think it's wrong, I'd be interested in hearing why.
> >>
> >> The example may not be "realistic", but it is *not* pointless.  The
> >> whole point of choosing such a simple example is to expose the
> >> fundamental issues outright, rather than obscuring them in complexity
> >> that we cannot fully understand.  If there is some fundamental reason
> >> why you think this problem cannot happen in a more "realistic"
> >> example, then please explain what mechanism would come into play to
> >> prevent it.
> >>
> >>
> >>
> >>>> Arthur, an RDF author, publishes the following graph, G1, making
> >>>> certain assumptions about the interpretations that will be applied
> >>>> to it:
> >>>>
> >>>> # G1
> >>>> :x owl:sameAs :y .
> >>>>
> >>>
> >>> On what basis does Arthur make this assertion? The URIs were coined
> >>> by Owen, and Owen says nothing that would sanction this assumption.
> >>>
> >>
> >> Why Arthur or anyone else chooses to assert whatever they choose to
> >> assert is their business.  It is irrelevant to this analysis.
> >>
> >>
> >>
> >>>> Aster, another RDF author, publishes the following graph, G2,
> >>>> making certain other assumptions about the interpretations that
> >>>> will be applied to it:
> >>>>
> >>>> # G2
> >>>> :x owl:differentFrom :z .
> >>>>
> >>>> Alfred, a third RDF author, publishes the following graph, G3,
> >>>> making still other assumptions about the interpretations that will
> >>>> be applied to it:
> >>>>
> >>>> # G3
> >>>> :y owl:differentFrom :z .
> >>>>
> >>>
> >>> Similarly for the other two. They are making assertions using names
> >>> that belong to, and were coined by, another author without having
> >>> any possible source of justification for these nontrivial claims.
> >>> This should not be regarded as good practice, to put it mildly.
> >>>
> >>
> >> Ditto.  If you are claiming that an RDF author needs some sort of
> >> "justification" to make assertions, then please explain exactly what
> >> you mean -- preferably in formal terms -- by "justification".  E.g.,
> >> does "justification" mean that Arthur may only make assertions that
> >> are entailed by Owen's definition?  I already discussed that
> >> possibility below.
> >>
> >>
> >>
> >>>> Note that G1, G2 and G3 are all individually consistent with Owen's
> >>>> URI definition.  Furthermore, G1, G2 and G3 are all pair-wise
> >>>> consistent: there exists at least one satisfying interpretation for
> >>>> the merge of each pair.  But the merge of G1, G2 and G3 is not
> >>>> consistent:
> >>>>
> >>>
> >>> This kind of behavior is of course quite typical in any assertional
> >>> language.
> >>>
> >>
> >> Yes.
> >>
> >>
> >>
> >>>> Arthur, Aster and Alfred made different assumptions about the set
> >>>> of interpretations that would be applied to their graphs, and the
> >>>> intersection of those sets was empty.
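
The general pattern under discussion, graphs that are pairwise satisfiable but jointly unsatisfiable, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles owl:sameAs and owl:differentFrom between named individuals and nothing else in OWL, and the single-triple graphs H1, H2 and H3 are hypothetical, chosen so that the result is easy to verify by hand.

```python
from itertools import combinations


def consistent(triples):
    """Return True if the sameAs/differentFrom triples have a model.

    Uses union-find to merge everything linked by owl:sameAs, then
    rejects any owl:differentFrom whose two ends fall in one class.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    triples = list(triples)
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)
    return all(find(s) != find(o)
               for s, p, o in triples if p == "owl:differentFrom")


# Hypothetical graphs, one triple each (not taken from the thread).
H1 = [(":a", "owl:sameAs", ":b")]
H2 = [(":b", "owl:sameAs", ":c")]
H3 = [(":a", "owl:differentFrom", ":c")]

# Every pair of graphs merges without contradiction...
pairwise_ok = all(consistent(g + h)
                  for g, h in combinations([H1, H2, H3], 2))
# ...but the merge of all three does not.
merged_ok = consistent(H1 + H2 + H3)
```

Here pairwise_ok is True and merged_ok is False: no single graph is at fault on its own, and the contradiction only appears once all three are combined.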
> >>>>
> >>>> Did Arthur misuse owl:sameAs?   What if Aster never published G2?
> >>>> How could Aster's graph possibly affect the question of whether
> >>>> *Arthur* misused owl:sameAs?  It would be nonsensical to assume
> >>>> that it could.
> >>>>
> >>>
> >>> Why? Surely if Aster had a more reliable access to the primary
> >>> source of information about these enigmatic thingies than Arthur
> >>> did, then it might well be the case that Aster's publication could
> >>> reveal errors in Arthur's, by contradicting him.
> >>>
> >>
> >> What do you mean by "more reliable"?  Both Arthur and Aster had
> >> access to the exact same URI definition from Owen.  Are you
> >> suggesting that Arthur and/or Aster should have used a *different*
> >> URI definition?  If so, what definition and why?
> >>
> >>
> >>>> What if Owen later said that Arthur was correct, that :x == :y ?
> >>>> What if he later said the opposite?  Again, it would seem rather
> >>>> bizarre to say that the determination of whether Arthur had misused
> >>>> owl:sameAs could be changed -- long after Arthur had written G1 --
> >>>> by Owen's later statements.
> >>>>
> >>>
> >>> Again, I don't find this bizarre in the least. It might be, if there
> >>> was no truth of the matter concerning all this stuff, so that all
> >>> these assertions were made independently with equal (or equal lack
> >>> of) authority as to their actual truth. But that is so implausible
> >>> and artificial an assumption that I don't see why we need to even
> >>> discuss it.
> >>>
> >>
> >> The RDF Semantics is explicitly agnostic about interpretations and
> >> "actual truth".
> >>
> >> Owen published a URI definition, and Arthur, Aster and Alfred
> >> published a bunch of assertions.  Whether anyone "believes" any of
> >> those assertions, whether those assertions have any bearing on the
> >> "real world", and whether they are at all useful to anyone's
> >> applications, are entirely different questions.  AFAICT those
> >> questions are irrelevant to the technical question of whether Arthur
> >> "misused" owl:sameAs.
> >>
> >>
> >>
> >>>> One might claim that Arthur misused owl:sameAs because Owen had not
> >>>> specified whether :x was the same or different from :y or :z, and
> >>>> therefore Arthur had improperly *guessed* about the value of :x's
> >>>> owl:sameAs property.
> >>>>
> >>>> But by that logic, Arthur would not be able to assert *anything*
> >>>> new about :x.  I.e., Arthur would not be allowed to assert any
> >>>> property whose value was not already entailed by Owen's definition!
> >>>>
> >>>
> >>> Arthur may add information, of course. But Arthur is responsible for
> >>> the truth of what he asserts, and part of that responsibility, in
> >>> practice, is to take care to ascertain what the intended referents
> >>> are of any URIs published by others, that Arthur then uses in his
> >>> assertions.
> >>>
> >>
> >> But Arthur, Aster and Alfred were each fully diligent in ensuring
> >> that their assertions were consistent with all information that Owen
> >> provided.  What more could they do?
> >>
> >>
> >>> For example, if I (as I recently did) wish to assert that
> >>> something was red in color, I might use the URI
> >>>
> >>> http://linkedopencolors.moreways.net/color/rgb/ff0000.html
> >>>
> >>> rather than, say,
> >>>
> >>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
> >>>
> >>> because I know, using my color vision (not available to RDF engines)
> >>> that the first one refers to red and the second one to green, which
> >>> (I also know) is not red. I *could* use the second URI and insist
> >>> that I intended it to denote the color red, but that would be
> >>> stupid, since readers of my RDF will (and indeed should)
> >>> misunderstand me. If I were to assert that
> >>>
> >>> http://linkedopencolors.moreways.net/color/rgb/00ff00.html
> >>> owl:sameAs
> >>> http://linkedopencolors.moreways.net/color/css/red.html
> >>> .
> >>>
> >>> then I would be saying something false. And yes, in that case, it
> >>> *is* my error, even if what I have said is formally consistent
> >>> (which it in fact is) with the published RDF "definition" of these
> >>> URIs (which is in fact empty.)
> >>>
> >>
> >> In that example there were additional constraints that were not
> >> expressed formally -- such as the fact that red and green are
> >> different colors, and what wavelengths correspond to which colors,
> >> etc.  But unless you are claiming that assertions expressed in
> >> natural language can somehow avoid ambiguity where formal assertions
> >> cannot, then for the sake of analysis we can assume that all assertions
> >> have been expressed formally.
> >>
> >> I am also assuming that in the vast majority of cases, a URI's
> >> resource identity will be defined by a description, rather than by
> >> ostension
> >> http://plato.stanford.edu/entries/identity/
> >> so I am focusing on that case.
> >>
> >>
> >>
> >>>> And that would render RDF rather pointless.
> >>>>
> >>>
> >>> Why would it render it pointless? The point of RDF is not to make
> >>> completely unjustified statements about nothing in particular.
> >>>
> >>
> >> RDF is designed to allow anyone to say anything about anything.  If
> >> someone chooses to make completely unjustified statements about
> >> nothing in particular, that is their business.  AFAICT that is
> >> completely irrelevant to the technical question of whether owl:sameAs
> >> was used incorrectly.
> >>
> >>
> >>
> >>>> Maybe someone can see a way to avoid this dilemma.  Maybe someone
> >>>> can figure out a way to distinguish between the "essential"
> >>>> properties that serve to identify a resource, and other
> >>>> "inessential" properties that the resource might have. If so, and
> >>>> the number of "essential" properties is finite, then indeed this
> >>>> problem could be avoided by requiring every URI owner to define all
> >>>> of the "essential" properties of the URI's denoted resource, or by
> >>>> prohibiting anyone but the URI owner from asserting any new
> >>>> "essential" properties of the resource (beyond those the URI owner
> >>>> had defined).  Or maybe there is another way around this dilemma.
> >>>>
> >>>
> >>> What do you see the "dilemma" here as being, exactly? It seems to me
> >>> that this is not about RDF as such at all. It is about data, however
> >>> that data is recorded. People can publish data about things. They do
> >>> so by making assertions. In an ideal world, everyone is responsible
> >>> for the assertions they make. Other people can put together
> >>> information from various sources, but the reliability of the result
> >>> is hostage to the reliability of all the sources that are used. All
> >>> this is kind of obvious, but what else is being said in this thread?
> >>>
> >>
> >> The dilemma is that we would like each URI to always denote the same
> >> thing in all RDF datasets, so that when we merge RDF datasets, the
> >> merge will make sense: the merge will be consistent and an
> >> application that worked properly on an individual RDF dataset will
> >> also work properly on the merge of that dataset with other datasets.
> >> But because URI definitions are inherently ambiguous, different RDF
> >> authors will interpret them differently, and this leads to
> >> inconsistencies when datasets are merged -- even when all parties
> >> have acted in good faith and have done all that they could reasonably
> >> have been expected to do to avoid such conflicts.
> >>
> >> Key assumptions:
> >>
> >>  1. Owen's URI definition will always be ambiguous.  There will
> >> always exist a property p such that neither p nor its negation are
> >> entailed by the URI definition.
> >>
> >>  2. Owen cannot be expected to forever refine his URI definition by
> >> adding disambiguation at the request of every RDF author who uses his
> >> URIs.  At some point, Owen will reach the point of saying "that's all
> >> the disambiguation you get".  (This is the point at which the example
> >> that I gave begins.)
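
One possible formal reading of key assumption 1 is given below. The notation is introduced here and is not taken from the message: D stands for Owen's URI definition treated as a set of assertions, u for the URI being defined, p for a property that may or may not hold of the resource, and the turnstile for entailment.

```latex
\exists\, p \;:\quad D \not\models p(u) \;\land\; D \not\models \neg p(u)
```

Read this way, an author who asserts p(u) and another who asserts its negation are each individually consistent with D, even though their graphs cannot both be accepted in a merge.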
> >>
> >>
> >>
> >>>
> >>>> Unless some way around this dilemma is found, it seems unreasonably
> >>>> judgemental to accuse Arthur of misusing owl:sameAs in this case,
> >>>>
> >>>
> >>> Possibly, yes, but not because...
> >>>
> >>>> since he didn't assert anything that was inconsistent with Owen's
> >>>> URI definition
> >>>>
> >>>
> >>> Consistency is not the point. If I make completely unfounded
> >>> assertions about a topic that you have introduced, then the fact
> >>> they might be logically consistent with what you have said is
> >>> neither here nor there. What matters is whether I have the authority
> >>> to make the assertions I do, or whether I am lying, fabricating or
> >>> simply fantasizing using Owen's vocabulary.
> >>>
> >>
> >> Can you translate that into more objective technical terms?  What
> >> exactly does "unfounded" mean?  And what do you mean by "authority"?
> >> What objective technical criteria are you suggesting?  And why is it
> >> relevant to the question of whether Arthur misused owl:sameAs, given
> >> that the RDF Semantics is explicitly agnostic about interpretations?
> >>
> >> David Booth
> >>
> >>
>
> --
> Phillip Lord,                           Phone: +44 (0) 191 222 7827
> Lecturer in Bioinformatics,             Email:
> phillip.lord@newcastle.ac.uk
> School of Computing Science,
> http://homepages.cs.ncl.ac.uk/phillip.lord
> Room 914 Claremont Tower,               skype: russet_apples
> Newcastle University,                   twitter: phillord
> NE1 7RU
>
Received on Monday, 8 April 2013 18:26:26 UTC