Re: owl:sameAs - Harmful to provenance? from Jim McCusker on 2013-03-27 (public-semweb-lifesci@w3.org from March 2013)

From: Jim McCusker <james.mccusker@yale.edu>
Date: Wed, 27 Mar 2013 15:42:34 -0400
To: Bob Futrelle <bob.futrelle@gmail.com>
Cc: Rafael Richards <rafaelrichards@jhu.edu>, Oliver Ruebenacker <curoli@gmail.com>, David Booth <david@dbooth.org>, "<public-semweb-lifesci@w3.org>" <public-semweb-lifesci@w3.org>
Message-ID: <CAAtgn=Q9uBA2Kva+vbjgC6J+H_a-rE_Uz3JrVwtGqpV8QN_26w@mail.gmail.com>
Which is why PROV exists. Now we have a floor to work from. I've already
integrated it into a number of projects.

Jim


On Wed, Mar 27, 2013 at 3:39 PM, Bob Futrelle <bob.futrelle@gmail.com> wrote:

> Provenance techniques/tools/systems are nowhere near what they could be.
> Each provenance system or "standard" ends up being unique, so the
> information is not interoperable.
>
> One example among the many: http://openprovenance.org/
>
> These days, I'm more focused on NLP than serious knowledge systems.
> But I find that logging and versioning allow me to generate provenance
> graphs if I really need them.  Often a shift in design is enough to blur
> earlier designs that did have some good ideas that shouldn't be lost.
>
>  - Bob Futrelle
>    BioNLP.org
>
>
>
> On Wed, Mar 27, 2013 at 1:31 PM, Rafael Richards <rafaelrichards@jhu.edu> wrote:
>
>>  This has been a very prolific thread, but did we discuss provenance?
>>
>>  A slideshare on  owl:sameAs - Harmful to Provenance is here:
>>
>>
>> http://www.slideshare.net/jpmccusker/owlsameas-considered-harmful-to-provenance
>>
>>   Presentation Abstract:
>> GOTO was once a standard operation in most computer programming
>> languages. Edsger Dijkstra argued in 1968 that GOTO is a low level
>> operation that is not appropriate for higher-level programming languages,
>> and advocated structured programming in its place. Arguably, owl:sameAs in
>> its current usage may be poised to go through a similar discussion and
>> transformation period. In biomedical research, the provenance of
>> information gathered is nearly as important as, and sometimes even more
>> important than, the information itself. owl:sameAs allows someone to state
>> that two separate descriptions really refer to the same entity. Currently
>> that means that operational systems merge the descriptions and at the same
>> time, merge the provenance information, thus losing the ability to retrieve
>> where each individual description came from. This merging of provenance can
>> be problematic or even catastrophic in biomedical applications that demand
>> access to provenance information. Based on our knowledge of integration
>> issues of data in biomedicine, we give examples as use cases of this issue
>> in biospecimen management and experimental metadata representations. We
>> suggest that systems using any construct like owl:sameAs must provide an
>> option to preserve the provenance of the entities and ground assertions
>> related to those entities in question.
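The merging problem described in the abstract can be sketched in a few lines. This is a toy illustration only, not the method from the slides: the URIs, the property names, and the helper functions (canonical, naive_merge, merge_with_provenance) are all hypothetical, and the SAME_AS table merely stands in for an owl:sameAs assertion.

```python
# Toy data: two sources describe what turns out to be the same biospecimen.
# All URIs and property names here are made up for illustration.
DATASETS = {
    "http://example/lab-a": {(":specimen1", ":tissue", "liver")},
    "http://example/lab-b": {(":sampleX", ":tissue", "liver"),
                             (":sampleX", ":diagnosis", "carcinoma")},
}

# Stands in for the assertion  :sampleX owl:sameAs :specimen1 .
SAME_AS = {":sampleX": ":specimen1"}


def canonical(term):
    """Rewrite a term to its chosen representative, if it has one."""
    return SAME_AS.get(term, term)


def naive_merge(datasets):
    """Merge every description into one set of triples; sources are dropped."""
    return {(canonical(s), p, canonical(o))
            for triples in datasets.values()
            for (s, p, o) in triples}


def merge_with_provenance(datasets):
    """Same merge, but remember which source and original subject said what."""
    merged = {}
    for source, triples in datasets.items():
        for (s, p, o) in triples:
            key = (canonical(s), p, canonical(o))
            merged.setdefault(key, set()).add((source, s))
    return merged


flat = naive_merge(DATASETS)             # no way to tell lab-a from lab-b
traced = merge_with_provenance(DATASETS)  # each triple keeps its origins
```

After the naive merge there is no record that the diagnosis came only from lab-b, while the second function can still answer that question. A real system might keep the same information with named graphs or with PROV statements rather than a Python dictionary.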
>>
>>
>>  Rafael
>>
>>   Rafael M. Richards, M.D., M.S.
>>  Assistant Professor, Anesthesiology & Critical Care Medicine
>> Faculty, Division of Health Science Informatics
>>  Johns Hopkins School of Medicine
>>  Baltimore, MD 2224-2760
>>  rafaelrichards [at] jhu edu
>>
>>
>>
>>  On Mar 27, 2013, at 11:02 AM, Oliver Ruebenacker <curoli@gmail.com>
>>  wrote:
>>
>>     Hello David,
>>
>>  So if I understand your view correctly, then it could be expressed
>> in a language close to yours as:
>>
>>  "Some people believe that if a URI occurs twice within a graph or
>> statement, it refers to the same thing. But this is a myth! RDF never
>> guarantees that two occurrences of the same URI mean the same thing."
>>
>>     Take care
>>     Oliver
>>
>> On Wed, Mar 27, 2013 at 9:37 AM, David Booth <david@dbooth.org> wrote:
>>
>> Hi Oliver,
>>
>> On 03/25/2013 04:02 PM, Oliver Ruebenacker wrote:
>>
>>
>>      Hello David,
>>
>>   We agree that there are different interpretations. But you haven't
>> shown that the boundaries between interpretations are graph
>> boundaries (others, including me, think that each interpretation is
>> global).
>>
>>
>>
>> I don't know what you mean by "boundaries between interpretations".
>> An interpretation may be applied to any graph or statement to determine
>> its truth value (or to a URI to determine the resource to which it is
>> bound in that interpretation).
>>
>> The notion of a graph boundary is purely a matter of convenience and
>> utility.  A graph can consist of *any* set of RDF triples.  If you wanted,
>> you could apply an interpretation to a graph consisting of three randomly
>> selected triples from each RDF document on the web, but it probably
>> wouldn't be very useful to do so, because you probably would not care
>> about the truth value of that graph.  We generally only apply an
>> interpretation to a graph whose truth value we care about.
>>
>> An interpretation corresponds to the *use* of a graph.  Suppose I have a
>> graph that "ambiguously" uses the same URI to denote both a toucan and its
>> web page, without asserting that toucans cannot be web pages:
>>
>>   @prefix : <http://example/> .
>>   :tweety a :Toucan .
>>   :tweety a :WebPage .
>>
>> When a conforming RDF application takes that RDF graph as input, assumes
>> it is true, and produces some output such as "Tweety is a toucan", in
>> effect the application has chosen a particular interpretation to apply to
>> that graph.  In effect, the choice of interpretation causes the app to
>> produce that particular output.  For example, the app might categorize
>> animals into species, choosing an interpretation that maps :tweety to a
>> kind of bird.  But a different conforming RDF application that only cares
>> about web page authorship might take that *same* RDF graph as input and
>> choose a different interpretation that maps :tweety to a web page,
>> instead outputting "Tweety is a web page".  In effect, the app has chosen
>> an interpretation that is appropriate for its purpose.
>>
>> If the graph had also asserted :Toucan owl:disjointWith :WebPage, then the
>> graph cannot be true under OWL semantics, and the graph (as is) would be
>> unusable to both apps.
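A minimal sketch of the toucan example, under heavy simplification: an interpretation is modeled here as a binding of URIs to resources plus an extension for each class, and only rdf:type and owl:disjointWith triples are checked. The Interpretation class, is_true, and the two "views" are hypothetical names for illustration, not anything defined by the RDF or OWL specs.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"

# The two-triple graph from the message above.
GRAPH = frozenset({
    (":tweety", RDF_TYPE, ":Toucan"),
    (":tweety", RDF_TYPE, ":WebPage"),
})


@dataclass(frozen=True)
class Interpretation:
    denotes: dict     # URI of an individual -> the resource it is bound to
    extension: dict   # URI of a class -> set of resources in that class


def is_true(graph, interp):
    """Truth value of one <interpretation, graph> pair in this toy model."""
    for s, p, o in graph:
        if p == RDF_TYPE and interp.denotes[s] not in interp.extension[o]:
            return False
        if p == DISJOINT and interp.extension[s] & interp.extension[o]:
            return False
    return True


# The species-sorting app binds :tweety to a bird ...
bird_view = Interpretation(
    denotes={":tweety": "a bird"},
    extension={":Toucan": {"a bird"}, ":WebPage": {"a bird"}},
)
# ... while the authorship app binds the same URI to a web page.
page_view = Interpretation(
    denotes={":tweety": "a web page"},
    extension={":Toucan": {"a web page"}, ":WebPage": {"a web page"}},
)

both_ok = is_true(GRAPH, bird_view) and is_true(GRAPH, page_view)

# Adding the disjointness axiom makes the graph false under both views.
with_disjointness = GRAPH | {(":Toucan", DISJOINT, ":WebPage")}
neither_ok = not (is_true(with_disjointness, bird_view)
                  or is_true(with_disjointness, page_view))
```

Note that each view has to place its chosen resource in both class extensions for the graph to come out true, which is exactly what the disjointness axiom forbids.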
>>
>>
>>   That makes me wonder whether you consider it in conformance with the
>> specs to choose different boundaries?
>>
>>   For example, would you consider it conforming to apply a different
>> interpretation to each statement? Or how about a different
>> interpretation for each node of a statement? Do you see anything in
>> the specs against doing so?
>>
>>
>>
>> Sure it is in conformance with the spec.  An interpretation can be applied
>> to any graph or any RDF statement.  And certainly you could determine the
>> truth value of N different statements according to N different
>> interpretations.  But would it be useful to do so?  Probably not.
>> Furthermore, if two statements are true under two different
>> interpretations, that would not tell you whether a graph consisting of
>> those two statements would be true under a single interpretation.
>>
>> OTOH, it *is* useful to apply different interpretations to different
>> graphs, and one reason is that you may be using those graphs for
>> different applications, each app in effect applying its own
>> interpretation.  But the fact that those graphs may be true under
>> different interpretations does *not* tell you whether the merge of those
>> graphs will be true under a single interpretation.
>>
>> The RDF Semantics spec only tells you how to compute the truth value of
>> one <interpretation, graph> pair at a time, but you can certainly apply
>> it to as many <interpretation, graph> pairs as you want -- in full
>> conformance with the intent of the spec.  This is the same as if I define
>> a function f of two arguments such that f(x,y) = x+y: that function
>> definition only tells you how to compute f(x,y) for one pair of numbers
>> at a time, but you can certainly apply it to as many pairs as you want,
>> without in any way violating the intent of f's definition.
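The point that two graphs can each be true under some interpretation while their merge is true under none can be sketched by brute force. This is a toy model, not the RDF Semantics: it searches only a two-element universe, understands only rdf:type and owl:disjointWith, and the graphs G1 and G2 and the function names are assumptions made up for the illustration.

```python
from itertools import combinations, product

RDF_TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"
UNIVERSE = ("bird", "page")   # tiny assumed universe of resources


def subsets(items):
    """All subsets of items, as candidate class extensions."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_true(graph, denotes, extension):
    """Evaluate one <interpretation, graph> pair, like one call of f(x, y)."""
    for s, p, o in graph:
        if p == RDF_TYPE and denotes[s] not in extension[o]:
            return False
        if p == DISJOINT and extension[s] & extension[o]:
            return False
    return True


def satisfiable(graph):
    """Try every interpretation over UNIVERSE; True if any makes graph true."""
    individuals = sorted({s for s, p, o in graph if p == RDF_TYPE})
    classes = sorted({o for s, p, o in graph if p == RDF_TYPE}
                     | {t for s, p, o in graph if p == DISJOINT
                        for t in (s, o)})
    for bound in product(UNIVERSE, repeat=len(individuals)):
        denotes = dict(zip(individuals, bound))
        for exts in product(subsets(UNIVERSE), repeat=len(classes)):
            if is_true(graph, denotes, dict(zip(classes, exts))):
                return True
    return False


AXIOM = (":Toucan", DISJOINT, ":WebPage")
G1 = {(":tweety", RDF_TYPE, ":Toucan"), AXIOM}
G2 = {(":tweety", RDF_TYPE, ":WebPage"), AXIOM}

each_alone = satisfiable(G1) and satisfiable(G2)   # each graph has a model
merged = satisfiable(G1 | G2)                      # the merge has none here
```

Here is_true is applied to many pairs, one at a time, just as f can be applied to many pairs of numbers. G1 and G2 are each satisfied by some interpretation, but no single interpretation in this small search space satisfies their merge.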
>>
>> David
>>
>>
>>
>>
>> --
>> IT Project Lead at PanGenX (http://www.pangenx.com)
>> The purpose is always improvement
>>
>>
>>
>


-- 
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker@yale.edu | (203) 785-4436
http://krauthammerlab.med.yale.edu

PhD Student
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj@cs.rpi.edu
http://tw.rpi.edu
Received on Wednesday, 27 March 2013 19:43:24 UTC