
Re: owl:sameAs - Harmful to provenance?

From: Jim McCusker <james.mccusker@yale.edu>
Date: Wed, 27 Mar 2013 19:04:37 -0400
Message-ID: <CAAtgn=Qc6qvvFxmnJxQsOqf91BzE=cz6NPTuHZSxacBHvGgRHA@mail.gmail.com>
To: Bob Futrelle <bob.futrelle@gmail.com>
Cc: Rafael Richards <rafaelrichards@jhu.edu>, Oliver Ruebenacker <curoli@gmail.com>, David Booth <david@dbooth.org>, "<public-semweb-lifesci@w3.org>" <public-semweb-lifesci@w3.org>
PROV is now a Proposed Recommendation, which means the model has been
frozen for quite some time. It is also potentially very lightweight: you
don't have to use all of it to gain benefits from it. Simple derivation
graphs can be composed using prov:wasDerivedFrom, and there are many
properties, such as prov:wasQuotedFrom and prov:wasAttributedTo, that are
incredibly useful just by themselves. I use it as a core ontology partly
because my view of biomedical entities ends up looking very much like
provenance anyway, so it helps me keep things interoperable from the start.
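
As a rough illustration of how little is needed for a simple derivation graph, here is a minimal Python sketch. It treats triples as plain tuples and follows prov:wasDerivedFrom links; the ex: names and the derivation_sources helper are made up for the example and are not part of PROV or of any project mentioned in this thread.

```python
# Triples are plain (subject, predicate, object) tuples. The ex: names are
# hypothetical; only the two prov: property names come from PROV itself.
PROV_DERIVED = "prov:wasDerivedFrom"
PROV_ATTRIBUTED = "prov:wasAttributedTo"

triples = [
    ("ex:summary", PROV_DERIVED, "ex:annotations"),
    ("ex:annotations", PROV_DERIVED, "ex:rawText"),
    ("ex:annotations", PROV_ATTRIBUTED, "ex:nlpPipeline"),
]


def derivation_sources(entity, graph):
    """Return every entity reachable from `entity` via prov:wasDerivedFrom."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            # Follow only derivation links and skip anything already seen.
            if s == current and p == PROV_DERIVED and o not in found:
                found.append(o)
                stack.append(o)
    return found


sources = derivation_sources("ex:summary", triples)
print(sources)
```

The attribution triple is ignored by the walk, which is the point: one property is enough to get a usable derivation chain, and the others can be layered on later.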


On Wed, Mar 27, 2013 at 7:00 PM, Bob Futrelle <bob.futrelle@gmail.com> wrote:

>  So I assume that you feel that PROV is in a state that you can start
> building conforming tools.
>  Since I'm the sole developer of my NLP system, I just don't have the
> time to devote to something as big and "heavy" as PROV. But I'm happy to
> see the work going on.  Somewhere down the road I will have to concern
> myself with provenance, but more in terms of the creation and dispersion of
> basic science knowledge than things directly impinging on medically related
> knowledge.
>  Thanks for alerting me to PROV.
>   - Bob
> On Wed, Mar 27, 2013 at 3:42 PM, Jim McCusker <james.mccusker@yale.edu> wrote:
>> Which is why PROV exists. Now we have a floor to work from. I've already
>> integrated it into a number of projects.
>>  Jim
>> On Wed, Mar 27, 2013 at 3:39 PM, Bob Futrelle <bob.futrelle@gmail.com> wrote:
>>> Provenance techniques/tools/systems are nowhere near what they could
>>> be.
>>> Each provenance system or "standard" ends up being unique so the
>>> information is not interoperable.
>>>  One example among the many: http://openprovenance.org/
>>>  These days, I'm more focused on NLP than serious knowledge systems.
>>> But I find that logging and versioning allow me to generate provenance
>>> graphs if I really need them.  Often a shift in design is enough to blur
>>> earlier designs that did have some good ideas that shouldn't be lost.
>>>   - Bob Futrelle
>>>    BioNLP.org
>>>  On Wed, Mar 27, 2013 at 1:31 PM, Rafael Richards <rafaelrichards@jhu.edu> wrote:
>>>>  This has been a very prolific thread, but did we discuss provenance?
>>>>  A slideshare on  owl:sameAs - Harmful to Provenance is here:
>>>> http://www.slideshare.net/jpmccusker/owlsameas-considered-harmful-to-provenance
>>>>   Presentation Abstract:
>>>> GOTO was once a standard operation in most computer programming
>>>> languages. Edsger Dijkstra argued in 1968 that GOTO is a low level
>>>> operation that is not appropriate for higher-level programming languages,
>>>> and advocated structured programming in its place. Arguably, owl:sameAs in
>>>> its current usage may be poised to go through a similar discussion and
>>>> transformation period. In biomedical research, the provenance of
>>>> information gathered is nearly as important as, and sometimes even more
>>>> important than, the information itself. owl:sameAs allows someone to state
>>>> that two separate descriptions really refer to the same entity. Currently
>>>> that means that operational systems merge the descriptions and at the same
>>>> time, merge the provenance information, thus losing the ability to retrieve
>>>> where each individual description came from. This merging of provenance can
>>>> be problematic or even catastrophic in biomedical applications that demand
>>>> access to provenance information. Based on our knowledge of integration
>>>> issues of data in biomedicine, we give examples as use cases of this issue
>>>> in biospecimen management and experimental metadata representations. We
>>>> suggest that systems using any construct like owl:sameAs must provide an
>>>> option to preserve the provenance of the entities and ground assertions
>>>> related to those entities in question.
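
To make the merging problem in the abstract concrete, here is a small Python sketch. It is only an illustration under stated assumptions: statements are modeled as (subject, predicate, object, source) tuples, the lab:, bank: and ex: identifiers and both merge functions are invented for the example, and real triple stores handle owl:sameAs in their own ways.

```python
# Each statement carries the source it came from as a fourth element.
# All identifiers below are hypothetical.
quads = [
    ("lab:specimen42", "ex:tissueType", "liver", "lab-database"),
    ("bank:S-0042", "ex:tissueType", "kidney", "biobank-export"),
]

# Someone asserted bank:S-0042 owl:sameAs lab:specimen42 (alias -> canonical).
same_as = {"bank:S-0042": "lab:specimen42"}


def naive_merge(statements, aliases):
    """Rewrite aliases to the canonical name and drop the source entirely."""
    return {(aliases.get(s, s), p, o) for s, p, o, _ in statements}


def provenance_preserving_merge(statements, aliases):
    """Rewrite aliases, but remember the original subject and source."""
    merged = {}
    for s, p, o, source in statements:
        key = (aliases.get(s, s), p, o)
        merged.setdefault(key, []).append((s, source))
    return merged


naive = naive_merge(quads, same_as)
kept = provenance_preserving_merge(quads, same_as)

# After the naive merge the conflicting tissue types sit on one subject and
# nothing records which description said what.
print(sorted(naive))
# The second merge can still answer "who said kidney?".
print(kept[("lab:specimen42", "ex:tissueType", "kidney")])
```

The design choice in the second function is simply to keep the original description attached to each merged statement, which is one way to read the "option to preserve the provenance" that the abstract asks for.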
>>>>  Rafael
>>>>  Rafael M. Richards, M.D., M.S.
>>>>  Assistant Professor, Anesthesiology & Critical Care Medicine
>>>>  Faculty, Division of Health Science Informatics
>>>>  Johns Hopkins School of Medicine
>>>>  Baltimore, MD 2224-2760
>>>>  rafaelrichards [at] jhu edu
>>>>  On Mar 27, 2013, at 11:02 AM, Oliver Ruebenacker <curoli@gmail.com> wrote:
>>>>     Hello David,
>>>>  So if I understand your view correctly, then it could be expressed
>>>> in a language close to yours as:
>>>>  "Some people believe that if a URI occurs twice within a graph or
>>>> statement, it refers to the same thing. But this is a myth! RDF never
>>>> guarantees that two occurrences of the same URI mean the same thing."
>>>>     Take care
>>>>     Oliver
>>>> On Wed, Mar 27, 2013 at 9:37 AM, David Booth <david@dbooth.org> wrote:
>>>> Hi Oliver,
>>>> On 03/25/2013 04:02 PM, Oliver Ruebenacker wrote:
>>>>      Hello David,
>>>>   We agree that there are different interpretations. But you haven't
>>>> shown that the boundaries between interpretations are graph
>>>> boundaries (others, including me, think that each interpretation is
>>>> global).
>>>> I don't know what you mean by "boundaries between interpretations".
>>>> An interpretation may be applied to any graph or statement to determine
>>>> its truth value (or to a URI to determine the resource to which it is
>>>> bound in that interpretation).
>>>> The notion of a graph boundary is purely a matter of convenience and
>>>> utility.  A graph can consist of *any* set of RDF triples.  If you
>>>> wanted, you could apply an interpretation to a graph consisting of three
>>>> randomly selected triples from each RDF document on the web, but it
>>>> probably wouldn't be very useful to do so, because you probably would not
>>>> care about the truth value of that graph.  We generally only apply an
>>>> interpretation to a graph whose truth value we care about.
>>>> An interpretation corresponds to the *use* of a graph.  Suppose I have a
>>>> graph that "ambiguously" uses the same URI to denote both a toucan and
>>>> its web page, without asserting that toucans cannot be web pages:
>>>>   @prefix : <http://example/> .
>>>>   :tweety a :Toucan .
>>>>   :tweety a :WebPage .
>>>> When a conforming RDF application takes that RDF graph as input,
>>>> assumes it is true, and produces some output such as "Tweety is a
>>>> toucan", in effect the application has chosen a particular
>>>> interpretation to apply to that graph.  In effect, the choice of
>>>> interpretation causes the app to produce that particular output.  For
>>>> example, the app might categorize animals into species, choosing an
>>>> interpretation that maps :tweety to a kind of bird.
>>>> But a different conforming RDF application that only cares about web
>>>> page authorship might take that *same* RDF graph as input and choose a
>>>> different interpretation that maps :tweety to a web page, instead
>>>> outputting "Tweety is a web page".  In effect, the app has chosen an
>>>> interpretation that is appropriate for its purpose.
>>>> If the graph had also asserted :Toucan owl:disjointWith :WebPage, then
>>>> the graph cannot be true under OWL semantics, and the graph (as is)
>>>> would be unusable to both apps.
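
The toucan example can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the RDF and OWL model theory, not the formal definition: an interpretation is reduced to a mapping from URIs to things plus a mapping from class URIs to sets of things, and the is_true function, the bird_view and page_view names, and the "a bird" / "a page" values are all invented for illustration.

```python
GRAPH = [
    (":tweety", "rdf:type", ":Toucan"),
    (":tweety", "rdf:type", ":WebPage"),
]
DISJOINT = (":Toucan", "owl:disjointWith", ":WebPage")


def is_true(graph, denotes, class_ext):
    """Truth value of `graph` under one (simplified) interpretation.

    denotes:   URI -> the thing it is bound to
    class_ext: class URI -> set of things in that class
    """
    for s, p, o in graph:
        if p == "rdf:type" and denotes[s] not in class_ext[o]:
            return False
        if p == "owl:disjointWith" and class_ext[s] & class_ext[o]:
            return False
    return True


# One app binds :tweety to a bird, the other to a web page. For the graph to
# be true, each interpretation must put that one thing in both classes.
bird_view = ({":tweety": "a bird"},
             {":Toucan": {"a bird"}, ":WebPage": {"a bird"}})
page_view = ({":tweety": "a page"},
             {":Toucan": {"a page"}, ":WebPage": {"a page"}})

for name, (denotes, class_ext) in [("bird", bird_view), ("page", page_view)]:
    plain = is_true(GRAPH, denotes, class_ext)
    with_disjoint = is_true(GRAPH + [DISJOINT], denotes, class_ext)
    print(name, plain, with_disjoint)
```

Each call evaluates exactly one <interpretation, graph> pair. Both sample interpretations satisfy the original graph, and both fail once the disjointness triple is added; the sketch only checks these two, whereas the claim in the message is that no interpretation at all can satisfy the extended graph.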
>>>>   That makes me wonder whether you consider it in conformance with the
>>>> specs to choose different boundaries?
>>>>   For example, would you consider it conforming to apply a different
>>>> interpretation to each statement? Or how about a different
>>>> interpretation for each node of a statement? Do you see anything in
>>>> the specs against doing so?
>>>> Sure it is in conformance with the spec.  An interpretation can be
>>>> applied to any graph or any RDF statement.  And certainly you could
>>>> determine the truth value of N different statements according to N
>>>> different interpretations.  But would it be useful to do so?  Probably
>>>> not.
>>>> Furthermore, if two statements are true under two different
>>>> interpretations, that would not tell you whether a graph consisting of
>>>> those two statements would be true under a single interpretation.
>>>> OTOH, it *is* useful to apply different interpretations to different
>>>> graphs, and one reason is that you may be using those graphs for
>>>> different applications, each app in effect applying its own
>>>> interpretation.  But the fact that those graphs may be true under
>>>> different interpretations does *not* tell you whether the merge of those
>>>> graphs will be true under a single interpretation.
>>>> The RDF Semantics spec only tells you how to compute the truth value of
>>>> one <interpretation, graph> pair at a time, but you can certainly apply
>>>> it to as many <interpretation, graph> pairs as you want -- in full
>>>> conformance with the intent of the spec.  This is the same as if I
>>>> define a function f of two arguments, such that f(x,y) = x+y: that
>>>> function definition only tells you how to compute f(x,y) for one pair of
>>>> numbers at a time, but you can certainly apply it to as many pairs as
>>>> you want, without in any way violating the intent of f's definition.
>>>> David
>>>> --
>>>> IT Project Lead at PanGenX (http://www.pangenx.com)
>>>> The purpose is always improvement
>>   --
>> Jim McCusker
>> Programmer Analyst
>> Krauthammer Lab, Pathology Informatics
>> Yale School of Medicine
>> james.mccusker@yale.edu | (203) 785-4436
>> http://krauthammerlab.med.yale.edu
>> PhD Student
>> Tetherless World Constellation
>> Rensselaer Polytechnic Institute
>> mccusj@cs.rpi.edu
>> http://tw.rpi.edu

Received on Wednesday, 27 March 2013 23:05:26 UTC
