RE: owl:sameAs - Harmful to provenance? from Robert Stanley on 2013-03-27 (public-semweb-lifesci@w3.org from March 2013)

From: Robert Stanley <rstanley@io-informatics.com>
Date: Wed, 27 Mar 2013 13:11:20 -0700
To: "Jim McCusker" <james.mccusker@yale.edu>
Cc: <public-semweb-lifesci@w3.org>
Message-ID: <08AE3015BD5DF149951910A5E851F010010A37DC@MAIL-02.io-informatics.com>
 

 

It's also encouraging that W3C HCLS is focusing actively on the
provenance discussion. We've found (and presented on) practical benefits
from use of PROV-O and VoID.

 

All the best,

 

Bob

 

From: Jim McCusker [mailto:james.mccusker@yale.edu] 
Sent: Wednesday, March 27, 2013 12:43 PM
To: Bob Futrelle
Cc: Rafael Richards; Oliver Ruebenacker; David Booth;
<public-semweb-lifesci@w3.org>
Subject: Re: owl:sameAs - Harmful to provenance?

 

Which is why PROV exists. Now we have a floor to work from. I've already
integrated it into a number of projects.

 

Jim

 

On Wed, Mar 27, 2013 at 3:39 PM, Bob Futrelle <bob.futrelle@gmail.com>
wrote:

Provenance techniques/tools/systems are nowhere near what they could
be.

Each provenance system or "standard" ends up being unique so the
information is not interoperable.

 

One example among the many: http://openprovenance.org/

 

These days, I'm more focused on NLP than serious knowledge systems.

But I find that logging and versioning allow me to generate provenance
graphs if I really need them.  Often a shift in design is enough to blur
earlier designs that did have some good ideas that shouldn't be lost.

 

 - Bob Futrelle

   BioNLP.org

 

 

On Wed, Mar 27, 2013 at 1:31 PM, Rafael Richards
<rafaelrichards@jhu.edu> wrote:

	This has been a very prolific thread, but did we discuss provenance?

	 

	A slideshare on  owl:sameAs - Harmful to Provenance is here:

	 

	
	http://www.slideshare.net/jpmccusker/owlsameas-considered-harmful-to-provenance

	 

	Presentation Abstract:
	GOTO was once a standard operation in most computer programming
languages. Edsger Dijkstra argued in 1968 that GOTO is a low level
operation that is not appropriate for higher-level programming
languages, and advocated structured programming in its place. Arguably,
owl:sameAs in its current usage may be poised to go through a similar
discussion and transformation period. In biomedical research, the
provenance of information gathered is nearly as important as, and
sometimes even more important than, the information itself. owl:sameAs
allows someone to state that two separate descriptions really refer to
the same entity. Currently that means that operational systems merge the
descriptions and at the same time, merge the provenance information,
thus losing the ability to retrieve where each individual description
came from. This merging of provenance can be problematic or even
catastrophic in biomedical applications that demand access to provenance
information. Based on our knowledge of integration issues of data in
biomedicine, we give examples as use cases of this issue in biospecimen
management and experimental metadata representations. We suggest that
systems using any construct like owl:sameAs must provide an option to
preserve the provenance of the entities and ground assertions related to
those entities in question.
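
A minimal Python sketch of the loss the abstract describes (not taken from the presentation). It assumes statements are stored as (subject, predicate, object, source) quads; the identifiers `specimenA`, `specimenB`, `labX`, `labY`, and both merge functions are hypothetical. The first function merges descriptions the way the abstract says operational systems do today, discarding the source; the second performs the same merge but records which source asserted each triple.

```python
# Hypothetical quads: (subject, predicate, object, source).
QUADS = [
    ("specimenA", "tissueType", "liver", "labX"),
    ("specimenB", "tissueType", "kidney", "labY"),
]
# Stands in for the statement: specimenB owl:sameAs specimenA
SAME_AS = {"specimenB": "specimenA"}


def canonical(node):
    """Map a node to its representative under the sameAs table."""
    return SAME_AS.get(node, node)


def naive_merge(quads):
    """Merge descriptions and drop the source column."""
    return {(canonical(s), p, o) for s, p, o, _src in quads}


def merge_keeping_sources(quads):
    """Merge descriptions but remember which source asserted each triple."""
    merged = {}
    for s, p, o, src in quads:
        merged.setdefault((canonical(s), p, o), set()).add(src)
    return merged


triples = naive_merge(QUADS)            # no way to tell who said what
traced = merge_keeping_sources(QUADS)   # each triple maps to its sources
```

After the naive merge, both tissue types hang off `specimenA` with no record of which lab reported which; the second structure can still answer that question.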

	 

	 

	Rafael

	 

	Rafael M. Richards, M.D., M.S.

	Assistant Professor, Anesthesiology & Critical Care Medicine

	Faculty, Division of Health Science Informatics

	Johns Hopkins School of Medicine

	Baltimore, MD 2224-2760

	rafaelrichards [at] jhu edu

	 

	 

	On Mar 27, 2013, at 11:02 AM, Oliver Ruebenacker
<curoli@gmail.com>

	 wrote:

	
	
	

	    Hello David,
	
	 So if I understand your view correctly, then it could be expressed
	in a language close to yours as:
	
	 "Some people believe that if a URI occurs twice within a graph or
	statement, it refers to the same thing. But this is a myth! RDF never
	guarantees that two occurrences of the same URI mean the same thing."
	
	    Take care
	    Oliver
	
	On Wed, Mar 27, 2013 at 9:37 AM, David Booth <david@dbooth.org>
wrote:
	
	

	Hi Oliver,
	
	On 03/25/2013 04:02 PM, Oliver Ruebenacker wrote:
	
	

	
	     Hello David,
	
	  We agree that there are different interpretations. But you haven't
	shown that the boundaries between interpretations are graph
	boundaries (others, including me, think that each interpretation is
	global).

	
	
	I don't know what you mean by "boundaries between interpretations".
	An interpretation may be applied to any graph or statement to
	determine its truth value (or to a URI to determine the resource to
	which it is bound in that interpretation).
	
	The notion of a graph boundary is purely a matter of convenience and
	utility.  A graph can consist of *any* set of RDF triples.  If you
	wanted, you could apply an interpretation to a graph consisting of
	three randomly selected triples from each RDF document on the web,
	but it probably wouldn't be very useful to do so, because you
	probably would not care about the truth value of that graph.  We
	generally only apply an interpretation to a graph whose truth value
	we care about.
	
	An interpretation corresponds to the *use* of a graph.  Suppose I
	have a graph that "ambiguously" uses the same URI to denote both a
	toucan and its web page, without asserting that toucans cannot be
	web pages:
	
	  @prefix : <http://example/> .
	  :tweety a :Toucan .
	  :tweety a :WebPage .
	
	When a conforming RDF application takes that RDF graph as input,
	assumes it is true, and produces some output such as "Tweety is a
	toucan", in effect the application has chosen a particular
	interpretation to apply to that graph.  In effect, the choice of
	interpretation causes the app to produce that particular output.
	For example, the app might categorize animals into species, choosing
	an interpretation that maps :tweety to a kind of bird.  But a
	different conforming RDF application that only cares about web page
	authorship might take that *same* RDF graph as input and choose a
	different interpretation that maps :tweety to a web page, instead
	outputting "Tweety is a web page".  In effect, the app has chosen an
	interpretation that is appropriate for its purpose.
	
	If the graph had also asserted :Toucan owl:disjointWith :WebPage,
	then the graph cannot be true under OWL semantics, and the graph (as
	is) would be unusable to both apps.
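
A rough Python sketch of the toucan example, under heavy simplifying assumptions: a two-resource universe (`a_bird`, `a_page`), classes modelled as plain sets, and only the two type triples considered. None of these names come from the RDF or OWL specs. It checks by brute force whether any interpretation makes the graph true, first without and then with the disjointness constraint.

```python
from itertools import product

# The two type triples from the example, as (subject, class) pairs.
GRAPH = [("tweety", "Toucan"), ("tweety", "WebPage")]
UNIVERSE = ["a_bird", "a_page"]  # hypothetical two-resource universe


def is_true(graph, denotes, extension, disjoint=()):
    """Truth of the graph under one interpretation.

    denotes:   URI -> resource
    extension: class name -> set of resources
    disjoint:  pairs of classes required to share no members
    """
    typed_ok = all(denotes[s] in extension[c] for s, c in graph)
    disjoint_ok = all(not (extension[a] & extension[b]) for a, b in disjoint)
    return typed_ok and disjoint_ok


def all_interpretations():
    """Enumerate every interpretation over the tiny universe."""
    subsets = [set(), {"a_bird"}, {"a_page"}, {"a_bird", "a_page"}]
    for res, toucans, pages in product(UNIVERSE, subsets, subsets):
        yield {"tweety": res}, {"Toucan": toucans, "WebPage": pages}


def satisfiable(graph, disjoint=()):
    """True if at least one interpretation makes the graph true."""
    return any(is_true(graph, d, e, disjoint) for d, e in all_interpretations())


without_axiom = satisfiable(GRAPH)
with_axiom = satisfiable(GRAPH, disjoint=[("Toucan", "WebPage")])
```

Without the constraint, an interpretation that maps `tweety` to the bird and places the bird in both class extensions makes the graph true, and so does the analogous one for the page. With the constraint, no interpretation in this small universe works.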
	
	
	

	
	  That makes me wonder whether you consider it in conformance with
	the specs to choose different boundaries?
	
	  For example, would you consider it conforming to apply a different
	interpretation to each statement? Or how about a different
	interpretation for each node of a statement? Do you see anything in
	the specs against doing so?

	
	
	Sure it is in conformance with the spec.  An interpretation can be
	applied to any graph or any RDF statement.  And certainly you could
	determine the truth value of N different statements according to N
	different interpretations.  But would it be useful to do so?
	Probably not.  Furthermore, if two statements are true under two
	different interpretations, that would not tell you whether a graph
	consisting of those two statements would be true under a single
	interpretation.
	
	OTOH, it *is* useful to apply different interpretations to different
	graphs, and one reason is that you may be using those graphs for
	different applications, each app in effect applying its own
	interpretation.  But the fact that those graphs may be true under
	different interpretations does *not* tell you whether the merge of
	those graphs will be true under a single interpretation.
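
The point about merges can be sketched in Python as follows. This is an illustration under stated assumptions, not the model theory itself: the class extensions are fixed in advance and share no members (standing in for a disjointness constraint), and an interpretation is reduced to a mapping from the one URI to a resource. All names are hypothetical.

```python
# Hypothetical fixed class extensions; birds and pages share no members.
EXTENSION = {"Toucan": {"a_bird"}, "WebPage": {"a_page"}}
RESOURCES = ["a_bird", "a_page"]

# Two one-triple graphs, as (subject, class) pairs.
GRAPH_A = [("tweety", "Toucan")]
GRAPH_B = [("tweety", "WebPage")]


def is_true(graph, denotes):
    """Truth of a graph under one URI-to-resource mapping."""
    return all(denotes[s] in EXTENSION[c] for s, c in graph)


def satisfying(graph):
    """All URI-to-resource mappings under which the graph is true."""
    return [{"tweety": r} for r in RESOURCES if is_true(graph, {"tweety": r})]


sat_a = satisfying(GRAPH_A)                # true when tweety is the bird
sat_b = satisfying(GRAPH_B)                # true when tweety is the page
sat_merge = satisfying(GRAPH_A + GRAPH_B)  # no single mapping works
```

Each graph is true under some mapping, but the two mappings differ, so knowing that both are satisfiable says nothing about the merged graph, which here has no satisfying mapping at all.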
	
	The RDF Semantics spec only tells you how to compute the truth value
	of one <interpretation, graph> pair at a time, but you can certainly
	apply it to as many <interpretation, graph> pairs as you want -- in
	full conformance with the intent of the spec.  This is the same as
	if I define a function f of two arguments, such that f(x,y) = x+y.
	That function definition only tells you how to compute f(x,y) for
	one pair of numbers at a time, but you can certainly apply it to as
	many pairs as you want, without in any way violating the intent of
	f's definition.
	
	David

	
	
	
	-- 
	IT Project Lead at PanGenX (http://www.pangenx.com)
	The purpose is always improvement

	 

 





 

-- 
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker@yale.edu | (203) 785-4436
http://krauthammerlab.med.yale.edu

PhD Student
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj@cs.rpi.edu
http://tw.rpi.edu
Received on Wednesday, 27 March 2013 20:11:46 UTC