Re: owl:sameAs - Harmful to provenance? from Pat Hayes on 2013-03-28 (public-semweb-lifesci@w3.org from March 2013)

From: Pat Hayes <phayes@ihmc.us>
Date: Wed, 27 Mar 2013 19:26:10 -0500
To: Rafael Richards <rafaelrichards@jhu.edu>
Cc: Oliver Ruebenacker <curoli@gmail.com>, David Booth <david@dbooth.org>, "<public-semweb-lifesci@w3.org>" <public-semweb-lifesci@w3.org>
Message-Id: <BA983777-BB87-456B-84D0-D313DD7D5432@ihmc.us>
On Mar 27, 2013, at 12:31 PM, Rafael Richards wrote:

> This has been a very prolific thread, but did we discuss provenance?
> 
> A slideshare on owl:sameAs - Harmful to Provenance is here:
> 
> http://www.slideshare.net/jpmccusker/owlsameas-considered-harmful-to-provenance
> 
> Presentation Abstract:
> GOTO was once a standard operation in most computer programming languages. Edsger Dijkstra argued in 1968 that GOTO is a low level operation that is not appropriate for higher-level programming languages, and advocated structured programming in its place. Arguably, owl:sameAs in its current usage may be poised to go through a similar discussion and transformation period. In biomedical research, the provenance of information gathered is nearly as important as, and sometimes even more important than, the information itself. owl:sameAs allows someone to state that two separate descriptions really refer to the same entity. Currently that means that operational systems merge the descriptions and at the same time, merge the provenance information, thus losing the ability to retrieve where each individual description came from.

If this really is the case, then "operational systems" are misusing OWL. From A owl:sameAs B it follows that A and B are the same thing. It does not follow that some piece of information about A is the same as some other piece of information about B, which is what would be needed to validly merge provenance information.

> This merging of provenance can be problematic or even catastrophic in biomedical applications that demand access to provenance information.

Indeed, and it is not required by, nor does it validly follow from the meaning of, owl:sameAs. Agreed, provenance information is important. But to draw the conclusion that therefore something is wrong with owl:sameAs is a mistake, both logical and methodological. (There are indeed problems with owl:sameAs in practice, but they are not concerned with provenance.)

> Based on our knowledge of integration issues of data in biomedicine, we give examples as use cases of this issue in biospecimen management and experimental metadata representations. We suggest that systems using any construct like owl:sameAs must provide an option to preserve the provenance of the entities and ground assertions related to those entities in question.

That "option" is already available, if you use owl:sameAs correctly (and do not confuse information about some thing with meta-information about that information). The meta-information is not about the thing.
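
To make the distinction concrete, here is a minimal sketch, not a description of any particular system. It assumes each statement is stored as a quad whose fourth element names its source; the names ex:A, ex:B, ex:tissueType and the src: identifiers are hypothetical. The owl:sameAs link is used only to decide which names denote the same thing. Each statement keeps its own source.

```python
# Hypothetical data: each statement is a quad (subject, predicate, object, source).
SAME_AS = "owl:sameAs"

quads = [
    ("ex:A", "ex:tissueType", "liver", "src:lab1"),
    ("ex:B", "ex:tissueType", "hepatic", "src:lab2"),
    ("ex:A", SAME_AS, "ex:B", "src:curator"),
]


def canonical_names(quads):
    """Return a function mapping each name to one representative of its sameAs group."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for s, p, o, _source in quads:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Pick the lexically smaller name as the representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return find


def describe(thing, quads):
    """All statements about `thing` under any of its names, each with its own source."""
    find = canonical_names(quads)
    target = find(thing)
    return [
        (p, o, source)
        for s, p, o, source in quads
        if p != SAME_AS and find(s) == target
    ]
```

Asking for a description of ex:A or of ex:B gives the same two statements, and each one still says where it came from. For brevity the sketch resolves names only in subject position.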

Pat Hayes


> 
> 
> Rafael
> 
> Rafael M. Richards, M.D., M.S.
> Assistant Professor, Anesthesiology & Critical Care Medicine
> Faculty, Division of Health Science Informatics
> Johns Hopkins School of Medicine
> Baltimore, MD 2224-2760
> rafaelrichards [at] jhu edu
> 
> 
> 
> On Mar 27, 2013, at 11:02 AM, Oliver Ruebenacker <curoli@gmail.com>
>  wrote:
> 
>>     Hello David,
>> 
>>  So if I understand your view correctly, then it could be expressed
>> in a language close to yours as:
>> 
>>  "Some people believe that if a URI occurs twice within a graph or
>> statement, it refers to the same thing. But this is a myth! RDF never
>> guarantees that two occurrences of the same URI mean the same thing."
>> 
>>     Take care
>>     Oliver
>> 
>> On Wed, Mar 27, 2013 at 9:37 AM, David Booth <david@dbooth.org> wrote:
>>> Hi Oliver,
>>> 
>>> On 03/25/2013 04:02 PM, Oliver Ruebenacker wrote:
>>>> 
>>>>      Hello David,
>>>> 
>>>>   We agree that there are different interpretations. But you haven't
>>>> shown that the boundaries between interpretations are graphs
>>>> boundaries (others, including me, think that each interpretation is
>>>> global).
>>> 
>>> 
>>> I don't know what you mean by "boundaries between interpretations".
>>> An interpretation may be applied to any graph or statement to determine its
>>> truth value (or to a URI to determine the resource to which it is bound in
>>> that interpretation).
>>> 
>>> The notion of a graph boundary is purely a matter of convenience and
>>> utility.  A graph can consist of *any* set of RDF triples.  If you wanted,
>>> you could apply an interpretation to a graph consisting of three randomly
>>> selected triples from each RDF document on the web, but it probably wouldn't
>>> be very useful to do so, because you probably would not care about the truth
>>> value of that graph.  We generally only apply an interpretation to a graph
>>> whose truth value we care about.
>>> 
>>> An interpretation corresponds to the *use* of a graph.  Suppose I have a
>>> graph that "ambiguously" uses the same URI to denote both a toucan and its
>>> web page, without asserting that toucans cannot be web pages:
>>> 
>>>   @prefix : <http://example/> .
>>>   :tweety a :Toucan .
>>>   :tweety a :WebPage .
>>> 
>>> When a conforming RDF application takes that RDF graph as input, assumes it
>>> is true, and produces some output such as "Tweety is a toucan", in effect
>>> the application has chosen a particular interpretation to apply to that
>>> graph.  In effect, the choice of interpretation causes the app to produce
>>> that particular output.  For example, the app might categorize animals into
>>> species, choosing an interpretation that maps :tweety to a kind of bird.
>>> But a different conforming RDF application that only cares about web page
>>> authorship might take that *same* RDF graph as input and choose a different
>>> interpretation that maps :tweety to a web page, instead outputting "Tweety
>>> is a web page".  In effect, the app has chosen an interpretation that is
>>> appropriate for its purpose.
>>> 
>>> If the graph had also asserted :Toucan owl:disjointWith :WebPage, then the
>>> graph cannot be true under OWL semantics, and the graph (as is) would be
>>> unusable to both apps.
>>> 
>>>> 
>>>>   That makes me wonder whether you consider it in conformance with the
>>>> specs to choose different boundaries?
>>>> 
>>>>   For example, would you consider it conforming to apply a different
>>>> interpretation to each statement? Or how about a different
>>>> interpretation for each node of a statement? Do you see anything in
>>>> the specs against doing so?
>>> 
>>> 
>>> Sure it is in conformance with the spec.  An interpretation can be applied
>>> to any graph or any RDF statement.  And certainly you could determine the
>>> truth value of N different statements according to N different
>>> interpretations.  But would it be useful to do so?  Probably not.
>>> Furthermore, if two statements are true under two different interpretations,
>>> that would not tell you whether a graph consisting of those two statements
>>> would be true under a single interpretation.
>>> 
>>> OTOH, it *is* useful to apply different interpretations to different graphs,
>>> and one reason is that you may be using those graphs for different
>>> applications, each app in effect applying its own interpretation.  But the
>>> fact that those graphs may be true under different interpretations does
>>> *not* tell you whether the merge of those graphs will be true under a single
>>> interpretation.
>>> 
>>> The RDF Semantics spec only tells you how to compute the truth value of one
>>> <interpretation, graph> pair at a time, but you can certainly apply it to as
>>> many <interpretation, graph> pairs as you want -- in full conformance with
>>> the intent of the spec.  This is the same as if I define a function f of two
>>> arguments, such that f(x,y) = x+y, that function definition only tells you
>>> how to compute f(x,y) for one pair of numbers at a time, but you can
>>> certainly apply it to as many pairs as you want, without in any way
>>> violating the intent of f's definition.
>>> 
>>> David
>> 
>> 
>> 
>> -- 
>> IT Project Lead at PanGenX (http://www.pangenx.com)
>> The purpose is always improvement
>> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 28 March 2013 00:26:36 UTC