Re: Observations about facts in genomics from Pat Hayes on 2013-03-22 (public-semweb-lifesci@w3.org from March 2013)

From: Pat Hayes <phayes@ihmc.us>
Date: Fri, 22 Mar 2013 01:28:49 -0500
To: Peter Ansell <ansell.peter@gmail.com>
Cc: Alan Ruttenberg <alanruttenberg@gmail.com>, Jeremy J Carroll <jjc@syapse.com>, Jerven Bolleman <me@jerven.eu>, Graham Klyne <graham.klyne@zoo.ox.ac.uk>, w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-Id: <E01ECF83-5D37-4BFD-AB20-221D4F8F7497@ihmc.us>

On Mar 21, 2013, at 9:56 PM, Peter Ansell wrote:

> On 22 March 2013 12:05, Alan Ruttenberg <alanruttenberg@gmail.com> wrote:
>> On Wed, Mar 20, 2013 at 3:15 PM, Jeremy J Carroll <jjc@syapse.com> wrote:
>>> 
>>> To me, that seems to lead us back to the earlier discussion (rathole?)
>>> about owl:sameAs
>>> I tend to a view that there are diminishing returns in terms of levels of
>>> indirection here!
>> 
>> As the number of levels of indirection increases, perhaps. But here we are
>> talking about 1 level - separating claims from truth.
> 
> The question that scientists spend their lives trying to establish is
> the one that you seem to think is clearly defined in this statement,
> ie, "separating claims from 'truth'". In some domains, such as
> logic/mathematics, "truth" is easy to define, and that seems to be the
> basis that the RDF specifications use to justify their semantics.

Mathematics has absolutely nothing to do with this discussion. Logic is not concerned with truth per se, but rather with analysing operations upon represented data which can be guaranteed to *preserve* truth. 

The semantics of RDF is not completely arbitrary. It is admittedly simple, because RDF is simple, but it is based on insights from about 60 years of scholarship and work on models of truth in formalized languages, originating in Tarski's work on the philosophical analysis of meaning in language. It has had inputs from formal logic but also from linguistics, computer science, philosophical ontology and the philosophy of science. The foundations of this area have been thoroughly prospected for alternatives. Its techniques have been applied to all forms of data manipulation by computer and to the analysis of a huge variety of kinds of information, including diagrammatic notations, formalisms for representing all kinds of scientific data and hypotheses, probabilistic calculi and natural human languages.

> However, in others, such as life sciences (ie, the domain of
> public-semweb-lifesci), at least some of the best information we have
> is approximate idealist information that may not exactly match
> anything at all in reality

Quite so. That is exactly what logic is designed to be able to do, to express such idealizations.

> (ie, large genome reference assemblies that
> are statistically modelled from multiple samples but may not actually
> match base for base with any actual DNA strands in the real world).
> These approximations are referenced directly by scientists in their
> publications without them having to qualify every statement as
> referencing a "claim".

Scientists' publications intended to be read by others in their field will no doubt make presumptions of a degree of knowledge and sophistication in their readers which exceeds that of the general human populace. Formalisms such as RDF and OWL, however, are designed to represent data which can be processed by machines, and rather simple machines at that. RDF engines cannot be expected to bring to bear the same degree of sophistication that is presumed of readers of peer-reviewed professional publications. You are going to have to do some extra work in order to make a few things clear to these dumb machines, I am afraid. That is, if you expect them to do anything useful for you. 

> I am not sure why you say that there is only one layer of wrapping
> needed. I can think of many different situations where someone could
> have more than one layer of alternative interpretations that they may
> need to accommodate other scientists now and in the future. The 4 or
> so layers that the provenance ontology has just for published
> documents are worrying enough, and they may not be enough to map the
> complexities of genome reference assemblies, as genomics researchers
> may have a different "publication" workflow to book publishers.
> 
>> 2) I think there's a big difference between what one publishes on the web,
>> and what one uses in the privacy of one's home, so to speak. If one is
>> publishing on the web, it is good citizenship to respect specifications, and
>> to consider the impact of one's assertions on the broader data consumer
>> community. That consideration, IMO, is justification enough for the 1 extra
>> indirection necessary to not make statements that are too strong.
> 
> The specifications seem to be based on premises that the practicing
> scientists may not ever accept. Ie, the idea that there is static
> scientific "truth" that can be unambiguously and continuously
> communicated, and not "challengeable current theories" that can be
> either alternatively stated, or gradually or suddenly revoked and
> replaced with new best theories.

Nothing in RDF presumes anything about the nature of scientific truth. RDF is intended for publishing and communicating simple propositional information on the Web in a machine-processable form. It is a simple language which has no provision for expressing doubt or degrees of commitment, or indeed disjunctive alternatives (although OWL and other RDF extensions may have such expressiveness). It also has no built-in expressivity for saying that one piece of RDF supersedes or replaces another; but again, such expressivity can be added by semantic extensions, in this case typically by using RDF to express meta-data about other RDF sources. But all this extra stuff, if required, must be said explicitly somehow, because the machines that are processing this stuff are really not very knowledgeable, and have to have the simplest things explained to them. (They are also extremely literal-minded and fussy about details, which admittedly gets tiresome, but what can one do?)
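
To make the "meta-data about other RDF sources" idea concrete, here is a minimal sketch in plain Python, with triples modelled as tuples. The example.org source names are invented for illustration, and the choice of the Dublin Core term dcterms:replaces as the supersession property is an assumption, not something RDF itself prescribes.

```python
# Sketch only: the example.org IRIs and the use of dcterms:replaces as the
# supersession property are assumptions made for illustration.
DCT_REPLACES = "http://purl.org/dc/terms/replaces"

# Metadata triples *about* published RDF sources: (subject, predicate, object).
metadata = {
    ("http://example.org/assembly/v2", DCT_REPLACES, "http://example.org/assembly/v1"),
    ("http://example.org/assembly/v3", DCT_REPLACES, "http://example.org/assembly/v2"),
}

def superseded(meta):
    """Return the sources that some metadata statement explicitly says are replaced."""
    return {o for (_s, p, o) in meta if p == DCT_REPLACES}

def current(sources, meta):
    """Return the sources that no metadata statement marks as replaced."""
    return set(sources) - superseded(meta)

known_sources = [
    "http://example.org/assembly/v1",
    "http://example.org/assembly/v2",
    "http://example.org/assembly/v3",
]
```

Calling current(known_sources, metadata) leaves only the v3 source. Without the explicit metadata, the machine has no way to tell the three apart: current(known_sources, set()) returns all of them.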

In addition, the fact that RDF is designed as a web-based notation forces some design choices simply because of the nature of the Web. For example, once some RDF is published, it cannot be unpublished. Even if the original source is unplugged from the Web (and that is considered to be bad practice), it may well have been archived somewhere, and not by its original owner. The state of the Web is dynamic, but it also has pieces of old state lying around, so to speak, in ways that are outside of anyone's control. 

> Scientists need to be able to
> interpret, contrast, and concurrently utilise, past information
> directly without having to suddenly wrap up past "truths" inside of
> "claims" because they may be out of date with something someone else
> has now put into the RDF-sphere. The whole idea that statements could
> be "too strong" takes its basis from "static truth" and I cannot
> personally accept that we need to represent everything for life
> sciences inside of "claims" (or alternatively have everyone create new
> URIs for everything they want to talk about) just in case it changes in
> future or someone would find it difficult to deal with the statement
> if their application relies on a different structure for their queries
> to work.

It sounds, then, as though you require a global convention that allows the "current" state of the theory to be always dynamically adjustable without needing to state this explicitly. In order to make this work, you need to provide some way in which previously published content which is no longer considered to be currently true is recognizable as having this 'now-superseded' status. How do you propose to do this, on an open Web? If data is simply published as the "current" truth, and then is later superseded by a new "current" truth, who is responsible for un-currenting the old previously-current RDF, which is no longer current? How could anyone possibly find it all, including all their copies and all the conclusions that were drawn from them? 

To be less rhetorical, there is considerable experience now in how to arrange for new data to supersede older, deprecated, data, at least at the source. But this does require some mutual discipline to be used by both the publishers of the data and those intending to use it: the latter having the responsibility to make sure their data is up to date, and agreeing to always use the source rather than any archive or copy, and the former having that of keeping the record system accurate. It is not an open-web situation: it needs hand-shaking. 

> If someone else has a completely different problem domain that would
> find it difficult to deal with direct, "un-framed"/"un-claim-wrapped"
> statements from third-parties using a URI because they clash with some
> of their statements or assumptions, how would the claim wrapping
> practically help them?
> 
> Life scientists attempting to use RDF to model their heterogeneous
> information aren't trying to make ambiguous statements or reject the
> wisdom of the logic/maths backgrounds of the specifications authors,
> they are just trying to get work done, and it seems that we are being
> told that we are bad citizens for having a complex, "un-truthy"
> domain.

All citizens have certain responsibilities if they are going to use a global interchange format of any kind, which is to find a way to encode their domain in that format in a way that conforms to the published rules of the format. Or if that is not possible, then at least to publish the ways in which they are failing to conform, and to ensure that readers of their data have adequate warning of the ways they are failing to conform, and what the consequences are. Which really amounts to defining a new version of the format, in fact. There are general principles, now widely respected, for how to do this on the Web, for example making sure that your root URIs link to a specification of how you propose to use the names that you make from those roots, just as the meaning of 
http://www.w3.org/1999/02/22-rdf-syntax-ns#Property 
is formally defined in the document you can find at  
http://www.w3.org/1999/02/22-rdf-syntax-ns. 
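
The relationship between those two URIs can be sketched in a few lines of Python using the standard library. The helper name namespace_document is made up for illustration, and the sketch only covers hash-style names like the one above; names in slash-style namespaces are resolved differently.

```python
from urllib.parse import urldefrag

def namespace_document(term_iri):
    """Split a hash-style IRI into the document URL a client would fetch
    and the fragment that names the term within that document."""
    url, fragment = urldefrag(term_iri)
    return url, fragment

# The example from the text: the term rdf:Property and its defining document.
doc, local_name = namespace_document(
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
)
```

Here doc is the namespace URL without the fragment and local_name is "Property", so anyone who meets the name can find out where its intended meaning is published.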

So by all means invent a biosciences-friendly, less 'mathematical', notation, and publish it for others to use. The Web is always available :-)

Pat Hayes


> 
> Cheers,
> 
> Peter
> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Friday, 22 March 2013 06:59:39 UTC