Re: Evidence from M. Scott Marshall on 2007-06-21 (public-semweb-lifesci@w3.org from June 2007)

From: M. Scott Marshall <marshall@science.uva.nl>
Date: Thu, 21 Jun 2007 16:24:03 +0200
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: "Kashyap, Vipul" <VKASHYAP1@PARTNERS.ORG>, public-semweb-lifesci@w3.org, Pat Hayes <phayes@ihmc.us>
Message-ID: <467A8A03.7050907@science.uva.nl>
I see evidence as a special type of provenance for "facts", 
"observations", and "conclusions" in a knowledgebase.

Motivation for evidence is the desire to represent information about an 
experiment, such as the hypothesis. If we want to work with hypotheses, 
then we need to represent hypothetical information. But how? A uniform 
approach would treat all information as propositional or hypothetical, 
rather than having a separate class for hypotheses, so that a 
"hypothesis" can be promoted to "fact", but I digress.. :) However we 
represent it, we would 
like to know how our hypothetical fact is supported by evidence, such as 
protocols and methods.

Alan Ruttenberg wrote:
> Maybe we can bring this back to the main subject: What problems are we 
> trying to solve by recording evidence? What are the ways we would know 
> that we've made a mistake?
> 
> (I suspect that there will be a variety of answers to this, and I'm very 
> curious to hear what people think)

I'll try to answer this:
We want to record evidence in order to evaluate and weigh the quality of 
data/information, as well as steer and/or evaluate any conclusions that 
are made on the basis of that data. This is especially important in an 
environment for computational experiments. My test: If we can apply our 
own criterion to evaluate our confidence in a given fact, even when it 
is in someone else's knowledgebase, we have succeeded with our 
representation of the evidence. So, an example of how to represent such 
a criterion and reason with it about example evidence would be nice..
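
One way such a criterion might be represented, purely as a sketch: the 
class names, the evidence kinds, the weights and the example fact below 
are all my own assumptions for illustration, not an existing vocabulary 
or API. The idea is only that a criterion is a function that anyone can 
apply to a fact, as long as the fact carries its evidence with it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    """One piece of support for a fact (field names are illustrative)."""
    kind: str                                   # e.g. "manual-curation"
    details: Dict[str, str] = field(default_factory=dict)


@dataclass
class Fact:
    statement: str
    evidence: List[Evidence] = field(default_factory=list)


# A criterion is any function from a fact to a confidence in [0, 1].
Criterion = Callable[[Fact], float]


def my_criterion(fact: Fact) -> float:
    """Hypothetical personal policy: weight each kind of evidence and
    take the strongest piece. The weights are made up."""
    weights = {"manual-curation": 0.9, "text-mining": 0.5}
    scores = [weights.get(e.kind, 0.1) for e in fact.evidence]
    return max(scores, default=0.0)             # no evidence, no confidence


# A fact from "someone else's knowledgebase" (hypothetical content).
someone_elses_fact = Fact(
    "article42 is about proteinX",
    [Evidence("text-mining", {"threshold": "0.8"})],
)
confidence = my_criterion(someone_elses_fact)
```

Someone with a different policy would swap in their own function without 
touching the fact or its evidence, which is the point of the test above.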

Evidence in Text mining
-----------------------
Suppose that we are trying to distill knowledge provided by a 
scientific article into some representation. Example: "Is the article 
about proteinX?". If so, "How relevant is proteinX to the article?" and 
so forth. If the distillation process is carried out by a person, then 
who? In the case of text mining, we might like to know what algorithms 
and techniques, queries, pattern recognizers (Bayesian or lexical 
patterns?), threshold values, etc. were used to extract knowledge. If a 
person used a text mining workflow to support the distillation process, 
then we would like the URL to the workflow WSDL (from which we can 
usually discover the other details) and to know who the person was.

In general, we would like to know the resources involved in producing a 
particular piece of data (or "fact"). We would like to know the actors, 
roles, conditions, algorithms, program versions, what rules were fired, 
and information resources.
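
As a rough sketch of what such a record could look like, here is one 
slot per item in the list above. The class name, the field names and the 
example values are assumptions of mine, not taken from any existing 
schema. The small helper reports which slots were left empty, since 
missing provenance is itself worth knowing about.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    """Resources involved in producing one piece of data (illustrative)."""
    actors: Dict[str, str] = field(default_factory=dict)      # actor -> role
    conditions: Dict[str, str] = field(default_factory=dict)
    algorithms: List[str] = field(default_factory=list)
    program_versions: Dict[str, str] = field(default_factory=dict)
    rules_fired: List[str] = field(default_factory=list)
    information_resources: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of the slots that were left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# Hypothetical record for a text-mining-supported annotation.
record = ProvenanceRecord(
    actors={"curator-1": "annotator"},
    algorithms=["lexical-pattern-matcher"],
    program_versions={"lexical-pattern-matcher": "0.1"},
)
gaps = record.missing()
```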

An important challenge in the future will be to combine results from 
manual and automated processes. Most of us would tend to view "facts" 
that result from an automated process as more hypothetical or 
questionable than the same coming from a human expert. On the road to 
automation, however, we should eventually reach the point that the 
quality of "text mining"-supported (i.e. not generated!) annotations 
will be generally higher than manual-only annotation.

Evidence in Microarrays
-----------------------
I don't intend to start a debate about the particulars of microarrays 
but I think that evidence comes up in practice here throughout the 
entire process of measurement and analysis. Gene expression, as measured 
by microarrays, is actually a measurement of changes in mRNA levels at a 
particular time, which *indicates* how much change in the process of 
expression has occurred under *specific* *conditions*. So, already we 
have an example of terminology that is not ontologically accurate when 
incorrectly applied (to microarrays) - technically, measuring mRNA 
levels is not equivalent to measuring the quantity of protein product 
("expression"). But the term has been in use for so long that it remains 
acceptable to refer to microarray analysis as "expression analysis". :)

In the case of "gene expression", the statistical process of microarray 
analysis only provides a probability that a gene is up- or down-regulated 
(e.g. in the common reference model). However, there is a series of 
decisions and conditions that lead up to the "call" (up, down, 
unchanged) for a particular gene and thus the resulting set of 
differentially expressed genes for the array. The following conditions 
can all be relevant to decisions in how much weight to give to the 
resulting data:

* Experimental design - organism, conditions, disease, phenotype, ..
* Source of cells, enzymes, ..
* Materials handling (thawed? how often?)
* Protocols used such as RNA extraction
* Operator
* Array layout and design - including choice of oligos
* Instrumentation details - array spotter/printer, laser type and 
calibration, ..
* Ozone levels (I'm not kidding!)
* Image analysis ("Feature Extraction") software and settings
* Type of normalization
* Criteria for discarding data as "outliers"
* Criteria for classifying gene as differentially expressed (p-value 
cutoff, ANOVA, ..)
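
To illustrate why the last item matters, here is a minimal sketch in 
which the criteria travel with the call. Everything in it is assumed for 
the example: the class and function names, the gene name, the numbers, 
and the simplification that a call depends only on a log ratio and a 
p-value. It is not how any particular analysis package works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCriteria:
    """Settings behind a call (illustrative subset of the list above)."""
    p_value_cutoff: float
    normalization: str


@dataclass(frozen=True)
class GeneCall:
    gene: str
    call: str               # "up", "down" or "unchanged"
    p_value: float
    criteria: CallCriteria  # kept with the call so others can re-judge it


def make_call(gene: str, log_ratio: float, p_value: float,
              criteria: CallCriteria) -> GeneCall:
    """Call a gene 'unchanged' unless the p-value passes the cutoff;
    otherwise the sign of the log ratio gives the direction."""
    if p_value > criteria.p_value_cutoff or log_ratio == 0:
        call = "unchanged"
    else:
        call = "up" if log_ratio > 0 else "down"
    return GeneCall(gene, call, p_value, criteria)


# The same hypothetical measurement under two different cutoffs.
strict = CallCriteria(p_value_cutoff=0.01, normalization="lowess")
lenient = CallCriteria(p_value_cutoff=0.05, normalization="lowess")
strict_call = make_call("geneA", 1.2, 0.03, strict)
lenient_call = make_call("geneA", 1.2, 0.03, lenient)
```

The two calls disagree ("unchanged" versus "up") although the measurement 
is identical, so a call without its criteria cannot really be weighed.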

Again, the point that I'm trying to make about microarrays is that 
evidence (as well as uncertainty) can be represented and used, even for 
the measurements ("observations") themselves. But this is not done in 
practice. Even if you wanted to simply "pool" microarray data (most 
people don't), it is very difficult to do because some of the most 
important metadata (e.g. experimental design), if available, is often in 
free text format.

-scott

p.s. My introduction to HCLS summarizes the way that I look at evidence 
a lot more succinctly than the above:  ;)
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Feb/0131.html

-- 
M. Scott Marshall
http://staff.science.uva.nl/~marshall
http://adaptivedisclosure.org
Received on Thursday, 21 June 2007 14:24:17 UTC