- From: Bob Futrelle <bob.futrelle@gmail.com>
- Date: Fri, 9 Jun 2006 10:41:38 -0400
- To: "AJ Chen" <canovaj@gmail.com>, public-semweb-lifesci@w3.org
Title: The content of biomedical research and its representation in papers
This note was stimulated by a note from A J Chen posted today, 9 June
2006, on public-semweb-lifesci@w3.org, with a link to:
http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce
I realize that I am going beyond what Chen is attempting to cover, but
his note did trigger a desire to discuss broader issues.
My note below is being sent to both the above semantic web list and to
the BioNLP.org list.
- Bob Futrelle
+++++++++++++++++++++++++++++++++++++++++
Chen's "An Ontology for Experiment Self-Publishing" is an interesting
and important topic, to be sure. But it is important to place any
such effort in the context of what we already know about how results
are published today in the biomedical research literature. A good
deal can be said about this, but I will limit myself to just four
issues in this note.
1. The nature of the discourse, especially hedging.
2. The strategies behind the design of experiments; context.
3. The use of figures to represent material not easily described in text.
4. What does this mean for formal representations of knowledge about
experiments?
+++++++++++++++++++++++++++++++++++++++++
1. The nature of discourse, and the preponderance of hedging -
This is an important point. It concerns the fact that biomedical
research papers are not simply a collection of statements of fact, far
from it. Papers are loaded with qualitative phrases and indefinite
terms. This is a natural consequence of the fact that living systems
are extraordinarily complex so that it is difficult in many (most?)
cases to make flat factual statements about them.
The simplest way to see this is to go through papers, sentence
by sentence, looking for hedges. A hedge is a qualifying term that
reduces the certainty of the statement or predication that it appears
in. Hedges are so numerous that there are entire books devoted to the
topic, e.g., "Hedging in Scientific Research Articles" by Ken Hyland,
308 pp (John Benjamins, 1998). See also:
http://en.wikipedia.org/wiki/Hedge_(linguistics)
Hedges in biomedical research papers sometimes appear more than once
in a single sentence and sometimes not at all. Below are some examples
of hedge constructions I found in some recent papers, averaging close
to one hedge construction per sentence:
possibility
slightly
less than 2-fold
far greater than
similar effects
reducing
significant
reduced
indicating
indeed
synergistically increased
presumptive
well ordered
located ideally
interact with
significantly reduce
abolished
may contribute
similar to
in the present structure
joins this network
complicated
without reducing
impaired
defective
important
were implicated
voluminous
limited
dispersed randomly
rarely seen in close proximity
blocked
we hypothesize that
assertion based on
accumulated
numerous
resembling
a clue came from
associated with
reminiscent of
propensity
The list could go on and on before reaching anything resembling
closure. Some of the examples above clearly indicate the contextual
limitations of a statement, e.g., "in the present structure", and
"assertion based on".
+++++++++++++++++++++++++++++++++++++++++
2. The strategies behind the design of experiments; context -
Experiments with complex living systems or in vitro biochemical
investigations always have to look at a limited subset of systems and
phenomena, at limited aspects of a problem. All results are therefore
contextually limited. Only a modest subset of statements can be taken
out of the context in which they were determined to stand alone as
independent, context-free "facts".
A major piece of evidence for these contextually limited results is
the discourse structures that are assumed by the papers. If results
of experiments were in actuality just collections of facts, then
published papers would long ago have taken on the form of lists of
facts. That they don't is strong evidence that such a factual form for
results is unattainable for quite fundamental reasons.
Another piece of evidence is that we have had a lot of problems in
trying to replace human curators with automated processes. Humans can
read papers full of hedges, and separate the wheat from the chaff.
Sometimes firm assertions are associated with certain quite standard
constructions, so various wrappers, usually regular expressions, can
be used to do extraction.
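As a minimal sketch of such a wrapper, assuming Python's re module:
the pattern below accepts only a "subject verb object" construction
with a firm verb, and rejects any sentence containing one of a few
hedge cues. The verb list, the hedge cues, the name extract_firm, and
the example sentences are all made up for illustration; real wrappers
are tuned to a particular corpus.

```python
import re

# Hypothetical wrapper: capitalized name, firm verb, capitalized name.
FIRM = re.compile(
    r"\b(?P<subject>[A-Z][A-Za-z0-9-]*)\s+"
    r"(?P<relation>binds(?: to)?|phosphorylates|inhibits|activates)\s+"
    r"(?P<object>[A-Z][A-Za-z0-9-]*)\b"
)

# A few hedge cues (an assumption); any match disqualifies the sentence.
HEDGE = re.compile(
    r"\b(?:may|might|could|possibly|appears? to|we hypothesize)\b",
    re.IGNORECASE,
)


def extract_firm(sentence):
    """Return (subject, relation, object) for an unhedged match, else None."""
    if HEDGE.search(sentence):
        return None
    m = FIRM.search(sentence)
    if m is None:
        return None
    return (m.group("subject"), m.group("relation"), m.group("object"))


# Invented example sentences, not taken from any paper.
print(extract_firm("ProtA binds to ProtB in vitro."))
print(extract_firm("We hypothesize that ProtA binds to ProtB."))
```

The first call yields a triple and the second yields None, because the
sentence is hedged. Wrappers like this are precise but brittle: they
recover only the constructions someone thought to write a pattern for.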
+++++++++++++++++++++++++++++++++++++++++
3. The use of figures to represent material not easily described in text -
By "figure", I refer to diagrams such as data graphs, gene diagrams,
etc., as well as images such as micrographs, gels, and blots. If one
runs the numbers, e.g., using column-inches as a common measure, it
is not unusual for 50% of a published paper to
revolve around the figures. That is, if we add up the space occupied
by the figures, their captions, the discussion of the figures in
explicit referring sentences ("Fig. 3 shows ..."), and implicit
discussions (sentences discussing figure content but not including
"Fig."), the result is often of the order of 50% of a paper (not
counting the abstract and bibliography).
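The arithmetic behind that estimate can be written out as a short
Python sketch. The function name figure_share and its parameters are
hypothetical, and the numbers in the usage line are invented purely to
show the calculation; they are not measurements from any paper.

```python
def figure_share(figures, captions, explicit_refs, implicit_refs,
                 total, abstract=0.0, bibliography=0.0):
    """Fraction of a paper's body that revolves around its figures.

    All arguments are in the same unit (e.g., column-inches). The body
    is the whole paper minus the abstract and bibliography.
    """
    body = total - abstract - bibliography
    if body <= 0:
        raise ValueError("body length must be positive")
    figure_related = figures + captions + explicit_refs + implicit_refs
    return figure_related / body


# Invented column-inch values, for illustration only.
print(figure_share(figures=20, captions=5, explicit_refs=10,
                   implicit_refs=5, total=100, abstract=10,
                   bibliography=10))
```

The hard part is not the division but deciding which sentences count
as implicit discussion of a figure, which still takes a human reader.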
This means that any formal representation of the content of a paper
has to deal with figure content, not just the text. It is simply not
the case that the figure content is represented somewhere in the text.
The text can give the background for what is shown in a figure as
well as discuss what might be concluded from what is shown in a
figure. But the text does not replicate the figure content. Figures
are included for that reason; they are often the only way certain
information can be reasonably presented.
This also means that any contentful representation of figure content
is going to require analysis of the internal structure of figures.
They cannot be treated as black boxes, any more than we can treat a
sentence or paragraph as an unanalyzable black box.
+++++++++++++++++++++++++++++++++++++++++
4. What does this mean for formal representations of knowledge about
experiments?
What I have described above in no way precludes developing formal
representations of the content of papers. Issues of hedging and
discourse structure have been treated at length in the linguistics and
computational linguistics literature. I have personally done a good
deal of work on figure content, focusing on diagrams, rather than
images.
So how to proceed? We simply have to acknowledge the complexities
inherent in the published biomedical research literature for what they
are and develop ways to represent them. We can certainly zero in on
the firmer statements. Beyond that, all that's left is a large
collection of difficult, profound, exciting, and ultimately
enlightening problems that will keep us all busy for quite a while.
Cheers,
- Bob Futrelle
--
Robert P. Futrelle
Associate Professor
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University MS WVH202
360 Huntington Ave.
Boston, MA 02115
Office: (617)-373-4239
Fax: (617)-373-5121
http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org
http://www.diagrams.org
http://biologicalknowledge.com
On 6/9/06, AJ Chen <canovaj@gmail.com> wrote:
> I have created a wiki page for the Scientific Publishing task force, please
> see
> http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce
>
> The first task is to develop an ontology for self-publishing of experiment.
> I have proposed a list of objects and properties related to self-publishing
> experiment. Please download the attached file under Task Status and review
> the proposal. Your feedback and comments will be greatly appreciated. You
> may also edit the file directly and email me the edited file.
>
> It's critical to have more talents to engage in the task force and its
> tasks. Let me know if you are interested in join the task. If you have any
> new idea for a new task, please make a proposal and share with the group.
>
> Thanks,
> AJ
>
Received on Friday, 9 June 2006 14:41:52 UTC