Re: scientific publishing task force update

Title:  The content of biomedical research and its representation in papers

This note was stimulated by a note from A J Chen posted today, 9 June
2006, on public-semweb-lifesci@w3.org, with a link to:
http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce

I realize that I am going beyond what Chen is attempting to cover, but
his note did trigger a desire to discuss broader issues.

My note below is being sent to both the above semantic web list and to
the BioNLP.org list.

 - Bob Futrelle

+++++++++++++++++++++++++++++++++++++++++

Chen's "An Ontology for Experiment Self-Publishing" is an interesting
and important topic, to be sure.  But any such effort needs to be
placed in the context of what we already know about how results are
published today in the biomedical research literature.  A good deal
can be said about this, but I will limit myself to just four issues
in this note.

1. The nature of the discourse, especially hedging.
2. The strategies behind the design of experiments; context.
3. The use of figures to represent material not easily described in text.
4. What does this mean for formal representations of knowledge about
experiments?

+++++++++++++++++++++++++++++++++++++++++

1. The nature of discourse, and the preponderance of hedging -

This is an important point.  It concerns the fact that biomedical
research papers are not simply a collection of statements of fact, far
from it. Papers are loaded with qualitative phrases and indefinite
terms.  This is a natural consequence of the fact that living systems
are extraordinarily complex so that it is difficult in many (most?)
cases to make flat factual statements about them.

The simplest way to see this is to go through papers, sentence
by sentence, looking for hedges.  A hedge is a qualifying term that
reduces the certainty of the statement or predication that it appears
in. Hedges are so numerous that there are entire books devoted to the
topic, e.g., "Hedging in Scientific Research Articles" by Ken Hyland,
308 pp (John Benjamins, 1998).  See also:
http://en.wikipedia.org/wiki/Hedge_(linguistics)
Hedges in biomedical research papers sometimes appear more than once
in a single sentence and sometimes not at all.  Below are some
examples of hedge constructions I found in some recent papers, which
averaged close to one hedge construction per sentence:

  possibility
  slightly
  less than 2-fold
  far greater than
  similar effects
  reducing
  significant
  reduced
  indicating
  indeed
  synergistically increased
  presumptive
  well ordered
  located ideally
  interact with
  significantly reduce
  abolished
  may contribute
  similar to
  in the present structure
  joins this network
  complicated
  without reducing
  impaired
  defective
  important
  were implicated
  voluminous
  limited
  dispersed randomly
  rarely seen in close proximity
  blocked
  we hypothesize that
  assertion based on
  accumulated
  numerous
  resembling
  a clue came from
  associated with
  reminiscent of
  propensity

The list could go on and on before reaching anything resembling
closure.  Some of the examples above clearly indicate the contextual
limitations of a statement, e.g., "in the present structure", and
"assertion based on".

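The sentence-by-sentence pass described above can be sketched in a few
lines of Python.  This is only a rough illustration, not a tool I have
used: the lexicon is a small subset of the list above, the sentence
splitter is deliberately naive (it will break on "Fig. 3", for
example), and the names HEDGES, hedges_per_sentence, and hedge_rate,
as well as the sample sentences, are made up for this sketch.

```python
import re

# Illustrative lexicon: a handful of the constructions listed above.
HEDGES = [
    "may contribute",
    "we hypothesize that",
    "similar to",
    "reminiscent of",
    "associated with",
    "possibility",
    "presumptive",
    "slightly",
]

# Try longer phrases first so a multiword hedge is matched as a unit.
HEDGE_RE = re.compile(
    r"\b("
    + "|".join(re.escape(h) for h in sorted(HEDGES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def hedges_per_sentence(text):
    """Return (sentence, [hedges found]) for each sentence in the text."""
    return [
        (s, [m.lower() for m in HEDGE_RE.findall(s)])
        for s in split_sentences(text)
    ]


def hedge_rate(text):
    """Average number of hedge constructions per sentence."""
    pairs = hedges_per_sentence(text)
    if not pairs:
        return 0.0
    return sum(len(found) for _, found in pairs) / len(pairs)


# Invented sentences, for illustration only.
SAMPLE = (
    "The mutation may contribute to the slightly lower activity. "
    "The protein was purified. "
    "We hypothesize that the domain is similar to a kinase fold."
)

if __name__ == "__main__":
    for sentence, found in hedges_per_sentence(SAMPLE):
        print(found, "<-", sentence)
    print("hedges per sentence:", hedge_rate(SAMPLE))
```

A fixed lexicon like this will miss most hedges and cannot tell when a
listed word is being used without any hedging force, which is exactly
why the list above never reaches closure.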
+++++++++++++++++++++++++++++++++++++++++

2. The strategies behind the design of experiments; context -

Experiments with complex living systems or in vitro biochemical
investigations always have to look at a limited subset of systems and
phenomena, at limited aspects of a problem.  All results are therefore
contextually limited. Only a modest subset of statements can be taken
out of the context in which they were determined to stand alone as
independent, context-free "facts".

A major piece of evidence that results are contextually limited is
the discourse structure that the papers themselves adopt.  If results
of experiments were in actuality just collections of facts, then
published papers would long ago have taken on the form of lists of
facts. That they don't is strong evidence that such a factual form for
results is unattainable for quite fundamental reasons.

Another piece of evidence is that we have had a lot of problems in
trying to replace human curators with automated processes.  Humans can
read papers full of hedges, and separate the wheat from the chaff.
Sometimes firm assertions are associated with certain quite standard
constructions, so various wrappers, usually regular expressions, can
be used to do extraction.
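
As a minimal sketch of such a wrapper, assuming a single hypothetical
pattern of the form "<A> binds to / interacts with / phosphorylates
<B>": the pattern, the crude hedge filter, and the placeholder names
ProteinA and ProteinB are all invented for illustration and are not
taken from any existing extraction system.

```python
import re

# Hypothetical entity shape: a letter followed by letters, digits, or hyphens.
ENTITY = r"[A-Za-z][A-Za-z0-9-]*"

# Hypothetical wrapper for one standard construction.
ASSERTION_RE = re.compile(
    rf"\b(?P<subject>{ENTITY})\s+"
    r"(?P<relation>binds to|interacts with|phosphorylates)\s+"
    rf"(?P<object>{ENTITY})\b"
)

# Crude filter: skip any sentence that contains one of these hedge cues.
HEDGE_RE = re.compile(
    r"\b(may|might|could|possibly|presumably|appears to|we hypothesize)\b",
    re.IGNORECASE,
)


def extract_firm_assertions(sentence):
    """Return (subject, relation, object) triples from an unhedged sentence."""
    if HEDGE_RE.search(sentence):
        return []  # hedged sentences are left for a human curator
    return [
        (m["subject"], m["relation"], m["object"])
        for m in ASSERTION_RE.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract_firm_assertions("ProteinA interacts with ProteinB in vitro."))
    print(extract_firm_assertions("ProteinA possibly binds to ProteinB."))
```

Without the hedge filter, the second sentence would yield a triple
with "possibly" as its subject, which illustrates how brittle such
wrappers are outside the standard constructions.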

+++++++++++++++++++++++++++++++++++++++++

3. The use of figures to represent material not easily described in text -

By "figure", I refer to diagrams such as data graphs, gene diagrams,
etc., as well as images such as micrographs, gels, and blots.  If one
runs the numbers, e.g., using column-inches as a common measure, it is
not unusual to find that 50% of a published paper revolves around the
figures.  That is, if we add up the space occupied by the figures,
their captions, the discussion of the figures in explicit referring
sentences ("Fig. 3 shows ...."), and implicit discussions (sentences
discussing figure content but not including "Fig."), the result is
often of the order of 50% of a paper (not counting the abstract and
bibliography).
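
The tally just described is simple arithmetic.  A sketch follows, with
the caveat that the class name, field names, and all of the
column-inch values are invented for illustration and are not
measurements from any real paper.

```python
from dataclasses import dataclass


@dataclass
class PaperLayout:
    """Column-inches per category; abstract and bibliography are excluded."""

    figures: float              # the figures themselves
    captions: float             # figure captions
    explicit_refs: float        # sentences containing "Fig. N"
    implicit_discussion: float  # sentences about figure content, no "Fig."
    other_text: float           # all remaining body text

    def figure_related(self) -> float:
        """Total space that revolves around the figures."""
        return (
            self.figures
            + self.captions
            + self.explicit_refs
            + self.implicit_discussion
        )

    def figure_share(self) -> float:
        """Fraction of the measured paper that is figure-related."""
        total = self.figure_related() + self.other_text
        return self.figure_related() / total if total else 0.0


# Invented values, for illustration only.
example = PaperLayout(
    figures=40.0,
    captions=12.0,
    explicit_refs=10.0,
    implicit_discussion=18.0,
    other_text=80.0,
)

if __name__ == "__main__":
    print(f"figure-related share: {example.figure_share():.0%}")
```

The hard part in practice is not the sum but deciding which sentences
count as implicit discussion of a figure, which requires reading the
paper.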

This means that any formal representation of the content of a paper
has to deal with figure content, not just the text.  It is simply not
the case that the figure content is represented somewhere in the text.
The text can give the background for what is shown in a figure as
well as discuss what might be concluded from what is shown in a
figure.  But the text does not replicate the figure content.  Figures
are included for that reason; they are often the only way certain
information can be reasonably presented.

This also means that any contentful representation of figure content
is going to require analysis of the internal structure of figures.
They cannot be treated as black boxes, any more than we can treat a
sentence or paragraph as an unanalyzable black box.

+++++++++++++++++++++++++++++++++++++++++

4. What does this mean for formal representations of knowledge about
experiments?

What I have described above in no way precludes developing formal
representations of the content of papers. Issues of hedging and
discourse structure have been treated at length in the linguistics and
computational linguistics literature.  I have personally done a good
deal of work on figure content, focusing on diagrams, rather than
images.

So how to proceed?  We simply have to acknowledge the complexities
inherent in the published biomedical research literature for what they
are and develop ways to represent them.  We can certainly zero in on
the firmer statements. Beyond that, all that's left is a large
collection of difficult, profound, exciting, and ultimately
enlightening problems that will keep us all busy for quite a while.

Cheers,

  -  Bob Futrelle
-- 
Robert P. Futrelle
    Associate Professor
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University MS WVH202
360 Huntington Ave.
Boston, MA 02115

Office: (617)-373-4239
Fax:    (617)-373-5121
http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org
http://www.diagrams.org
http://biologicalknowledge.com


On 6/9/06, AJ Chen <canovaj@gmail.com> wrote:
> I have created a wiki page for the Scientific Publishing task force, please
> see
> http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce
>
>  The first task is to develop an ontology for self-publishing of experiment.
> I have proposed a list of objects and properties related to self-publishing
> experiment. Please download the attached file under Task Status and review
> the proposal. Your feedback and comments will be greatly appreciated.  You
> may also edit the file directly and email me the edited file.
>
>  It's critical to have more talents to engage in the task force and its
> tasks. Let me know if you are interested in join the task. If you have any
> new idea for a new task, please make a proposal and share with the group.
>
>  Thanks,
>  AJ
>

Received on Friday, 9 June 2006 14:41:52 UTC