- From: Bob Futrelle <bob.futrelle@gmail.com>
- Date: Fri, 9 Jun 2006 10:41:38 -0400
- To: "AJ Chen" <canovaj@gmail.com>, public-semweb-lifesci@w3.org
Title: The content of biomedical research and its representation in papers

This note was stimulated by a note from A J Chen posted today, 9 June 2006, on public-semweb-lifesci@w3.org, with a link to:

http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce

I realize that I am going beyond what Chen is attempting to cover, but his note did trigger a desire to discuss broader issues. My note below is being sent to both the above semantic web list and to the BioNLP.org list.

 - Bob Futrelle

+++++++++++++++++++++++++++++++++++++++++

Chen's "An Ontology for Experiment Self-Publishing" is an interesting and important topic, to be sure. But it is important to place any such effort in the context of what we already know about how results are published today in the biomedical research literature. A good deal can be said about this, but I will limit myself to just four issues in this note.

1. The nature of the discourse, especially hedging.
2. The strategies behind the design of experiments; context.
3. The use of figures to represent material not easily described in text.
4. What does this mean for formal representations of knowledge about experiments?

+++++++++++++++++++++++++++++++++++++++++

1. The nature of discourse, and the preponderance of hedging -

This is an important point. It concerns the fact that biomedical research papers are not simply a collection of statements of fact, far from it. Papers are loaded with qualitative phrases and indefinite terms. This is a natural consequence of the fact that living systems are extraordinarily complex, so that it is difficult in many (most?) cases to make flat factual statements about them. The simplest way to see this is to go through papers, sentence by sentence, looking for hedges. A hedge is a qualifying term that reduces the certainty of the statement or predication that it appears in.
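
As a rough illustration of the sentence-by-sentence pass just described, here is a minimal Python sketch. It rests on several assumptions: the cue list is only a handful of the constructions listed later in this section, the sentence splitter is deliberately naive, and the function names are invented for illustration. Simple string matching also cannot tell whether a phrase is really acting as a hedge in its context, so the counts are at best suggestive.

```python
import re

# Illustrative hedge cues only: a few of the constructions listed in this
# note. A realistic lexicon would be much larger and context-sensitive.
HEDGE_CUES = [
    "may contribute",
    "we hypothesize that",
    "associated with",
    "reminiscent of",
    "similar to",
    "in the present structure",
]

# One alternation pattern, longer cues first, matched on word boundaries.
_CUE_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(c) for c in sorted(HEDGE_CUES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def hedges_per_sentence(text):
    """Return a list of (sentence, [matched cues]) pairs, one per sentence."""
    return [
        (s, [m.group(0).lower() for m in _CUE_PATTERN.finditer(s)])
        for s in split_sentences(text)
    ]


def mean_hedges(text):
    """Average number of matched cues per sentence (0.0 for empty text)."""
    pairs = hedges_per_sentence(text)
    return sum(len(cues) for _, cues in pairs) / len(pairs) if pairs else 0.0
```

Calling hedges_per_sentence() on a paragraph from a paper gives the cues matched in each sentence, and mean_hedges() gives the per-sentence average, which is the kind of figure quoted for the examples in this section.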

Hedges are so numerous that there are entire books devoted to the topic, e.g., "Hedging in Scientific Research Articles" by Ken Hyland, 308 pp (John Benjamins, 1998). See also:

http://en.wikipedia.org/wiki/Hedge_(linguistics)

Hedges in biomedical research papers sometimes appear more than once in a single sentence, and sometimes not at all. Below are some examples of hedge constructions I found in some recent papers, averaging close to one hedge construction per sentence:

possibility slightly less than 2-fold far greater than similar effects reducing significant reduced indicating indeed synergistically increased presumptive well ordered located ideally interact with significantly reduce abolished may contribute similar to in the present structure joins this network complicated without reducing impaired defective important were implicated voluminous limited dispersed randomly rarely seen in close proximity blocked we hypothesize that assertion based on accumulated numerous resembling a clue came from associated with reminiscent of propensity

The list could go on and on before reaching anything resembling closure. Some of the examples above clearly indicate the contextual limitations of a statement, e.g., "in the present structure" and "assertion based on".

+++++++++++++++++++++++++++++++++++++++++

2. The strategies behind the design of experiments; context -

Experiments with complex living systems or in vitro biochemical investigations always have to look at a limited subset of systems and phenomena, at limited aspects of a problem. All results are therefore contextually limited. Only a modest subset of statements can be taken out of the context in which they were determined to stand alone as independent, context-free "facts". A major piece of evidence for these contextually limited results is the set of discourse structures that the papers assume.
If results of experiments were in actuality just collections of facts, then published papers would long ago have taken on the form of lists of facts. That they don't is strong evidence that such a factual form for results is unattainable for quite fundamental reasons.

Another piece of evidence is that we have had a lot of problems in trying to replace human curators with automated processes. Humans can read papers full of hedges and separate the wheat from the chaff. Sometimes firm assertions are associated with certain quite standard constructions, so various wrappers, usually regular expressions, can be used to do extraction.

+++++++++++++++++++++++++++++++++++++++++

3. The use of figures to represent material not easily described in text -

By "figure", I refer to diagrams such as data graphs, gene diagrams, etc., as well as images such as micrographs, gels, and blots. If one runs the numbers, e.g., using column-inches as a common measure, what is found is that it is not unusual for 50% of a published paper to revolve around the figures. That is, if we add up the space occupied by the figures, their captions, the discussion of the figures in explicit referring sentences ("Fig. 3 shows ...."), and implicit discussions (sentences discussing figure content but not including "Fig."), the result is often of the order of 50% of a paper (not counting the abstract and bibliography).

This means that any formal representation of the content of a paper has to deal with figure content, not just the text. It is simply not the case that the figure content is represented somewhere in the text. The text can give the background for what is shown in a figure as well as discuss what might be concluded from what is shown in a figure. But the text does not replicate the figure content. Figures are included for that reason; they are often the only way certain information can be reasonably presented.

This also means that any contentful representation of figure content is going to require analysis of the internal structure of figures. They cannot be treated as black boxes, any more than we can treat a sentence or paragraph as an unanalyzable black box.

+++++++++++++++++++++++++++++++++++++++++

4. What does this mean for formal representations of knowledge about experiments?

What I have described above in no way precludes developing formal representations of the content of papers. Issues of hedging and discourse structure have been treated at length in the linguistics and computational linguistics literature. I have personally done a good deal of work on figure content, focusing on diagrams rather than images.

So how to proceed? We simply have to acknowledge the complexities inherent in the published biomedical research literature for what they are and develop ways to represent them. We can certainly zero in on the firmer statements. Beyond that, all that's left is a large collection of difficult, profound, exciting, and ultimately enlightening problems that will keep us all busy for quite a while.

Cheers,

 - Bob Futrelle

--
Robert P. Futrelle
Associate Professor
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University MS WVH202
360 Huntington Ave.
Boston, MA 02115

Office: (617)-373-4239
Fax: (617)-373-5121

http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org
http://www.diagrams.org
http://biologicalknowledge.com

On 6/9/06, AJ Chen <canovaj@gmail.com> wrote:
> I have created a wiki page for the Scientific Publishing task force, please
> see
> http://esw.w3.org/topic/HCLS/ScientificPublishingTaskForce
>
> The first task is to develop an ontology for self-publishing of experiment.
> I have proposed a list of objects and properties related to self-publishing
> experiment. Please download the attached file under Task Status and review
> the proposal. Your feedback and comments will be greatly appreciated. You
> may also edit the file directly and email me the edited file.
>
> It's critical to have more talents to engage in the task force and its
> tasks. Let me know if you are interested in join the task. If you have any
> new idea for a new task, please make a proposal and share with the group.
>
> Thanks,
> AJ
>
Received on Friday, 9 June 2006 14:41:52 UTC