problematic aspects of rooting semantic graphs in the use of lexical tags from William Bug on 2006-08-23 (www-tag@w3.org from August 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Wed, 23 Aug 2006 02:44:37 -0400
To: Marja Koivunen <marja@annotea.org>, Tim Berners-Lee <timbl@w3.org>, Xiaoshu Wang <wangxiao@musc.edu>, "Miller, Michael D (Rosetta)" <Michael_Miller@Rosettabio.com>, Alan Ruttenberg <alanruttenberg@gmail.com>, Mark Wilkinson <markw@illuminae.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>, www-tag@w3.org, Adrian Walker <adrianw@snet.net>
Message-Id: <9F820BC2-E8C7-4877-8099-98FFD7A3F8A9@DrexelMed.edu>
Hi All,

There are examples of systems that strive to separate the lexicon  
from the ontology, so as to ensure one particular lexical view of the  
underlying semantics doesn't "lock out" either humans or machines who  
do not "understand" that lexicon.  Few are perfect, but many have  
effectively handled the issue of semantic interoperability, though  
often not at the level of semantic granularity required by experts at  
the bleeding edge of a specific scientific field.

An ontology is of little use to anyone - person or machine - without  
instantiating it via a lexicon.  Where very significant problems  
arise is when the lexicon is confused with the universals the  
ontology is intended to formally represent.  I realize this boundary  
may appear artificial to some, but those who've worked on such issues  
for decades in the library & info sciences and in computational  
linguistics - despite some disagreement at the edges - will generally  
see this boundary as useful - even if they agree to disagree on  
whether it is in fact an artifact of human linguistic expression or a  
more fundamental expression of a sort of Heisenberg Uncertainty  
principle of KE/KR/KD.  What I mean is the moment an algorithm tries  
to compute on an ontological expression in the context of specific  
data instances - whether the algorithm resides in silico or in a  
human brain - it "breaks" the universal nature of the principles and  
grounds it in a lexicon used to address the specific existential  
instances being manipulated within the domain of a specific  
application.  I believe this issue is at the heart of some  
significant confusion regarding what an ontology is and the tasks it  
can help to implement.

An effective and practical knowledge resource needs to include both  
ontological graphs and a complex lexical repository.

I think where "ontology" construction often goes wrong is when it is  
not EXPLICIT and - of equal importance - quite SYSTEMATIC regarding  
the lexical extensions it includes - e.g., abbreviations,  
misspellings, various types of synonyms, homographic homonyms (the  
bane of NLP efforts everywhere), etc.
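
To make the point about explicit, systematic lexical extensions a bit  
more concrete, here is a minimal Python sketch.  It assumes a lexicon  
kept as a separate table that points into the ontology by opaque node  
IDs.  The class names, the "ex:C…" identifiers, and the sample entries  
are all hypothetical and chosen only for illustration; they are not  
taken from any existing ontology or tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class LabelKind(Enum):
    """The kind of lexical extension, recorded explicitly per entry."""
    SYNONYM = "synonym"
    ABBREVIATION = "abbreviation"
    MISSPELLING = "misspelling"


@dataclass(frozen=True)
class LexicalEntry:
    form: str          # surface string as a human would write it
    concept_id: str    # opaque ontology node ID (hypothetical)
    kind: LabelKind


class Lexicon:
    """Maps surface forms to ontology nodes; holds no semantics itself."""

    def __init__(self):
        self._by_form = defaultdict(list)

    def add(self, form, concept_id, kind):
        entry = LexicalEntry(form, concept_id, kind)
        self._by_form[form.lower()].append(entry)

    def concepts_for(self, form):
        """Return every node a form may denote (more than one = homograph)."""
        entries = self._by_form.get(form.lower(), [])
        return sorted({e.concept_id for e in entries})


lexicon = Lexicon()
lexicon.add("neuron", "ex:C1", LabelKind.SYNONYM)
lexicon.add("neurone", "ex:C1", LabelKind.SYNONYM)
lexicon.add("nueron", "ex:C1", LabelKind.MISSPELLING)
lexicon.add("nucleus", "ex:C2", LabelKind.SYNONYM)  # the cell organelle
lexicon.add("nucleus", "ex:C3", LabelKind.SYNONYM)  # a cluster of neurons

print(lexicon.concepts_for("Neurone"))  # ['ex:C1']
print(lexicon.concepts_for("nucleus"))  # ['ex:C2', 'ex:C3']
```

In this sketch the homograph "nucleus" surfaces as an ambiguity in the  
lexicon rather than as a silent collision in the ontology, which is  
the kind of explicitness argued for above.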

I was just listening to Michio Kaku discussing the recent controversy  
regarding the redefinition of "planet" status.  As he and the  
astronomer Ken Croswell were discussing the issue, Dr. Kaku brought  
up the story from Richard Feynman's biography regarding the  
difference between "naming" an entity and studying the fundamental  
properties and rules relating the continuum of entities in the  
physical world.  Both the naming and the formalisms for  
characterizing the fundamentals are human artifacts - BUT what  
separates the naming from the expression of universals is the latter  
is guided by our increasing level of insight and understanding of  
real-world entities and the ways in which they relate to one  
another.  No such criterion exists for the naming process, and this  
is why it is extremely helpful to keep the lexicon characterizing  
these names distinct from the expression of fundamentals (the  
ontologies).  This is also an issue addressed by Gottfried W. von  
Leibniz in his philosophical works which all derived from the insight  
he had as a child that it MIGHT be possible to create a computable  
formalism for ontological entities analogous to the system created by  
mathematicians for performing axiomatic proofs in geometry.  In MANY  
ways, our efforts here date back to this work by Leibniz via several,  
related historical threads in mathematics, philosophy, and various  
computationally-oriented scientific fields.

One other general point - obviously the strategies and "best  
practices" for addressing these issues in the context of existing  
(and historical) data records including the literature are somewhat  
different, as opposed to what we hope to see researchers doing going  
forward.  In an ideal world - say 10 years from now - we can hope to  
see publication mechanisms in place both for primary data, supporting  
reduction/analysis/interpretation, and the larger world of the  
scientific literature - systems such as SWAN and some of the more  
advanced systems in development at BioMed Central and PLoS - to help  
reduce the complexity of the lexical Babel-esque landscape we must  
currently contend with.  This needs to be done in a manner that  
doesn't in any way restrict the expressiveness of the lexicon or the  
ontological foundations, while also being implemented in a highly  
intuitive manner that does not require the researcher to learn a  
complex formal means of expression beyond the existing complexity  
typically used amongst domain experts.  This is why I'd still place this 10  
years out.  I don't think that's too optimistic a duration, however,  
given some of the revolutionary changes being introduced both by the  
SWTech C.S. community, as well as by the community of researchers  
embedded in the increasingly less messy process of biomedical  
ontology development and use.  Some of these more modern scientific  
publication systems will come on line much sooner than this, but  
probably only in restricted contexts where there is a centralized  
authority that can both provide technical resources to develop,  
support, and evolve the systems, as well as enforce a certain level  
of compliance amongst its users - e.g., caBIG, the eScience myGRID  
project, REWERSE, The MIND Center at MGH, the BIRN project, etc.  
For better or worse, as great a profile as these organizations  
represent, the landscape of working neuroscientists extends way  
beyond this privileged environment, and we all hope to see our  
efforts be of use and relevant to all neuroscientists (given the  
current scope of the HCLSIG hosted efforts is focussed on the  
neurosciences) and the value it can help neuroscientists realize for  
society at-large.

As an example of where things can go wrong when convolving the  
lexicon with the ontology, take an artifact as relatively simple and  
seemingly "self-evident" as the "preferred label" or "preferred term"  
for a node in an ontological graph.  In making the assertion  
"preferred", there is the implication some person or agency has  
passed judgement on the term.  Reconciling two ontologies with  
overlapping knowledge domains can be made unnecessarily difficult  
when this implied contract is not made explicit.  In other words, if  
you focus on reconciling the terms rather than reconciling the  
underlying semantic graphs, you can run into many unnecessary  
problems.  I believe this issue is related to many of the discussions  
we've had on this list over the past 3 months both regarding ontology  
construction and use, as well as URI uniqueness and versioning  
contract.  Formalisms such as SKOS can be extremely helpful in this  
regard, as we need to compute on the lexicon, as well as the  
ontological graph.
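
As a rough illustration of making the "preferred" contract explicit,  
the Python sketch below records who asserted each preferred label and  
compares two sources by node ID rather than by term.  It is only  
loosely in the spirit of SKOS preferred labels.  The authority names  
"group-A" and "group-B", the identifier "ex:C1", and the helper  
functions are hypothetical, and the neuron/neurone spelling pair is  
used purely as sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferredLabel:
    concept_id: str   # opaque ontology node ID (hypothetical)
    label: str        # the term judged "preferred"
    asserted_by: str  # who passed that judgement (hypothetical names)


def preferred(labels, concept_id, authority):
    """Return the label one authority prefers for a node, or None."""
    for pl in labels:
        if pl.concept_id == concept_id and pl.asserted_by == authority:
            return pl.label
    return None


def label_conflicts(labels):
    """Nodes on which authorities disagree about the preferred term."""
    seen = {}
    for pl in labels:
        seen.setdefault(pl.concept_id, set()).add(pl.label)
    return {cid: sorted(forms) for cid, forms in seen.items() if len(forms) > 1}


labels = [
    PreferredLabel("ex:C1", "neuron", "group-A"),
    PreferredLabel("ex:C1", "neurone", "group-B"),
]

print(preferred(labels, "ex:C1", "group-B"))  # neurone
print(label_conflicts(labels))  # {'ex:C1': ['neuron', 'neurone']}
```

Because both records point at the same node, the disagreement stays in  
the lexical layer: the two sources reconcile on the graph first, and  
the competing "preferred" terms are simply reported alongside the  
party that asserted each one.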

To offer a relatively simple and ubiquitous example from neuroscience  
- on one side of the pond they prefer "neurone", while on the other  
"neuron" is the standard term.  Is one more true?  Do they refer to  
different, underlying fundamental entities?  Can we even call the  
underlying entities "fundamental" when any neuroscientist would admit  
there is no neuron/neurone which has been explicitly qualified down  
to the level of all its constituent molecules**, along with their  
explicit disposition in space and time?

I won't hold you at bay.  I'll give you my sense of the "practical"  
answers to these questions.

	 Is one more true?
Obviously not, since they are just lexical habits, as opposed to  
fundamental differences in the view of the world.

	Do they refer to different, underlying fundamental entities?
This is a harder call - and very context dependent, obviously.  It  
will be acutely sensitive to the level of granularity of the  
information provided on the neuron/neurone.  If you presented two  
neuroscientists with coarse-grained data on a neuron/neurone, it is  
likely they could come to agreement they both were referring to the  
same fundamental entity when they named the source of that data as a  
neuron/neurone.

	Can we call the underlying entities "fundamental" when any  
neuroscientist would admit there is no neuron/neurone which has been  
explicitly qualified down to the level of all its constituent  
molecules, along with their explicit disposition through time?
What happens when you provide more detailed information regarding the  
purported neuron/neurone - say sufficient detail so that the two  
neuroscientists find aspects of data interpretation that are  
incommensurable in the Kuhnian sense  
(http://plato.stanford.edu/entries/thomas-kuhn/)?  Then, even if the two referred to the  
biological material entity that was the source of the data as a  
"neuron", they would likely not agree they were referring to the  
same, underlying fundamental entity.  This is not unlike the  
situation described several posts below in this thread regarding a  
"gene".  There could be a gene X identified by gene finding algorithm  
1, an "identical" gene X (in terms of the coding sequences it  
contains) derived from gene finding algorithm 2, the same gene X  
defined via a chromosomal walk, and finally a gene X defined via  
conventional genetic complementarity or hybrid mapping.  They could  
all contain the same coding sequence - or the same as yet  
functionally unidentified ESTs.  What it comes down to here, as Mark  
Wilkinson stated deep in the thread, is that there is much confusion  
regarding what actual material entity is being referenced - or  
whether a material entity is being referenced at all.

In the end, I hope what SWTech can help us do is provide a robust,  
shared means to express the semantic facts about the data collected,  
as well as providing a dynamic and semi-automatic means to improve  
our characterization of the fundamentals -  semi-automatic in the  
sense of "augmentation" of human intellectual abilities along the  
lines pursued by Doug Engelbart and Vannevar Bush before him.  If we  
can devise a technical infrastructure allowing the formal, shared,  
semantic description of data to evolve toward an ever converging  
sense of what the true underlying entities are, then many of the  
misgivings folks have regarding the use of ontological frameworks to  
formally express semantic information will very likely fade.

Cheers,
Bill


**Biophysicists who study ion-channel kinetics, protein folding  
dynamics, rhodopsin-based photon detection, mitochondrial energy  
transfer, etc. would probably also include quantum level formalisms  
to represent the states and dynamics of atoms, electrons, and sub- 
atomic particles.


On Aug 22, 2006, at 3:57 PM, Marja Koivunen wrote:

>
> I agree, consistent use of terms makes life easier for machines and  
> for humans too when the terms have been agreed on, learned, and  
> understood. Unfortunately, this takes a lot of effort and  
> dedication from the humans. Learning a whole ontology before  
> anything can be done is a bit like reading the whole manual of a  
> DVD player before one can use that. And we all know that while  
> there are people who actually read the whole manual, they are a  
> minority.
>
> As a usability person I always like to see the machines support the  
> humans as much as possible and not vice versa.
> In my view, new inventions often start from not so great terms and  
> evolve stepwise as learning happens. Often terms are first shared  
> and polished in small groups and later links are made between  
> groups that may use different terminologies for similar things. If  
> we want to support humans doing inventions I think we should  
> support the use of different terms, their evolution, and making  
> connections between similar terms when they are discovered as much  
> as possible. And I think Semantic Web is great for that.
>
> Marja
>
> Tim Berners-Lee wrote:
>
>>
>> Yes, indeed.  Machine processing of information relies on
>> consistent usage of terms. You can't reuse information for
>> new problems when its use requires human intervention to  
>> disambiguate  it.
>>
>> Tim Berners-Lee
>>
>> On Aug 10, 2006, at 21:54, wangxiao@musc.edu wrote:
>>
>>>
>>> Quoting "Miller, Michael D (Rosetta)"  
>>> <Michael_Miller@Rosettabio.com>:
>>>
>>>> You're correct here but it is the state of the art.  Interestingly
>>>> enough, I've found that in general the biology-based scientists and
>>>> investigators are not all that bothered by this confusion and  
>>>> despite
>>>> the confusion seem to make their way through it.
>>>
>>>
>>> The problem is that semantic web is intended to make machine to   
>>> understand.  And
>>> the clarity is a prerequisite to instruct machine unambigously.
>>>
>>> Xiaoshu
>>>
>>
>>
>
>

Bill Bug
Senior Research Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Wednesday, 23 August 2006 06:44:58 UTC