Re: Seeking Help with finding an assertion from Kei Cheung on 2007-07-06 (public-semweb-lifesci@w3.org from July 2007)

From: Kei Cheung <kei.cheung@yale.edu>
Date: Thu, 05 Jul 2007 22:35:55 -0400
To: Chris Mungall <cjm@fruitfly.org>
Cc: "Skinner, Karen (NIH/NIDA) [E]" <kskinner@nida.nih.gov>, public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>
Message-id: <468DAA8B.80102@yale.edu>
Hi Chris,

Thanks for pointing out the potential flaws of their method. It sounds
like there is room for improvement both in the accuracy of database
contents and in the method of assessing database accuracy. Don't get me
wrong. I think highly of GO. :-)

I'm also thinking more about what "negative knowledge" really means. 
Does it mean any or all of the following:

1. inconsistent knowledge
2. inaccurate knowledge
3. incomplete knowledge
4. knowledge with uncertainties

Can SW/ontologies help turn "negative knowledge" into "positive knowledge"?

-Kei

Chris Mungall wrote:

>
>
> On Jul 4, 2007, at 8:27 PM, Kei Cheung wrote:
>
>>
>> As a follow-up example, a study was done to estimate the error rate
>> of Gene Ontology (GO) annotations:
>>
>> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1892569#id2674403
>>
>> The study estimated GO term annotation error rates for the GoSeqLite
>> database at 13% to 18% for curated non-ISS annotations, 49% for ISS
>> annotations, and 28% to 30% for all curated annotations. (ISS stands
>> for "inferred from sequence similarity".) Despite these findings, the
>> authors concluded that GO is a comparatively high-quality source of
>> information. Integrating databases with significant error rates,
>> however, can negatively affect the quality of science.
>
>
> I have not yet properly digested this paper, but on a cursory reading  
> there appear to be a few serious flaws. First, a lack of  
> understanding of basic ontology principles - annotations to less  
> specific classes in the graph are treated as errors. Second, the  
> authors appear to make a lot of incorrect assumptions about how ISS  
> annotations are curated.
>
> It's curious they predict such a high error rate yet don't provide  
> any examples.
>
>>
>> -Kei
>>
>> Kei Cheung wrote:
>>
>>>
>>> Hi Karen,
>>>
>>> Your questions remind me of the following classic article written  
>>> by Robert Robbins on "Challenges in the Human Genome Project".
>>>
>>> http://www.esp.org/umdnj.pdf
>>>
>>> Although it doesn't directly answer the questions, the
>>> "Nomenclature Problems" section (p. 20-21) discusses the
>>> significant problem of inconsistent knowledge representation. It
>>> says that it's a mistake to believe that terminology fluidity is not
>>> an issue in biological database design. It also says that many
>>> biologists don't realize that, in a database built with 5% error in
>>> the definition of individual concepts, a query that joins across 15
>>> concepts has less than a 50% chance of returning an adequate answer.
>>> The section also points out the importance of formal representation
>>> of scientific knowledge in addressing the inconsistency and
>>> nomenclature problems. Semantic Web and standard ontologies provide
>>> a solution to these database problems. We shouldn't simply convert
>>> an existing database syntactically into a Semantic Web format; we
>>> also need to do careful semantic conversion to eliminate as many
>>> errors, ambiguities, and inconsistencies as possible in order to
>>> reduce the costs of knowledge retrieval and discovery.
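
A rough check of the "less than 50%" figure above, as a minimal sketch. It assumes (this is not stated in the summary above) that each concept definition is correct independently with probability 0.95 and that a join is adequate only if all 15 concepts are correct. The function name is hypothetical and chosen only for illustration.

```python
def join_success_probability(error_rate: float, concepts: int) -> float:
    """Chance that every concept in a join is correct.

    Assumption: errors in individual concepts are independent, so the
    per-concept success probabilities simply multiply.
    """
    return (1.0 - error_rate) ** concepts


# 5% error per concept, join across 15 concepts
p = join_success_probability(0.05, 15)
print(f"{p:.3f}")  # prints 0.463
```

Under that independence assumption the result is about 46%, consistent with the claim; correlated errors would change the number.
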
>>>
>>> -Kei
>>>
>>> Skinner, Karen (NIH/NIDA) [E] wrote:
>>>
>>>> Recently I read somewhere (on this list, a blog, a news story,
>>>> where...?) an assertion that struck me as an interesting passing
>>>> fact at the time. As I recall, it indicated that more websites
>>>> are accessed via a search engine than by typing a URL into a
>>>> browser's address bar.
>>>>
>>>> Alas, I did not save the reference, and now I am looking for the
>>>> proverbial needle in a haystack. Namely, what is the exact
>>>> assertion, who asserted it, and where did they make it? If anyone
>>>> in the world has this information, knows how to get it, or has
>>>> related data, I imagine they would belong to this list. I would
>>>> be most grateful for any useful pointer.
>>>>
>>>> In this same vein, if anyone has any statistics, data,
>>>> anecdotes or information related to the cost of
>>>> (1) "friction" arising from inefficient or inappropriate efforts
>>>> at information retrieval
>>>> and
>>>> (2) "negative knowledge" about an existing resource or data,
>>>>
>>>> these, too, would be helpful.
>>>>
>>>> (For example, with respect to #2 above, we are all familiar with
>>>> comparison shopping for goods and services. We seek
>>>> data/information about prices and quality, but at what point does
>>>> the expenditure of that effort exceed the value of the information
>>>> learned?)
>>>>
>>>> I am not looking for examples at the level of a philosophy or
>>>> economics Ph.D. thesis, but rather a few examples in the sciences
>>>> that can be used at the level of an "elevator speech."
>>>>
>>>>
>>>> Karen Skinner
>>>> Deputy Director for Science and Technology Development
>>>> Division of Basic Neuroscience and Behavior Research
>>>> National Institute on Drug Abuse/NIH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
Received on Friday, 6 July 2007 02:36:12 UTC