Re: BioRDF Telcon from Kei Cheung on 2009-12-14 (public-semweb-lifesci@w3.org from December 2009)

From: Kei Cheung <kei.cheung@yale.edu>
Date: Sun, 13 Dec 2009 21:32:41 -0500
To: mdmiller <mdmiller53@comcast.net>
CC: Jim McCusker <james.mccusker@yale.edu>, Helena Deus <helenadeus@gmail.com>, HCLS <public-semweb-lifesci@w3.org>
Message-ID: <4B25A3C9.9090504@yale.edu>
Hi Michael,

Thanks for pointing to MSigDB. I included this in the related links 
section of the microarray use case description 
(http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/QueryFederation2). Also 
please see my response below.

Cheers,

-Kei

mdmiller wrote:

> hi all,
>
> here is the link to Molecular Signatures Database (MSigDB):
>
> [1]: http://www.broadinstitute.org/gsea/msigdb/
>
> cheers,
> michael
>
> ----- Original Message ----- From: "mdmiller" <mdmiller53@comcast.net>
> To: "Kei Cheung" <kei.cheung@yale.edu>
> Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus" 
> <helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
> Sent: Thursday, December 10, 2009 6:49 AM
> Subject: Re: BioRDF Telcon
>
>
>> hi kei,
>>
>>> To me, ontologies can be used to facilitate integrated semantic 
>>> queries across experiments/datasets.
>>
>>
>> yes, and this is starting to become a reality.  this effort, along 
>> with other HCLS initiatives, is helping to pave the way.
>>
>>> While some of the protocols are standardized, the data protocols for 
>>> obtaining things like gene lists vary a lot. One of my questions is 
>>> whether such data analysis protocols can somehow be entered into MAGE-TAB.
>>
>>
>> yes it can be, along with the gene list, but in practice this is not 
>> done by the submitter.  after the Derived Array Data representing the 
>> normalized data, like CHP files, there can be one or more Protocol 
>> REF columns describing the analysis to obtain the gene list followed 
>> by a Derived Array Data Matrix File that is the gene list with its 
>> signature.
>>
>> perhaps MIAME needs to be extended to state this.  it's something 
>> i'll be bringing up with the MGED board.  it's just now that this has 
>> become something of value to be machine readable.  besides GeneSigDB, 
>> there is another effort, MSigDB [1], that is also curating gene 
>> lists.  so the community is beginning to see the value of this.
>>
Yes, for these gene lists to be of value to researchers, rich annotation 
is key. The challenge here is that it's quite tedious to enter the 
custom data analysis protocols in a structured way by hand.
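
As a rough illustration of the column layout Michael describes above, the sketch below reads a made-up SDRF fragment in which a Derived Array Data File is followed by a Protocol REF column and then a Derived Array Data Matrix File holding the gene list. Only the column headers come from MAGE-TAB; the file names, the protocol name, and the helper function are hypothetical.

```python
import csv
import io

# Hypothetical SDRF fragment (tab-delimited). File and protocol names are
# invented for illustration; the column headers follow MAGE-TAB.
SDRF_TEXT = (
    "Source Name\tDerived Array Data File\tProtocol REF\t"
    "Derived Array Data Matrix File\n"
    "patient_1\tsample1.CHP\tgene_list_selection\tgene_list.txt\n"
    "patient_2\tsample2.CHP\tgene_list_selection\tgene_list.txt\n"
)


def gene_list_provenance(sdrf_text):
    """Return (normalized file, analysis protocols, gene list file) per row."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    derived = header.index("Derived Array Data File")
    matrix = header.index("Derived Array Data Matrix File")
    # Protocol REF columns between the two data columns describe the
    # analysis that turns normalized data into the gene list.
    protocol_cols = [
        i for i in range(derived + 1, matrix) if header[i] == "Protocol REF"
    ]
    rows = []
    for row in reader:
        protocols = [row[i] for i in protocol_cols]
        rows.append((row[derived], protocols, row[matrix]))
    return rows


PROVENANCE = gene_list_provenance(SDRF_TEXT)
for normalized, protocols, gene_list in PROVENANCE:
    print(normalized, "->", " -> ".join(protocols), "->", gene_list)
```

A plain csv.reader is used rather than a dictionary reader because an SDRF file may repeat the Protocol REF header, and duplicate keys would otherwise overwrite one another.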


>>> At least for now, I don't think we need to convert the huge primary 
>>> data files (e.g., CEL file) into RDF. For the time being, we are 
>>> more focused on the processed gene lists that may be associated with 
>>> more biological meanings.
>>
>>
>> perhaps it's worthwhile considering using an ontology 'raw data' class 
>> for raw data that contains a reference to the data file.  one could 
>> then use appropriate analysis tools to produce normalized data which 
>> could then also be referenced by a 'normalized data' class.
>
It seems to make sense.
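
One possible reading of this suggestion, sketched in plain Python with no RDF library: each data node gets a class and a reference to its file, and the file contents never enter the RDF. The namespace, the class names RawData and NormalizedData, the has_data_file property, and the file URLs are all assumptions made for illustration, not terms from a published ontology.

```python
# Hypothetical namespace and term names; none of these come from a
# published ontology.
EX = "http://example.org/microarray#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def data_triples(node_id, data_class, file_url):
    """Describe a data node by class and file reference, without loading the file."""
    node = EX + node_id
    return [
        (node, RDF_TYPE, EX + data_class),
        (node, EX + "has_data_file", file_url),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


TRIPLES = (
    data_triples("raw_1", "RawData", "http://example.org/files/sample1.CEL")
    + data_triples("norm_1", "NormalizedData", "http://example.org/files/sample1.CHP")
)
print(to_ntriples(TRIPLES))
```

This keeps the graph at two triples per data file regardless of how many probes the CEL or CHP file contains.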


>>
>> cheers,
>> michael
>>
>> ----- Original Message ----- From: "Kei Cheung" <kei.cheung@yale.edu>
>> To: "mdmiller" <mdmiller53@comcast.net>
>> Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus" 
>> <helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
>> Sent: Monday, December 07, 2009 7:32 AM
>> Subject: Re: BioRDF Telcon
>>
>>
>>> mdmiller wrote:
>>>
>>>> hi jim and lena,
>>>>
>>>> great progress!  this will be a nice tool.
>>>>
>>>> a couple of comments.
>>>>
>>>> 1) i think ProtocolApplication is best seen as an individual 
>>>> instance of the Protocol class.  quite often there are arguments 
>>>> whether ontologies should have individuals or be simply classes.  
>>>> to me, that doesn't apply here where real world objects are being 
>>>> connected to ontologies.  the BioSource is realized as the  'Source 
>>>> Name' column in MAGE-TAB and those entries represent real people in 
>>>> studies, mice or rats in non-clinical studies, etc., and the 
>>>> characteristics values like age represent real individual instances 
>>>> of age.  in the same way, the values in the Protocol REF column of 
>>>> MAGE-TAB are real wet-lab or analysis individual instances of 
>>>> protocols, called protocol applications in MAGE-OM.
>>>
>>> It sounds like we need to look at how to map column names and 
>>> entries to classes, instances, and relationships appropriately.
>>>
>>>>
>>>> failure to make this distinction, to me, has obscured how much 
>>>> value ontologies can have in the real world.  too often i see 
>>>> ontologies seen in and of themselves, which has its own value 
>>>> certainly, but not for the use cases i have dealing with real 
>>>> biological data.
>>>
>>>
>>> To me, ontologies can be used to facilitate integrated semantic 
>>> queries across experiments/datasets.
>>>
>>>>
>>>> 2) for this use case, the information between the 'Source Name' 
>>>> and its characteristics and the 'Derived Array Data Matrix File' 
>>>> or 'Derived Array Data File' has limited usefulness.  error 
>>>> correction and normalization can make some 
>>>> difference but if the provider of the MAGE-TAB is trusted, all that 
>>>> is pretty routine these days.  the above combined with experimental 
>>>> factors and experiment design info is probably 95% to 99.9% of the 
>>>> worthwhile information from the MAGE-TAB.  if one notices a 
>>>> difference in the final gene set between two experiments that look 
>>>> the same, only then it might be worthwhile going into more detail.
>>>>
>>>> and as has been noted, the MAGE-TAB information needs to be 
>>>> supplemented with the information on the final gene set, its 
>>>> expression values, and the higher-level analysis that was 
>>>> used, which is usually buried in the paper.
>>>
>>> While some of the protocols are standardized, the data protocols for 
>>> obtaining things like gene lists vary a lot. One of my questions is 
>>> whether such data analysis protocols can somehow be entered into MAGE-TAB.
>>>
>>>>
>>>> 3) i'm not sure if there was a desire to capture the raw data in 
>>>> the RDF. that will be, for affymetrix, a million to six million 
>>>> probes in the CEL file, even the processed data in the CHP file 
>>>> would have 20,000 to 60,000 probe sets.  i'm not sure if that is 
>>>> the best way to represent that.
>>>
>>> At least for now, I don't think we need to convert the huge primary 
>>> data files (e.g., CEL file) into RDF. For the time being, we are 
>>> more focused on the processed gene lists that may be associated with 
>>> more biological meanings.
>>>
>>> Cheers,
>>>
>>> -Kei
>>>
>>>>
>>>> cheers,
>>>> michael
>>>>
>>>> Michael Miller
>>>> mdmiller53@comcast.net
>>>>
>>>> ----- Original Message ----- From: "Jim McCusker" 
>>>> <james.mccusker@yale.edu>
>>>> To: "Helena Deus" <helenadeus@gmail.com>
>>>> Cc: "Kei Cheung" <kei.cheung@yale.edu>; "mdmiller" 
>>>> <mdmiller53@comcast.net>; "HCLS" <public-semweb-lifesci@w3.org>
>>>> Sent: Monday, November 30, 2009 8:19 AM
>>>> Subject: Re: BioRDF Telcon
>>>>
>>>>
>>>> I'm following a similar strategy, but have been following the MGED
>>>> ontology where possible. I've finished aligning the IDF portion, and
>>>> have started on SDRF. MGED ontology is missing a property and class
>>>> for what is often termed as ProtocolApplication, which usually serves
>>>> as an edge between derived from and derived nodes, while linking to
>>>> the protocol used for the derivation. I am planning on creating this
>>>> link in a MAGE extensions ontology, but would like to vet the
>>>> structure here:
>>>>
>>>> ProtocolApplication is a class.
>>>>
>>>> New properties:
>>>>
>>>> has_derivation_source
>>>> has_derivative
>>>>
>>>> And then ProtocolApplication would have the restrictions:
>>>>
>>>> has_protocol some Protocol
>>>>
>>>> I don't put domains, etc. on the derived properties to allow use in
>>>> directly describing derivations if people so choose. There is no
>>>> superclass for all nodes that can be derived or derived from, so I'm
>>>> not bothering with restrictions for those, although I could add a
>>>> union restriction to it.
>>>>
>>>> If this structure is acceptable to people, I can publish the ontology
>>>> for general use pretty quickly, and let us work from the same data
>>>> structure. I would appreciate any feedback.
>>>>
>>>> Jim
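
A minimal sketch of how the structure proposed above might look as triples, written in plain Python rather than with an RDF library. The class name and the three property names are the ones listed in the message; the namespace URI, the node identifiers, and both helper functions are hypothetical.

```python
# Hypothetical namespace for the proposed MAGE extensions ontology; the
# class and property names are the ones listed in the message above.
MAGE_EXT = "http://example.org/mage-ext#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def protocol_application(app_id, source, derivative, protocol):
    """Triples for one ProtocolApplication linking a source node to a derived node."""
    app = MAGE_EXT + app_id
    return [
        (app, RDF_TYPE, MAGE_EXT + "ProtocolApplication"),
        (app, MAGE_EXT + "has_derivation_source", MAGE_EXT + source),
        (app, MAGE_EXT + "has_derivative", MAGE_EXT + derivative),
        (app, MAGE_EXT + "has_protocol", MAGE_EXT + protocol),
    ]


def derived_from(triples, derivative):
    """Find the source nodes a derivative was derived from via any ProtocolApplication."""
    apps = {
        s for s, p, o in triples
        if p == MAGE_EXT + "has_derivative" and o == MAGE_EXT + derivative
    }
    return sorted(
        o for s, p, o in triples
        if s in apps and p == MAGE_EXT + "has_derivation_source"
    )


# Invented example: an extract derived from a sample by an extraction protocol.
TRIPLES = protocol_application("pa_1", "sample_1", "extract_1", "extraction_protocol")
print(derived_from(TRIPLES, "extract_1"))
```

Here the ProtocolApplication node acts as the edge between the derived-from and derived nodes, so the protocol used for a derivation can be looked up from either end.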
>>>>
>>>> On Monday, November 30, 2009, Helena Deus <helenadeus@gmail.com> 
>>>> wrote:
>>>>
>>>>> @Kei,
>>>>>
>>>>>
>>>>>
>>>>> When you said data structure, did you mean the RDF structure?
>>>>> For now, all I have is the java object returned by parser. I've 
>>>>> been using Limpopo, which creates an object that I can then parse 
>>>>> to RDF using Jena. The challenge, though, has been coming up with 
>>>>> the predicates to formalize the relationships between the various 
>>>>> elements. I'm using the XML structures for IDF/SDRF etc. at 
>>>>> http://magetab-om.sourceforge.net to automatically generate the 
>>>>> structure that will contain the data. My plan is to then create 
>>>>> the RDF triples that use the attributes described in those 
>>>>> documents and populate them with the data from the MAGE-TAB java 
>>>>> object created by Limpopo.
>>>>>
>>>>> Right now all I have is a very raw RDF/XML document describing the 
>>>>> relationships in the IDF structure: 
>>>>> http://magetab2rdf.googlecode.com/svn/trunk/magetabpredicates.rdf
>>>>> The triples for that had to be encoded manually using Jena by 
>>>>> reading the model.
>>>>> @Satya and Jun
>>>>> I would very much like to be involved in that effort, do you 
>>>>> already have a URL that I can look at?
>>>>>
>>>>> Thanks
>>>>> Lena
>>>>> On Tue, Nov 24, 2009 at 2:19 PM, Kei Cheung <kei.cheung@yale.edu> 
>>>>> wrote:
>>>>> Hi Lena et al,
>>>>>
>>>>> When you said data structure, did you mean the RDF structure? If 
>>>>> so, is there a pointer to the structure that we can look at?
>>>>>
>>>>> As discussed during yesterday's call, Jun and Satya will help 
>>>>> create a wiki page for listing some of the requirements for 
>>>>> provenance/workflow in the context of gene lists, perhaps we 
>>>>> should also use it to help coordinate some of the future 
>>>>> activities (people also brought up Taverna during the call 
>>>>> yesterday). Please coordinate with Satya and Jun.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -Kei
>>>>>
>>>>> Helena Deus wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I apologize for missing the call yesterday! It seems you had a 
>>>>> pretty interesting discussion! :-)
>>>>> If I understand Michael's statement, parsing the MAGE-TAB/MAGE-ML 
>>>>> into RDF would result in obtaining only the raw and processed data 
>>>>> files but not the mechanism used to process it nor the resulting 
>>>>> gene list. That's also what I concluded after looking at the data 
>>>>> structure created by Tony Burdett's Limpopo parser. However, 
>>>>> having the raw data as linked data is already a great start! Kei, 
>>>>> should I be looking into Taverna in order to reprocess the raw 
>>>>> files with a traceable analysis workflow?
>>>>>
>>>>> Thanks!
>>>>> Lena
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 24, 2009 at 9:59 AM, mdmiller <mdmiller53@comcast.net 
>>>>> <mailto:mdmiller53@comcast.net>> wrote:
>>>>>
>>>>>  hi all,
>>>>>
>>>>>  (from the minutes)
>>>>>
>>>>>  "Yolanda/Kei/Scott: semantic annotation/description of workflow
>>>>>  would enable the retrieval of data relevant to that workflow (i.e.
>>>>>  data that could be used to populate that workflow for a different
>>>>>  experimental scenario)"
>>>>>
>>>>>  what is typically in a MAGE-TAB/MAGE-ML document are the protocols
>>>>>  for how the source was processed into the extract then how the
>>>>>  hybridization, feature extraction, error and normalization were
>>>>>  performed. these are interesting and different protocols can
>>>>>  cause differences at this level but it is pretty much a known art
>>>>>  and usually not of too much interest or variability.
>>>>>
>>>>>  what is usually missing from those documents, along with the final
>>>>>  gene list, is how that gene list was obtained, what higher level
>>>>>  analysis was used, that is generally only in the paper 
>>>>> unfortunately.
>>>>>
>>>>>  cheers,
>>>>>  michael
>>>>>  .
>>>>>  ----- Original Message ----- From: "Kei Cheung"
>>>>>
>>>>>  <kei.cheung@yale.edu <mailto:kei.cheung@yale.edu>>
>>>>>  To: "HCLS" <public-semweb-lifesci@w3.org
>>>>>
>>>>>  <mailto:public-semweb-lifesci@w3.org>>
>>>>>  Sent: Monday, November 23, 2009 1:27 PM
>>>>>  Subject: Re: BioRDF Telcon
>>>>>
>>>>>
>>>>>
>>>>>  Today's BioRDF minutes are available at the following:
>>>>>
>>>>>
>>>>> http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2009/11-23_Conference_Call 
>>>>>
>>>>>
>>>>>  Thanks to Rob for scribing.
>>>>>
>>>>>  Cheers,
>>>>>
>>>>>  -Kei
>>>>>
>>>>>  Kei Cheung wrote:
>>>>>
>>>>>  This is a reminder that the next BioRDF telcon call will
>>>>>  be held at 11 am EDT (5 pm CET) on Monday, November 23
>>>>>  (see details below).
>>>>>
>>>>>  Cheers,
>>>>>
>>>>>  -Kei
>>>>>
>>>>>  == Conference Details ==
>>>>>  * Date of Call: Monday November 23, 2009
>>>>>  * Time of Call: 11:00 am Eastern Time
>>>>>  * Dial-In #: +1.617.761.6200 (Cambridge, MA)
>>>>>  * Dial-In #: +33.4.89.06.34.99 (Nice, France)
>>>>>  * Dial-In #: +44.117.370.6152 (Bristol, UK)
>>>>>  * Participant Access Code: 4257 ("HCLS")
>>>>>
>>>>>  * IRC Channel: irc.w3.org <http://irc.w3.org> port 6665
>>>>>  channel #
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
Received on Monday, 14 December 2009 02:33:24 UTC