Re: BioRDF Telcon from mdmiller on 2009-12-12 (public-semweb-lifesci@w3.org from December 2009)

From: mdmiller <mdmiller53@comcast.net>
Date: Sat, 12 Dec 2009 07:47:41 -0800
To: "Kei Cheung" <kei.cheung@yale.edu>
Cc: "Jim McCusker" <james.mccusker@yale.edu>, "Helena Deus" <helenadeus@gmail.com>, "HCLS" <public-semweb-lifesci@w3.org>
Message-ID: <97256F24A8394386A246952F524C727E@mmPC>
hi all,

here is the link to Molecular Signatures Database (MSigDB):

[1]: http://www.broadinstitute.org/gsea/msigdb/

cheers,
michael

----- Original Message ----- 
From: "mdmiller" <mdmiller53@comcast.net>
To: "Kei Cheung" <kei.cheung@yale.edu>
Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus" 
<helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
Sent: Thursday, December 10, 2009 6:49 AM
Subject: Re: BioRDF Telcon


> hi kei,
>
>> To me, ontologies can be used to facilitate integrated semantic queries 
>> across experiments/datasets.
>
> yes, and this is starting to become a reality.  this effort, along with 
> other HCLS initiatives, is helping to pave the way.
>
>> While some of the protocols are standardized, the data protocols for 
>> obtaining things like gene lists vary a lot. One of my questions is that 
>> can such data analysis protocols be somehow entered into mage-tab.
>
> yes it can be, along with the gene list, but in practice this is not done 
> by the submitter.  after the Derived Array Data representing the 
> normalized data, like CHP files, there can be one or more Protocol REF 
> columns describing the analysis to obtain the gene list followed by a 
> Derived Array Data Matrix File that is the gene list with its signature.
>
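A minimal sketch of the column layout described above, assuming a tab-delimited SDRF fragment. The sample, file, and protocol names are invented for illustration, and the helper `derivation_steps` is hypothetical, not part of any MAGE-TAB tool:

```python
import csv
import io

# Hypothetical SDRF fragment: tab-delimited, with file and protocol names
# invented purely for illustration.
SDRF_TEXT = (
    "Source Name\tProtocol REF\tDerived Array Data File\t"
    "Protocol REF\tDerived Array Data Matrix File\n"
    "patient_01\tP-NORMALIZE\tpatient_01.CHP\t"
    "P-GENE-SELECTION\tsignature_genes.txt\n"
)


def derivation_steps(sdrf_text):
    """Yield (input_node, protocol, output_node) for each Protocol REF column.

    Column names repeat in SDRF, so rows are handled as ordered
    (header, value) pairs rather than as dictionaries.
    """
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, data_rows = rows[0], rows[1:]
    for row in data_rows:
        cells = list(zip(header, row))
        for i, (name, value) in enumerate(cells):
            if name != "Protocol REF":
                continue
            # Nearest non-protocol column on either side of this Protocol REF;
            # None if the row has no such neighbour.
            before = next(
                (v for n, v in reversed(cells[:i]) if n != "Protocol REF"), None
            )
            after = next(
                (v for n, v in cells[i + 1:] if n != "Protocol REF"), None
            )
            yield before, value, after


steps = list(derivation_steps(SDRF_TEXT))
for source, protocol, derived in steps:
    print(f"{source} --[{protocol}]--> {derived}")
```

Reading the row this way makes the second Protocol REF the analysis step that turns the normalized CHP data into the gene list file.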
> perhaps MIAME needs to be extended to state this.  it's something i'll be 
> bringing up with the MGED board.  it's just now that this has become 
> something of value to be machine readable.  besides GeneSigDB, there is 
> another effort, MSigDB [1], that is also curating gene lists.  so the 
> community is beginning to see the value of this.
>
>> At least for now, I don't think we need to convert the huge primary data 
>> files (e.g., CEL file) into RDF. For the time being, we are more focused 
>> on the processed gene lists that may be associated with more biological 
>> meanings.
>
> perhaps it's worthwhile considering using an ontology 'raw data' class for 
> raw data that contains a reference to the data file.  one could then use 
> appropriate analysis tools to produce normalized data which could then 
> also be referenced by a 'normalized data' class.
>
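One way to picture the by-reference idea above is a handful of triples per data set, independent of how many probes the file holds. This is only a sketch: the `http://example.org/` URIs, the class names `RawData` and `NormalizedData`, and the properties `has_file` and `derived_from` are all assumptions made up for the example, not terms from the MGED ontology:

```python
# Hypothetical namespace and property names, chosen only for illustration.
EX = "http://example.org/mage-ext#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def data_node(node_uri, class_name, file_uri):
    """Describe a data set by reference: a typed node plus a link to its file."""
    return [
        (node_uri, RDF_TYPE, EX + class_name),
        (node_uri, EX + "has_file", file_uri),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


raw = "http://example.org/data/hyb1-raw"
norm = "http://example.org/data/hyb1-normalized"

triples = (
    data_node(raw, "RawData", "http://example.org/files/hyb1.CEL")
    + data_node(norm, "NormalizedData", "http://example.org/files/hyb1.CHP")
    # Link the normalized data back to the raw data it was computed from.
    + [(norm, EX + "derived_from", raw)]
)

print(to_ntriples(triples))
```

The CEL and CHP contents stay in their files; the RDF carries only the typed nodes and the links between them.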
> cheers,
> michael
>
> ----- Original Message ----- 
> From: "Kei Cheung" <kei.cheung@yale.edu>
> To: "mdmiller" <mdmiller53@comcast.net>
> Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus" 
> <helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
> Sent: Monday, December 07, 2009 7:32 AM
> Subject: Re: BioRDF Telcon
>
>
>> mdmiller wrote:
>>> hi jim and lena,
>>>
>>> great progress!  this will be a nice tool.
>>>
>>> a couple of comments.
>>>
>>> 1) i think ProtocolApplication is best seen as an individual instance 
>>> of the Protocol class.  quite often there are arguments whether 
>>> ontologies should have individuals or be simply classes.  to me, that 
>>> doesn't apply here where real world objects are being connected to 
>>> ontologies.  the BioSource is realized as the  'Source Name' column in 
>>> MAGE-TAB and those entries represent real people in studies, mice or 
>>> rats in non-clinical studies, etc., and the characteristics values like 
>>> age represent real individual instances of age.  in the same way, the 
>>> values in the Protocol REF column of MAGE-TAB are real wet-lab or 
>>> analysis individual instances of protocols, called protocol applications 
>>> in MAGE-OM.
>> It sounds like we need to look at how to map column names and entries to 
>> classes, instances, and relationships appropriately.
>>>
>>> failure to make this distinction, to me, has obscured how much value 
>>> ontologies can have in the real world.  too often i see ontologies seen 
>>> in and of themselves, which has its own value certainly, but not for the 
>>> use cases i have dealing with real biological data.
>>
>> To me, ontologies can be used to facilitate integrated semantic queries 
>> across experiments/datasets.
>>>
>>> 2) for this use case, the information between the 'Source Name' and 
>>> its characteristics and the 'Derived Array Data Matrix File' or 
>>> 'Derived Array Data File' has limited usefulness.  error correction 
>>> and normalization can make some difference but if the 
>>> provider of the MAGE-TAB is trusted, all that is pretty routine these 
>>> days.  the above combined with experimental factors and experiment 
>>> design info is probably 95% to 99.9% the worthwhile information from the 
>>> MAGE-TAB.  if one notices a difference in the final gene set between two 
>>> experiments that look the same, only then it might be worthwhile going 
>>> into more detail.
>>>
>>> and as has been noted, the MAGE-TAB information needs to be supplemented 
>>> with the information on the final gene set, its expression values, and 
>>> the higher-level analysis that was used, which is usually buried in the 
>>> paper.
>> While some of the protocols are standardized, the data protocols for 
>> obtaining things like gene lists vary a lot. One of my questions is that 
>> can such data analysis protocols be somehow entered into mage-tab.
>>>
>>> 3) i'm not sure if there was a desire to capture the raw data in the 
>>> RDF. that will be, for affymetrix, a million to six million probes in 
>>> the CEL file, even the processed data in the CHP file would have 20,000 
>>> to 60,000 probe sets.  i'm not sure if that is the best way to represent 
>>> that.
>> At least for now, I don't think we need to convert the huge primary data 
>> files (e.g., CEL file) into RDF. For the time being, we are more focused 
>> on the processed gene lists that may be associated with more biological 
>> meanings.
>>
>> Cheers,
>>
>> -Kei
>>>
>>> cheers,
>>> michael
>>>
>>> Michael Miller
>>> mdmiller53@comcast.net
>>>
>>> ----- Original Message ----- From: "Jim McCusker" 
>>> <james.mccusker@yale.edu>
>>> To: "Helena Deus" <helenadeus@gmail.com>
>>> Cc: "Kei Cheung" <kei.cheung@yale.edu>; "mdmiller" 
>>> <mdmiller53@comcast.net>; "HCLS" <public-semweb-lifesci@w3.org>
>>> Sent: Monday, November 30, 2009 8:19 AM
>>> Subject: Re: BioRDF Telcon
>>>
>>>
>>> I'm following a similar strategy, but have been following the MGED
>>> ontology where possible. I've finished aligning the IDF portion, and
>>> have started on SDRF. MGED ontology is missing a property and class
>>> for what is often termed as ProtocolApplication, which usually serves
>>> as an edge between derived from and derived nodes, while linking to
>>> the protocol used for the derivation. I am planning on creating this
>>> link in a MAGE extensions ontology, but would like to vet the
>>> structure here:
>>>
>>> ProtocolApplication is a class.
>>>
>>> New properties:
>>>
>>> has_derivation_source
>>> has_derivative
>>>
>>> And then ProtocolApplication would have the restrictions:
>>>
>>> has_protocol some Protocol
>>>
>>> I don't put domains, etc. on the derived properties to allow use in
>>> directly describing derivations if people so choose. There is no
>>> superclass for all nodes that can be derived or derived from, so I'm
>>> not bothering with restrictions for those, although I could add a
>>> union restriction to it.
>>>
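A rough sketch of the proposed structure, reading ProtocolApplication as the subject of all three properties. The property names come from the message above; the namespace, the example URIs, and the helper `derived_pairs` are assumptions for illustration only:

```python
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass(frozen=True)
class ProtocolApplication:
    """One application of a protocol, acting as an edge between two nodes."""

    uri: str
    protocol: str
    source: str      # the node that was derived from
    derivative: str  # the node that was produced

    def triples(self, ns="http://example.org/mage-ext#"):
        """Return the four triples describing this application.

        The namespace default is a placeholder, not a published ontology URI.
        """
        return [
            (self.uri, RDF_TYPE, ns + "ProtocolApplication"),
            (self.uri, ns + "has_protocol", self.protocol),
            (self.uri, ns + "has_derivation_source", self.source),
            (self.uri, ns + "has_derivative", self.derivative),
        ]


def derived_pairs(applications):
    """Collapse applications into direct (source, derivative) pairs."""
    return [(a.source, a.derivative) for a in applications]


app = ProtocolApplication(
    uri="http://example.org/pa/1",
    protocol="http://example.org/protocol/normalization",
    source="http://example.org/data/hyb1-raw",
    derivative="http://example.org/data/hyb1-normalized",
)

for triple in app.triples():
    print(triple)
```

Since the derivation properties carry no domain, the same source and derivative could also be linked directly; `derived_pairs` shows the information such a direct link would keep, dropping the protocol.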
>>> If this structure is acceptable to people, I can publish the ontology
>>> for general use pretty quickly, and let us work from the same data
>>> structure. I would appreciate any feedback.
>>>
>>> Jim
>>>
>>> On Monday, November 30, 2009, Helena Deus <helenadeus@gmail.com> wrote:
>>>> @Kei,
>>>>
>>>>
>>>>
>>>> When you said data structure, did you mean the RDF structure?
>>>> For now, all I have is the java object returned by parser. I've been 
>>>> using Limpopo, which creates an object that I can then parse to RDF 
>>>> using Jena. The challenge, though, has been coming up with the 
>>>> predicates to formalize the relationships between the various elements. 
>>>> I'm using the XML structures for IDF/SDRF etc. at 
>>>> http://magetab-om.sourceforge.net to automatically generate the 
>>>> structure that will contain the data. My plan is to then create the RDF 
>>>> triples that use the attributes described in those documents and 
>>>> populate them with the data from the MAGE-TAB java object created by 
>>>> Limpopo.
>>>>
>>>> Right now all I have is a very raw RDF/XML document describing the 
>>>> relationships in the IDF structure: 
>>>> http://magetab2rdf.googlecode.com/svn/trunk/magetabpredicates.rdf
>>>> The triples for that had to be encoded manually using Jena by reading 
>>>> the model.
>>>> @Satya and Jun
>>>> I would very much like to be involved in that effort, do you already 
>>>> have a URL that I can look at?
>>>>
>>>> Thanks,
>>>> Lena
>>>> On Tue, Nov 24, 2009 at 2:19 PM, Kei Cheung <kei.cheung@yale.edu> 
>>>> wrote:
>>>> Hi Lena et al,
>>>>
>>>> When you said data structure, did you mean the RDF structure? If so, is 
>>>> there a pointer to the structure that we can look at?
>>>>
>>>> As discussed during yesterday's call, Jun and Satya will help create a 
>>>> wiki page for listing some of the requirements for provenance/workflow 
>>>> in the context of gene lists, perhaps we should also use it to help 
>>>> coordinate some of the future activities (people also brought up 
>>>> Taverna during the call yesterday). Please coordinate with Satya and 
>>>> Jun.
>>>>
>>>> Cheers,
>>>>
>>>> -Kei
>>>>
>>>> Helena Deus wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I apologize for missing the call yesterday! It seems you had a pretty 
>>>> interesting discussion! :-)
>>>> If I understand Michael's statement, parsing the MAGE-TAB/MAGE-ML into 
>>>> RDF would result in obtaining only the raw and processed data files but 
>>>> not the mechanism used to process it nor the resulting gene list. 
>>>> That's also what I concluded after looking at the data structure 
>>>> created by Tony Burdett's Limpopo parser. However, having the raw data 
>>>> as linked data is already a great start! Kei, should I be looking into 
>>>> Taverna in order to reprocess the raw files with a traceable analysis 
>>>> workflow?
>>>>
>>>> Thanks!
>>>> Lena
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 24, 2009 at 9:59 AM, mdmiller <mdmiller53@comcast.net 
>>>> <mailto:mdmiller53@comcast.net>> wrote:
>>>>
>>>>  hi all,
>>>>
>>>>  (from the minutes)
>>>>
>>>>  "Yolanda/Kei/Scott: semantic annotation/description of workflow
>>>>  would enable the retrieval of data relevant to that workflow (i.e.
>>>>  data that could be used to populate that workflow for a different
>>>>  experimental scenario)"
>>>>
>>>>  what is typically in a MAGE-TAB/MAGE-ML document are the protocols
>>>>  for how the source was processed into the extract then how the
>>>>  hybridization, feature extraction, error and normalization were
>>>>  performed. these are interesting and different protocols can
>>>>  cause differences at this level but it is pretty much a known art
>>>>  and usually not of too much interest or variability.
>>>>
>>>>  what is usually missing from those documents, along with the final
>>>>  gene list, is how that gene list was obtained, what higher level
>>>>  analysis was used, that is generally only in the paper unfortunately.
>>>>
>>>>  cheers,
>>>>  michael
>>>>  ----- Original Message ----- From: "Kei Cheung"
>>>>
>>>>  <kei.cheung@yale.edu <mailto:kei.cheung@yale.edu>>
>>>>  To: "HCLS" <public-semweb-lifesci@w3.org
>>>>
>>>>  <mailto:public-semweb-lifesci@w3.org>>
>>>>  Sent: Monday, November 23, 2009 1:27 PM
>>>>  Subject: Re: BioRDF Telcon
>>>>
>>>>
>>>>
>>>>  Today's BioRDF minutes are available at the following:
>>>>
>>>>
>>>> http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2009/11-23_Conference_Call
>>>>
>>>>  Thanks to Rob for scribing.
>>>>
>>>>  Cheers,
>>>>
>>>>  -Kei
>>>>
>>>>  Kei Cheung wrote:
>>>>
>>>>  This is a reminder that the next BioRDF telcon call will
>>>>  be held at 11 am EST (5 pm CET) on Monday, November 23
>>>>  (see details below).
>>>>
>>>>  Cheers,
>>>>
>>>>  -Kei
>>>>
>>>>  == Conference Details ==
>>>>  * Date of Call: Monday November 23, 2009
>>>>  * Time of Call: 11:00 am Eastern Time
>>>>  * Dial-In #: +1.617.761.6200 (Cambridge, MA)
>>>>  * Dial-In #: +33.4.89.06.34.99 (Nice, France)
>>>>  * Dial-In #: +44.117.370.6152 (Bristol, UK)
>>>>  * Participant Access Code: 4257 ("HCLS")
>>>>
>>>>  * IRC Channel: irc.w3.org <http://irc.w3.org> port 6665
>>>>  channel #
>>>>
>>>
>>
>>
>>
>
>
>
>
Received on Saturday, 12 December 2009 15:48:22 UTC