- From: Kei Cheung <kei.cheung@yale.edu>
- Date: Sun, 13 Dec 2009 21:32:41 -0500
- To: mdmiller <mdmiller53@comcast.net>
- CC: Jim McCusker <james.mccusker@yale.edu>, Helena Deus <helenadeus@gmail.com>, HCLS <public-semweb-lifesci@w3.org>
Hi Michael,
Thanks for pointing to MSigDB. I included this in the related links
section of the microarray use case description
(http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/QueryFederation2). Also
please see my response below.
Cheers,
-Kei
mdmiller wrote:
> hi all,
>
> here is the link to Molecular Signatures Database (MSigDB):
>
> [1]: http://www.broadinstitute.org/gsea/msigdb/
>
> cheers,
> michael
>
> ----- Original Message ----- From: "mdmiller" <mdmiller53@comcast.net>
> To: "Kei Cheung" <kei.cheung@yale.edu>
> Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus"
> <helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
> Sent: Thursday, December 10, 2009 6:49 AM
> Subject: Re: BioRDF Telcon
>
>
>> hi kei,
>>
>>> To me, ontologies can be used to facilitate integrated semantic
>>> queries across experiments/datasets.
>>
>>
>> yes, and this is starting to become a reality. this effort, along
>> with other HCLS initiatives, is helping to pave the way.
>>
>>> While some of the protocols are standardized, the data protocols for
>>> obtaining things like gene lists vary a lot. One of my questions is
>>> whether such data analysis protocols can somehow be entered into MAGE-TAB.
>>
>>
>> yes it can be, along with the gene list, but in practice this is not
>> done by the submitter. after the Derived Array Data representing the
>> normalized data, like CHP files, there can be one or more Protocol
>> REF columns describing the analysis to obtain the gene list followed
>> by a Derived Array Data Matrix File that is the gene list with its
>> signature.
>>
>> perhaps MIAME needs to be extended to state this. it's something
>> i'll be bringing up with the MGED board. it's just now that this has
>> become something of value to be machine readable. besides GeneSigDB,
>> there is another effort, MSigDB [1], that is also curating gene
>> lists. so the community is beginning to see the value of this.
>>
Yes, for these gene lists to be of value to researchers, rich annotation
is key. The challenge here is that it's quite tedious to enter the
custom data analysis protocols in a structured way by hand.
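The column layout Michael describes could be generated rather than typed by hand. The following is only a rough sketch under stated assumptions: the column headers come from this thread, while the sample names, file names, and the protocol identifier are made-up placeholders, and a real SDRF would carry many more columns.

```python
import csv
import io

# Column headers as named in this thread; all values below are hypothetical.
COLUMNS = [
    "Source Name",
    "Derived Array Data File",         # normalized data, e.g. a CHP file
    "Protocol REF",                    # analysis used to obtain the gene list
    "Derived Array Data Matrix File",  # the gene list with its signature
]

ROWS = [
    ["patient_1", "patient_1.CHP", "P-HYPOTHETICAL-GENELIST", "gene_list.txt"],
    ["patient_2", "patient_2.CHP", "P-HYPOTHETICAL-GENELIST", "gene_list.txt"],
]


def write_sdrf_fragment(columns, rows):
    """Return a tab-delimited SDRF-style fragment as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


sdrf_text = write_sdrf_fragment(COLUMNS, ROWS)
print(sdrf_text)
```

Generating the rows from the analysis script itself would be one way to avoid the manual entry step, assuming the script knows which protocol it applied.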
>>> At least for now, I don't think we need to convert the huge primary
>>> data files (e.g., CEL file) into RDF. For the time being, we are
>>> more focused on the processed gene lists that may be associated with
>>> more biological meanings.
>>
>>
>> perhaps it's worthwhile considering using an ontology 'raw data' class
>> for raw data that contains a reference to the data file. one could
>> then use appropriate analysis tools to produce normalized data which
>> could then also be referenced by a 'normalized data' class.
>
It seems to make sense.
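One possible reading of this suggestion, as a minimal sketch: each data node is typed with a 'raw data' or 'normalized data' class and points at its file by reference, so the file contents never enter the RDF. The namespace, the class names RawData and NormalizedData, the properties has_data_file and derived_from, and the file URLs are all hypothetical and not taken from any existing ontology.

```python
EX = "http://example.org/microarray#"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(local_name):
    """Wrap a local name in the hypothetical namespace as an N-Triples URI."""
    return f"<{EX}{local_name}>"


def describe_data(name, data_class, file_url, derived_from=None):
    """Build triples for one data node that references, not embeds, its file."""
    triples = [
        (uri(name), f"<{RDF_TYPE}>", uri(data_class)),
        (uri(name), uri("has_data_file"), f"<{file_url}>"),
    ]
    if derived_from is not None:
        triples.append((uri(name), uri("derived_from"), uri(derived_from)))
    return triples


triples = describe_data(
    "sample1_raw", "RawData", "http://example.org/files/sample1.CEL"
)
triples += describe_data(
    "sample1_norm",
    "NormalizedData",
    "http://example.org/files/sample1.CHP",
    derived_from="sample1_raw",
)

# Serialize as N-Triples: one "subject predicate object ." statement per line.
ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
print(ntriples)
```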
>>
>> cheers,
>> michael
>>
>> ----- Original Message ----- From: "Kei Cheung" <kei.cheung@yale.edu>
>> To: "mdmiller" <mdmiller53@comcast.net>
>> Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus"
>> <helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
>> Sent: Monday, December 07, 2009 7:32 AM
>> Subject: Re: BioRDF Telcon
>>
>>
>>> mdmiller wrote:
>>>
>>>> hi jim and lena,
>>>>
>>>> great progress! this will be a nice tool.
>>>>
>>>> a couple of comments.
>>>>
>>>> 1) i think ProtocolApplication is best seen as an individual
>>>> instance of the Protocol class. quite often there are arguments
>>>> whether ontologies should have individuals or be simply classes.
>>>> to me, that doesn't apply here where real world objects are being
>>>> connected to ontologies. the BioSource is realized as the 'Source
>>>> Name' column in MAGE-TAB and those entries represent real people in
>>>> studies, mice or rats in non-clinical studies, etc., and the
>>>> characteristics values like age represent real individual instances
>>>> of age. in the same way, the values in the Protocol REF column of
>>>> MAGE-TAB are real wet-lab or analysis individual instances of
>>>> protocols, called protocol applications in MAGE-OM.
>>>
>>> It sounds like we need to look at how to map column names and
>>> entries to classes, instances, and relationships appropriately.
>>>
>>>>
>>>> failure to make this distinction, to me, has obscured how much
>>>> value ontologies can have in the real world. too often i see
>>>> ontologies seen in and of themselves, which has its own value
>>>> certainly, but not for the use cases i have dealing with real
>>>> biological data.
>>>
>>>
>>> To me, ontologies can be used to facilitate integrated semantic
>>> queries across experiments/datasets.
>>>
>>>>
>>>> 2) for this use case, the information between
>>>> the 'Source Name' and its characteristics and the 'Derived Array
>>>> Data Matrix File' or 'Derived Array Data File' has limited
>>>> usefulness. error correction and normalization can make some
>>>> difference but if the provider of the MAGE-TAB is trusted, all that
>>>> is pretty routine these days. the above combined with experimental
>>>> factors and experiment design info is probably 95% to 99.9% the
>>>> worthwhile information from the MAGE-TAB. if one notices a
>>>> difference in the final gene set between two experiments that look
>>>> the same, only then it might be worthwhile going into more detail.
>>>>
>>>> and as has been noted, the MAGE-TAB information needs to be
>>>> supplemented with the information on the final gene set, its
>>>> expression values, and the higher-level analysis that was
>>>> used, which is usually buried in the paper.
>>>
>>> While some of the protocols are standardized, the data protocols for
>>> obtaining things like gene lists vary a lot. One of my questions is
>>> whether such data analysis protocols can somehow be entered into MAGE-TAB.
>>>
>>>>
>>>> 3) i'm not sure if there was a desire to capture the raw data in
>>>> the RDF. that will be, for affymetrix, a million to six million
>>>> probes in the CEL file, even the processed data in the CHP file
>>>> would have 20,000 to 60,000 probe sets. i'm not sure if that is
>>>> the best way to represent that.
>>>
>>> At least for now, I don't think we need to convert the huge primary
>>> data files (e.g., CEL file) into RDF. For the time being, we are
>>> more focused on the processed gene lists that may be associated with
>>> more biological meanings.
>>>
>>> Cheers,
>>>
>>> -Kei
>>>
>>>>
>>>> cheers,
>>>> michael
>>>>
>>>> Michael Miller
>>>> mdmiller53@comcast.net
>>>>
>>>> ----- Original Message ----- From: "Jim McCusker"
>>>> <james.mccusker@yale.edu>
>>>> To: "Helena Deus" <helenadeus@gmail.com>
>>>> Cc: "Kei Cheung" <kei.cheung@yale.edu>; "mdmiller"
>>>> <mdmiller53@comcast.net>; "HCLS" <public-semweb-lifesci@w3.org>
>>>> Sent: Monday, November 30, 2009 8:19 AM
>>>> Subject: Re: BioRDF Telcon
>>>>
>>>>
>>>> I'm following a similar strategy, but have been following the MGED
>>>> ontology where possible. I've finished aligning the IDF portion, and
>>>> have started on SDRF. MGED ontology is missing a property and class
>>>> for what is often termed as ProtocolApplication, which usually serves
>>>> as an edge between derived from and derived nodes, while linking to
>>>> the protocol used for the derivation. I am planning on creating this
>>>> link in a MAGE extensions ontology, but would like to vet the
>>>> structure here:
>>>>
>>>> ProtocolApplication is a class.
>>>>
>>>> New properties:
>>>>
>>>> has_derivation_source
>>>> has_derivative
>>>>
>>>> And then ProtocolApplication would have the restrictions:
>>>>
>>>> has_protocol some Protocol
>>>>
>>>> I don't put domains, etc. on the derived properties to allow use in
>>>> directly describing derivations if people so choose. There is no
>>>> superclass for all nodes that can be derived or derived from, so I'm
>>>> not bothering with restrictions for those, although I could add a
>>>> union restriction to it.
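The structure proposed above can be written out in Turtle for review. This is only a sketch under assumptions: both namespace URIs are placeholders, and has_protocol is assumed to come from the MGED ontology because it is not listed among the new properties. Python is used here just to assemble the text.

```python
EXT = "http://example.org/mage-ext#"  # hypothetical extensions namespace
MGED = "http://example.org/mged#"     # placeholder, not the real MGED URI


def protocol_application_turtle():
    """Return a Turtle rendering of the proposed ProtocolApplication structure."""
    lines = [
        f"@prefix ext: <{EXT}> .",
        f"@prefix mged: <{MGED}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        "# ProtocolApplication is a class restricted to: has_protocol some Protocol",
        "ext:ProtocolApplication a owl:Class ;",
        "    rdfs:subClassOf [",
        "        a owl:Restriction ;",
        "        owl:onProperty mged:has_protocol ;",
        "        owl:someValuesFrom mged:Protocol",
        "    ] .",
        "",
        "# New properties, left without rdfs:domain or rdfs:range on purpose",
        "ext:has_derivation_source a owl:ObjectProperty .",
        "ext:has_derivative a owl:ObjectProperty .",
    ]
    return "\n".join(lines) + "\n"


ontology_ttl = protocol_application_turtle()
print(ontology_ttl)
```

Leaving the two new properties without a domain or range matches the stated intent that they can also be used to describe derivations directly.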
>>>>
>>>> If this structure is acceptable to people, I can publish the ontology
>>>> for general use pretty quickly, and let us work from the same data
>>>> structure. I would appreciate any feedback.
>>>>
>>>> Jim
>>>>
>>>> On Monday, November 30, 2009, Helena Deus <helenadeus@gmail.com>
>>>> wrote:
>>>>
>>>>> @Kei,
>>>>>
>>>>>
>>>>>
>>>>> When you said data structure, did you mean the RDF structure?
>>>>> For now, all I have is the Java object returned by the parser. I've
>>>>> been using Limpopo, which creates an object that I can then parse
>>>>> to RDF using Jena. The challenge, though, has been coming up with
>>>>> the predicates to formalize the relationships between the various
>>>>> elements. I'm using the XML structures for IDF/SDRF etc. at
>>>>> http://magetab-om.sourceforge.net to automatically generate the
>>>>> structure that will contain the data. My plan is to then create
>>>>> the RDF triples that use the attributes described in those
>>>>> documents and populate them with the data from the MAGE-TAB java
>>>>> object created by Limpopo.
>>>>>
>>>>> Right now all I have is a very raw RDF/XML document describing the
>>>>> relationships in the IDF structure:
>>>>> http://magetab2rdf.googlecode.com/svn/trunk/magetabpredicates.rdf
>>>>> The triples for that had to be encoded manually using Jena by
>>>>> reading the model.
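The parse-then-emit plan described above can be illustrated with a small stand-in. This sketch does not use Limpopo or Jena; it only assumes that an IDF is tab-delimited with a tag in the first cell of each row followed by one or more values. The sample text, the predicate namespace, and the investigation URI are hypothetical.

```python
# Tiny made-up IDF fragment: tag in the first cell, values in the rest.
IDF_TEXT = (
    "Investigation Title\tHypothetical expression study\n"
    "Experimental Design\tdisease_state_design\n"
    "Person Last Name\tDoe\tRoe\n"
)

PRED_NS = "http://example.org/magetab#"                 # hypothetical
INVESTIGATION = "<http://example.org/investigation/1>"  # hypothetical


def parse_idf(text):
    """Map each IDF tag to its list of non-empty values."""
    fields = {}
    for line in text.splitlines():
        cells = line.split("\t")
        tag = cells[0].strip()
        values = [cell.strip() for cell in cells[1:] if cell.strip()]
        if tag:
            fields[tag] = values
    return fields


def idf_to_triples(fields):
    """Derive one predicate per tag and one literal-valued triple per value."""
    triples = []
    for tag, values in fields.items():
        predicate = f"<{PRED_NS}{tag.replace(' ', '_')}>"
        for value in values:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append((INVESTIGATION, predicate, f'"{escaped}"'))
    return triples


fields = parse_idf(IDF_TEXT)
triples = idf_to_triples(fields)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Deriving predicates mechanically from the tags sidesteps encoding them by hand, though it says nothing about what the relationships should mean, which is the harder part.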
>>>>> @Satya and Jun
>>>>> I would very much like to be involved in that effort, do you
>>>>> already have a URL that I can look at?
>>>>>
>>>>> Thanks,
>>>>> Lena
>>>>> On Tue, Nov 24, 2009 at 2:19 PM, Kei Cheung <kei.cheung@yale.edu>
>>>>> wrote:
>>>>> Hi Lena et al,
>>>>>
>>>>> When you said data structure, did you mean the RDF structure? If
>>>>> so, is there a pointer to the structure that we can look at?
>>>>>
>>>>> As discussed during yesterday's call, Jun and Satya will help
>>>>> create a wiki page for listing some of the requirements for
>>>>> provenance/workflow in the context of gene lists, perhaps we
>>>>> should also use it to help coordinate some of the future
>>>>> activities (people also brought up Taverna during the call
>>>>> yesterday). Please coordinate with Satya and Jun.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -Kei
>>>>>
>>>>> Helena Deus wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I apologize for missing the call yesterday! It seems you had a
>>>>> pretty interesting discussion! :-)
>>>>> If I understand Michael's statement, parsing the MAGE-TAB/MAGE-ML
>>>>> into RDF would result in obtaining only the raw and processed data
>>>>> files but not the mechanism used to process it nor the resulting
>>>>> gene list. That's also what I concluded after looking at the data
>>>>> structure created by Tony Burdett's Limpopo parser. However,
>>>>> having the raw data as linked data is already a great start! Kei,
>>>>> should I be looking into Taverna in order to reprocess the raw
>>>>> files with a traceable analysis workflow?
>>>>>
>>>>> Thanks!
>>>>> Lena
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 24, 2009 at 9:59 AM, mdmiller <mdmiller53@comcast.net
>>>>> <mailto:mdmiller53@comcast.net>> wrote:
>>>>>
>>>>> hi all,
>>>>>
>>>>> (from the minutes)
>>>>>
>>>>> "Yolanda/Kei/Scott: semantic annotation/description of workflow
>>>>> would enable the retrieval of data relevant to that workflow (i.e.
>>>>> data that could be used to populate that workflow for a different
>>>>> experimental scenario)"
>>>>>
>>>>> what is typically in a MAGE-TAB/MAGE-ML document are the protocols
>>>>> for how the source was processed into the extract then how the
>>>>> hybridization, feature extraction, error and normalization were
>>>>> performed. these are interesting and different protocols can
>>>>> cause differences at this level but it is pretty much a known art
>>>>> and usually not of too much interest or variability.
>>>>>
>>>>> what is usually missing from those documents, along with the final
>>>>> gene list, is how that gene list was obtained, what higher level
>>>>> analysis was used, that is generally only in the paper
>>>>> unfortunately.
>>>>>
>>>>> cheers,
>>>>> michael
>>>>> ----- Original Message ----- From: "Kei Cheung"
>>>>>
>>>>> <kei.cheung@yale.edu <mailto:kei.cheung@yale.edu>>
>>>>> To: "HCLS" <public-semweb-lifesci@w3.org
>>>>>
>>>>> <mailto:public-semweb-lifesci@w3.org>>
>>>>> Sent: Monday, November 23, 2009 1:27 PM
>>>>> Subject: Re: BioRDF Telcon
>>>>>
>>>>>
>>>>>
>>>>> Today's BioRDF minutes are available at the following:
>>>>>
>>>>>
>>>>> http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2009/11-23_Conference_Call
>>>>>
>>>>>
>>>>> Thanks to Rob for scribing.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -Kei
>>>>>
>>>>> Kei Cheung wrote:
>>>>>
>>>>> This is a reminder that the next BioRDF telcon call will
>>>>> be held at 11 am EST (5 pm CET) on Monday, November 23
>>>>> (see details below).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -Kei
>>>>>
>>>>> == Conference Details ==
>>>>> * Date of Call: Monday November 23, 2009
>>>>> * Time of Call: 11:00 am Eastern Time
>>>>> * Dial-In #: +1.617.761.6200 (Cambridge, MA)
>>>>> * Dial-In #: +33.4.89.06.34.99 (Nice, France)
>>>>> * Dial-In #: +44.117.370.6152 (Bristol, UK)
>>>>> * Participant Access Code: 4257 ("HCLS")
>>>>>
>>>>> * IRC Channel: irc.w3.org <http://irc.w3.org> port 6665
>>>>> channel #
Received on Monday, 14 December 2009 02:33:24 UTC