Re: BioRDF Telcon from mdmiller on 2009-12-10 (public-semweb-lifesci@w3.org from December 2009)

From: mdmiller <mdmiller53@comcast.net>
Date: Thu, 10 Dec 2009 06:49:47 -0800
To: "Kei Cheung" <kei.cheung@yale.edu>
Cc: "Jim McCusker" <james.mccusker@yale.edu>, "Helena Deus" <helenadeus@gmail.com>, "HCLS" <public-semweb-lifesci@w3.org>
Message-ID: <C55678C9C3804A09A25EC13B16FF5D95@mmPC>
hi kei,

> To me, ontologies can be used to facilitate integrated semantic queries 
> across experiments/datasets.

yes, and this is starting to become a reality.  this effort, along with 
other HCLS initiatives, is helping to pave the way.

> While some of the protocols are standardized, the data protocols for 
> obtaining things like gene lists vary a lot. One of my questions is 
> whether such data analysis protocols can somehow be entered into MAGE-TAB.

yes it can be, along with the gene list, but in practice this is not done by 
the submitter.  after the Derived Array Data representing the normalized 
data, like CHP files, there can be one or more Protocol REF columns 
describing the analysis to obtain the gene list followed by a Derived Array 
Data Matrix File that is the gene list with its signature.
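
as a rough illustration of that column layout, here is a minimal python 
sketch.  the sample, file and protocol names are made up, and the helper 
`gene_list_protocols` is hypothetical, not part of any MAGE-TAB tool; only 
the column headings come from MAGE-TAB.  it reads one SDRF row and picks out 
the Protocol REF values that sit between the normalized data file and the 
gene list file.

```python
import csv
import io

# Hypothetical SDRF fragment: all values below are invented for illustration.
# Only the column headings are taken from MAGE-TAB.
SDRF_TEXT = "\t".join([
    "Source Name", "Protocol REF", "Derived Array Data File",
    "Protocol REF", "Protocol REF", "Derived Array Data Matrix File",
]) + "\n" + "\t".join([
    "patient_1", "P-normalization", "patient_1.CHP",
    "P-differential-expression", "P-clustering", "gene_list.txt",
]) + "\n"


def gene_list_protocols(header, row):
    """Return the Protocol REF values between the normalized data and the gene list."""
    start = header.index("Derived Array Data File")
    end = header.index("Derived Array Data Matrix File")
    return [row[i] for i in range(start + 1, end) if header[i] == "Protocol REF"]


# csv.reader is used instead of DictReader because "Protocol REF" repeats.
rows = list(csv.reader(io.StringIO(SDRF_TEXT), delimiter="\t"))
header, first = rows[0], rows[1]
protocols = gene_list_protocols(header, first)
gene_list_file = first[header.index("Derived Array Data Matrix File")]
print(gene_list_file, protocols)
```

column position matters here because the same heading can appear several 
times in one SDRF row, so the protocols are identified by where they sit 
relative to the data file columns.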

perhaps MIAME needs to be extended to state this.  it's something i'll be 
bringing up with the MGED board.  it's only now that having this machine 
readable has become something of value.  besides GeneSigDB, there is 
another effort, MSigDB [1], that is also curating gene lists.  so the 
community is beginning to see the value of this.

> At least for now, I don't think we need to convert the huge primary data 
> files (e.g., CEL file) into RDF. For the time being, we are more focused 
> on the processed gene lists that may be associated with more biological 
> meanings.

perhaps it's worthwhile considering using an ontology 'raw data' class for 
raw data that contains a reference to the data file.  one could then use 
appropriate analysis tools to produce normalized data which could then also 
be referenced by a 'normalized data' class.
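
a minimal python sketch of that idea, under stated assumptions: the 
namespace, the class names `RawData` and `NormalizedData`, the `has_file` 
property and the individual names are all hypothetical; `ProtocolApplication`, 
`has_protocol`, `has_derivation_source` and `has_derivative` are the names 
jim proposed earlier in this thread.  the data files themselves stay outside 
the RDF and are only referenced by name.

```python
EX = "http://example.org/mage-ext#"  # hypothetical namespace, not a published ontology
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Literal(str):
    """Marks a plain value such as a file name, as opposed to a URI."""


def describe_derivation(raw_file, normalized_file, protocol_name):
    """Return triples linking raw data to normalized data via a protocol application."""
    raw = EX + "raw_data_1"
    normalized = EX + "normalized_data_1"
    application = EX + "application_1"
    return [
        (raw, RDF_TYPE, EX + "RawData"),
        (raw, EX + "has_file", Literal(raw_file)),
        (normalized, RDF_TYPE, EX + "NormalizedData"),
        (normalized, EX + "has_file", Literal(normalized_file)),
        # The protocol application is an individual, one per real analysis run.
        (application, RDF_TYPE, EX + "ProtocolApplication"),
        (application, EX + "has_protocol", EX + protocol_name),
        (application, EX + "has_derivation_source", raw),
        (application, EX + "has_derivative", normalized),
    ]


def to_ntriples(triples):
    """Serialize the triples as N-Triples lines (URIs in <>, literals quoted)."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


triples = describe_derivation("sample_1.CEL", "sample_1.CHP", "normalization_protocol")
print(to_ntriples(triples))
```

this keeps the RDF to a handful of triples per file instead of one per 
probe, which is the point of referencing the file rather than converting it.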

cheers,
michael

----- Original Message ----- 
From: "Kei Cheung" <kei.cheung@yale.edu>
To: "mdmiller" <mdmiller53@comcast.net>
Cc: "Jim McCusker" <james.mccusker@yale.edu>; "Helena Deus" 
<helenadeus@gmail.com>; "HCLS" <public-semweb-lifesci@w3.org>
Sent: Monday, December 07, 2009 7:32 AM
Subject: Re: BioRDF Telcon


> mdmiller wrote:
>> hi jim and lena,
>>
>> great progress!  this will be a nice tool.
>>
>> a couple of comments.
>>
>> 1) i think ProtocolApplication is best seen as an individual instance of 
>> the Protocol class.  quite often there are arguments whether ontologies 
>> should have individuals or be simply classes.  to me, that doesn't apply 
>> here where real world objects are being connected to ontologies.  the 
>> BioSource is realized as the  'Source Name' column in MAGE-TAB and those 
>> entries represent real people in studies, mice or rats in non-clinical 
>> studies, etc., and the characteristics values like age represent real 
>> individual instances of age.  in the same way, the values in the Protocol 
>> REF column of MAGE-TAB are real wet-lab or analysis individual instances 
>> of protocols, called protocol applications in MAGE-OM.
> It sounds like we need to look at how to map column names and entries to 
> classes, instances, and relationships appropriately.
>>
>> failure to make this distinction, to me, has obscured how much value 
>> ontologies can have in the real world.  too often i see ontologies considered 
>> in and of themselves, which has its own value certainly, but not for the 
>> use cases i have dealing with real biological data.
>
> To me, ontologies can be used to facilitate integrated semantic queries 
> across experiments/datasets.
>>
>> 2) for this use case, the information between the 'Source Name' and its 
>> characteristics and the 'Derived Array Data Matrix File' or 'Derived 
>> Array Data File' has limited usefulness.  error correction and 
>> normalization can make some difference but if the provider of the 
>> MAGE-TAB is trusted, all that is pretty routine these days.  the 
>> above combined with experimental factors and experiment design info is 
>> probably 95% to 99.9% of the worthwhile information from the MAGE-TAB.  if 
>> one notices a difference in the final gene set between two experiments 
>> that look the same, only then it might be worthwhile going into more 
>> detail.
>>
>> and, as has been noted, the MAGE-TAB information needs to be supplemented 
>> with the information on the final gene set, its expression values, and 
>> the higher-level analysis that was used, which is usually buried in the 
>> paper.
> While some of the protocols are standardized, the data protocols for 
> obtaining things like gene lists vary a lot. One of my questions is 
> whether such data analysis protocols can somehow be entered into MAGE-TAB.
>>
>> 3) i'm not sure if there was a desire to capture the raw data in the RDF. 
>> that will be, for affymetrix, a million to six million probes in the CEL 
>> file, even the processed data in the CHP file would have 20,000 to 60,000 
>> probe sets.  i'm not sure if that is the best way to represent that.
> At least for now, I don't think we need to convert the huge primary data 
> files (e.g., CEL file) into RDF. For the time being, we are more focused 
> on the processed gene lists that may be associated with more biological 
> meanings.
>
> Cheers,
>
> -Kei
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> mdmiller53@comcast.net
>>
>> ----- Original Message ----- From: "Jim McCusker" 
>> <james.mccusker@yale.edu>
>> To: "Helena Deus" <helenadeus@gmail.com>
>> Cc: "Kei Cheung" <kei.cheung@yale.edu>; "mdmiller" 
>> <mdmiller53@comcast.net>; "HCLS" <public-semweb-lifesci@w3.org>
>> Sent: Monday, November 30, 2009 8:19 AM
>> Subject: Re: BioRDF Telcon
>>
>>
>> I'm following a similar strategy, but have been following the MGED
>> ontology where possible. I've finished aligning the IDF portion, and
>> have started on SDRF. MGED ontology is missing a property and class
>> for what is often termed as ProtocolApplication, which usually serves
>> as an edge between derived-from and derived nodes, while linking to
>> the protocol used for the derivation. I am planning on creating this
>> link in a MAGE extensions ontology, but would like to vet the
>> structure here:
>>
>> ProtocolApplication is a class.
>>
>> New properties:
>>
>> has_derivation_source
>> has_derivative
>>
>> And then ProtocolApplication would have the restrictions:
>>
>> has_protocol some Protocol
>>
>> I don't put domains, etc. on the derived properties to allow use in
>> directly describing derivations if people so choose. There is no
>> superclass for all nodes that can be derived or derived from, so I'm
>> not bothering with restrictions for those, although I could add a
>> union restriction to it.
>>
>> If this structure is acceptable to people, I can publish the ontology
>> for general use pretty quickly, and let us work from the same data
>> structure. I would appreciate any feedback.
>>
>> Jim
>>
>> On Monday, November 30, 2009, Helena Deus <helenadeus@gmail.com> wrote:
>>> @Kei,
>>>
>>>
>>>
>>> When you said data structure, did you mean the RDF structure?
>>> For now, all I have is the java object returned by parser. I've been 
>>> using Limpopo, which creates an object that I can then parse to RDF using 
>>> Jena. The challenge, though, has been coming up with the predicates to 
>>> formalize the relationships between the various elements. I'm using the 
>>> XML structures for IDF/SDRF etc. at http://magetab-om.sourceforge.net to 
>>> automatically generate the structure that will contain the data. My plan 
>>> is to then create the RDF triples that use the attributes described in 
>>> those documents and populate them with the data from the MAGE-TAB java 
>>> object created by Limpopo.
>>>
>>> Right now all I have is a very raw RDF/XML document describing the 
>>> relationships in the IDF structure: 
>>> http://magetab2rdf.googlecode.com/svn/trunk/magetabpredicates.rdf
>>> The triples for that had to be encoded manually using Jena by reading 
>>> the model.
>>> @Satya and Jun
>>> I would very much like to be involved in that effort, do you already 
>>> have a URL that I can look at?
>>>
>>> Thanks,
>>> Lena
>>> On Tue, Nov 24, 2009 at 2:19 PM, Kei Cheung <kei.cheung@yale.edu> wrote:
>>> Hi Lena et al,
>>>
>>> When you said data structure, did you mean the RDF structure? If so, is 
>>> there a pointer to the structure that we can look at?
>>>
>>> As discussed during yesterday's call, Jun and Satya will help create a 
>>> wiki page for listing some of the requirements for provenance/workflow 
>>> in the context of gene lists, perhaps we should also use it to help 
>>> coordinate some of the future activities (people also brought up Taverna 
>>> during the call yesterday). Please coordinate with Satya and Jun.
>>>
>>> Cheers,
>>>
>>> -Kei
>>>
>>> Helena Deus wrote:
>>>
>>> Hi all,
>>>
>>> I apologize for missing the call yesterday! It seems you had a pretty 
>>> interesting discussion! :-)
>>> If I understand Michael's statement, parsing the MAGE-TAB/MAGE-ML into 
>>> RDF would result in obtaining only the raw and processed data files but 
>>> not the mechanism used to process it nor the resulting gene list. That's 
>>> also what I concluded after looking at the data structure created by 
>>> Tony Burdett's Limpopo parser. However, having the raw data as linked 
>>> data is already a great start! Kei, should I be looking into Taverna in 
>>> order to reprocess the raw files with a traceable analysis workflow?
>>>
>>> Thanks!
>>> Lena
>>>
>>>
>>>
>>>
>>> On Tue, Nov 24, 2009 at 9:59 AM, mdmiller <mdmiller53@comcast.net 
>>> <mailto:mdmiller53@comcast.net>> wrote:
>>>
>>>  hi all,
>>>
>>>  (from the minutes)
>>>
>>>  "Yolanda/Kei/Scott: semantic annotation/description of workflow
>>>  would enable the retrieval of data relevant to that workflow (i.e.
>>>  data that could be used to populate that workflow for a different
>>>  experimental scenario)"
>>>
>>>  what is typically in a MAGE-TAB/MAGE-ML document are the protocols
>>>  for how the source was processed into the extract then how the
>>>  hybridization, feature extraction, error and normalization were
>>>  performed. these are interesting and different protocols can
>>>  cause differences at this level but it is pretty much a known art
>>>  and usually not of too much interest or variability.
>>>
>>>  what is usually missing from those documents, along with the final
>>>  gene list, is how that gene list was obtained, what higher level
>>>  analysis was used, that is generally only in the paper unfortunately.
>>>
>>>  cheers,
>>>  michael
>>>
>>>  ----- Original Message ----- From: "Kei Cheung"
>>>
>>>  <kei.cheung@yale.edu <mailto:kei.cheung@yale.edu>>
>>>  To: "HCLS" <public-semweb-lifesci@w3.org
>>>
>>>  <mailto:public-semweb-lifesci@w3.org>>
>>>  Sent: Monday, November 23, 2009 1:27 PM
>>>  Subject: Re: BioRDF Telcon
>>>
>>>
>>>
>>>  Today's BioRDF minutes are available at the following:
>>>
>>>
>>> http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2009/11-23_Conference_Call
>>>
>>>  Thanks to Rob for scribing.
>>>
>>>  Cheers,
>>>
>>>  -Kei
>>>
>>>  Kei Cheung wrote:
>>>
>>>  This is a reminder that the next BioRDF telcon will
>>>  be held at 11 am EST (5 pm CET) on Monday, November 23
>>>  (see details below).
>>>
>>>  Cheers,
>>>
>>>  -Kei
>>>
>>>  == Conference Details ==
>>>  * Date of Call: Monday November 23, 2009
>>>  * Time of Call: 11:00 am Eastern Time
>>>  * Dial-In #: +1.617.761.6200 (Cambridge, MA)
>>>  * Dial-In #: +33.4.89.06.34.99 (Nice, France)
>>>  * Dial-In #: +44.117.370.6152 (Bristol, UK)
>>>  * Participant Access Code: 4257 ("HCLS")
>>>
>>>  * IRC Channel: irc.w3.org <http://irc.w3.org> port 6665
>>>  channel #
>>>
>>
>
>
>
Received on Thursday, 10 December 2009 14:50:32 UTC