Re: BioRDF Telcon

Hi Michael,

Thanks for your detailed description of MAGE-ML. For our use case, we 
probably don't need all of the information captured in MAGE-ML. We are 
currently focusing on experiment/sample annotation (including some 
provenance, as you indicated) and on gene lists and how they are linked 
to existing ontologies. A couple of convincing examples may be enough to 
start. I can relay your comments about the validity of their MAGE-ML 
files to the consortium, although I don't know whether they can address 
them.

Cheers,

-Kei

Miller, Michael D (Rosetta) wrote:

>hi kei and helen,
>
>like helen, i've been following the HCLS working groups with great
>interest.  as one of the designers, with helen, of the MAGE-ML and
>MAGE-TAB specs i might be able to provide a little technical insight
>into the formats.
>
>(from helen)
>"This is probably as we don't have data - here's a list of human 
>experiments with the term neuron - if any of these are useful, then I 
>can prioritize their curation and inclusion in an atlas release"
>
>kei, are the NIH Neuroscience Microarray Consortium experiments you've
>cited and others like them in GEO or ArrayExpress?  a set of those could
>be a good starting point for helen.
>  
>

My understanding is that the publicly visible microarray projects in the 
Neuroscience Microarray Consortium should also be in GEO and/or 
ArrayExpress, although I don't know whether all the annotations are 
preserved.


>first, MAGE-ML is based on a DTD[1], not an XSD.  in early 2002 as the
>OMG Gene Expression specification[1] was being finalized, XSD was still
>in its infancy so we weren't comfortable at that point generating a XSD.
>the MAGE-OM UML[2], in a very early XMI format from Rational Rose and
>UniSys, was used to generate the DTD with code we wrote ourselves[3]. 
>
>the UML model was designed to capture the flow of a microarray
>experiment and how the resulting arrays were organized in the experiment
>based on how the samples were treated and/or on the samples' phenotypes
>for the purpose of a reviewer understanding the methodology and for a
>researcher replicating and/or re-analyzing the results.  
>
>some of the details of the flow may not be of much interest, i.e. it
>might be worth simply connecting the BioSource elements with their gene
>expression data and not worrying about how the hybridization was
>performed.  but that depends on what you want to do and you know that
>better than i.
>
>also, the data itself are specified in external files, typically in a
>white-space delimited format where the column headers are specified in
>the MAGE-ML file in the QuantitationTypeDimension element and the
>identifiers of the row specified in one of the three
>DesignElementDimension elements, Feature, Reporter, CompositeSequence,
>depending on how derived the data is.  Also the data can be in a vendor
>specific format such as the Affymetrix CEL (since the CEL file
>internally specifies the dimensions often they are left out of the
>MAGE-ML document).
>
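
To make the external-file layout described above concrete, here is a minimal Python sketch. It assumes the data file has no header row and that its rows and columns appear in the same order as the identifiers and names listed in the MAGE-ML document; the column names, row identifiers, and values below are hypothetical, not taken from any real submission.

```python
def read_external_matrix(text, column_names, row_ids):
    """Label a whitespace-delimited data matrix.

    column_names stands in for the names listed under
    QuantitationTypeDimension, row_ids for the identifiers listed under a
    DesignElementDimension (both hypothetical here).
    """
    # split() with no argument handles spaces and tabs alike
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != len(row_ids):
        raise ValueError("row count does not match the design element list")
    table = {}
    for row_id, fields in zip(row_ids, rows):
        if len(fields) != len(column_names):
            raise ValueError("column count mismatch in row " + row_id)
        table[row_id] = dict(zip(column_names, (float(f) for f in fields)))
    return table


# Hypothetical example values
columns = ["Signal", "Detection_p"]
reporters = ["reporter.1", "reporter.2"]
data = "120.5 0.01\n87.0\t0.20\n"

matrix = read_external_matrix(data, columns, reporters)
print(matrix["reporter.2"]["Signal"])
```

Vendor formats such as CEL carry their own dimensions and would need a dedicated reader rather than this kind of labelling.
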
>the ExperimentalFactor elements are certainly relevant and if you've
>looked at some of the examples you will notice that the BioSource
>elements, in particular, and other elements are annotated by
>OntologyEntry elements.  from the gene expression specification:
>
>"OntologyEntry
>A single entry from an ontology or a controlled vocabulary. For
>instance, category
>could be 'species name,' value could be 'homo sapiens' and ontology
>would be
>taxonomy database, NCBI."
>
>for the element an ontology entry element is annotating, we looked at it
>as a way of specifying something like "the object identified by the
>element is an instance of the class/individual specified by the
>OntologyEntry"
>
>so from "kitm-affy-droso-176167" one sees that the BioSource is an
>"instance of" Drosophila, whole animal, whole head and an age of 3 days:
>
>         <BioSource
>identifier="arrayconsortium.tgen.org::biosource.181527" name="Oregon R
>head 3d">
>            <Characteristics_assnlist>
>               <OntologyEntry category="Organism" value="Drosophila"
>description="Drosophila">
>                  <OntologyReference_assn>
>                     <DatabaseEntry accession="#Organism"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#Organism">
>                        <Database_assnref>
>                           <Database_ref identifier="MO"/>
>                        </Database_assnref>
>                     </DatabaseEntry>
><!-- snip -->
>                  </OntologyReference_assn>
>               </OntologyEntry>
>               <OntologyEntry category="OrganismPart" value="whole
>animal" description="">
>                  <OntologyReference_assn>
>                     <DatabaseEntry accession="#OrganismPart"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#OrganismPar
>t">
>                        <Database_assnref>
>                           <Database_ref identifier="MO"/>
>                        </Database_assnref>
>                     </DatabaseEntry>
>                  </OntologyReference_assn>
><!-- snip -->
>               </OntologyEntry>
>               <OntologyEntry category="OrganismPartRegion" value="whole
>head" description="">
><!-- snip -->
>               </OntologyEntry>
><!-- snip -->
>               <OntologyEntry category="Age" value="Age">
>                  <OntologyReference_assn>
>                     <DatabaseEntry accession="#Age"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#Age">
>                        <Database_assnref>
>                           <Database_ref identifier="MO"/>
>                        </Database_assnref>
>                     </DatabaseEntry>
>                  </OntologyReference_assn>
>                  <Associations_assnlist>
>                     <OntologyEntry category="has_measurement"
>value="has_measurement">
>                        <OntologyReference_assn>
>                           <DatabaseEntry accession="#has_measurement"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#has_measure
>ment">
>                              <Database_assnref>
>                                 <Database_ref identifier="MO"/>
>                              </Database_assnref>
>                           </DatabaseEntry>
>                        </OntologyReference_assn>
>                        <Associations_assnlist>
>                           <OntologyEntry category="Measurement"
>value="Measurement">
>                              <OntologyReference_assn>
>                                 <DatabaseEntry accession="#Measurement"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#Measurement
>">
>                                    <Database_assnref>
>                                       <Database_ref identifier="MO"/>
>                                    </Database_assnref>
>                                 </DatabaseEntry>
>                              </OntologyReference_assn>
>                              <Associations_assnlist>
>                                 <OntologyEntry category="has_value"
>value="has_value">
>                                    <OntologyReference_assn>
>                                       <DatabaseEntry
>accession="#has_value"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#has_value">
>                                          <Database_assnref>
>                                             <Database_ref
>identifier="MO"/>
>                                          </Database_assnref>
>                                       </DatabaseEntry>
>                                    </OntologyReference_assn>
>                                    <Associations_assnlist>
>                                       <OntologyEntry
>category="has_value" value="3"/>
>                                    </Associations_assnlist>
>                                 </OntologyEntry>
>                                 <OntologyEntry category="has_units"
>value="has_units">
>                                    <OntologyReference_assn>
>                                       <DatabaseEntry
>accession="#has_units"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#has_units">
>                                          <Database_assnref>
>                                             <Database_ref
>identifier="MO"/>
>                                          </Database_assnref>
>                                       </DatabaseEntry>
>                                    </OntologyReference_assn>
>                                    <Associations_assnlist>
>                                       <OntologyEntry
>category="TimeUnit" value="days" description="24 hours, time unit">
>                                          <OntologyReference_assn>
>                                             <DatabaseEntry
>accession="#days"
>URI="http://mged.sourceforge.net/ontologies/MGEDontology.php#days">
>                                                <Database_assnref>
>                                                   <Database_ref
>identifier="MO"/>
>                                                </Database_assnref>
>                                             </DatabaseEntry>
>                                          </OntologyReference_assn>
>                                       </OntologyEntry>
>                                    </Associations_assnlist>
>                                 </OntologyEntry>
>                              </Associations_assnlist>
>                           </OntologyEntry>
>                        </Associations_assnlist>
>                     </OntologyEntry>
>                  </Associations_assnlist>
>               </OntologyEntry>
><!-- snip -->
>            </Characteristics_assnlist>
><!-- snip -->
>         </BioSource>
>
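
As a rough illustration of how the annotations in the BioSource example above could be pulled out for our use case, here is a minimal Python sketch using only the standard library. The fragment is a trimmed copy of the example (the OntologyReference_assn elements are omitted for brevity), and the function names are hypothetical. It assumes nested values are always reached through Associations_assnlist/OntologyEntry, as in this example; other MAGE-ML documents may be structured differently.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the BioSource example; database references omitted.
FRAGMENT = """
<BioSource identifier="arrayconsortium.tgen.org::biosource.181527"
           name="Oregon R head 3d">
  <Characteristics_assnlist>
    <OntologyEntry category="Organism" value="Drosophila"/>
    <OntologyEntry category="OrganismPart" value="whole animal"/>
    <OntologyEntry category="Age" value="Age">
      <Associations_assnlist>
        <OntologyEntry category="has_measurement" value="has_measurement">
          <Associations_assnlist>
            <OntologyEntry category="Measurement" value="Measurement">
              <Associations_assnlist>
                <OntologyEntry category="has_value" value="has_value">
                  <Associations_assnlist>
                    <OntologyEntry category="has_value" value="3"/>
                  </Associations_assnlist>
                </OntologyEntry>
                <OntologyEntry category="has_units" value="has_units">
                  <Associations_assnlist>
                    <OntologyEntry category="TimeUnit" value="days"/>
                  </Associations_assnlist>
                </OntologyEntry>
              </Associations_assnlist>
            </OntologyEntry>
          </Associations_assnlist>
        </OntologyEntry>
      </Associations_assnlist>
    </OntologyEntry>
  </Characteristics_assnlist>
</BioSource>
"""


def leaf_entries(entry):
    """Return (category, value) pairs for the innermost OntologyEntry elements."""
    nested = entry.findall("Associations_assnlist/OntologyEntry")
    if not nested:
        return [(entry.get("category"), entry.get("value"))]
    leaves = []
    for child in nested:
        leaves.extend(leaf_entries(child))
    return leaves


def summarize(biosource):
    """Map each top-level characteristic category to its leaf annotations."""
    summary = {}
    for entry in biosource.findall("Characteristics_assnlist/OntologyEntry"):
        summary[entry.get("category")] = leaf_entries(entry)
    return summary


biosource = ET.fromstring(FRAGMENT.strip())
summary = summarize(biosource)
print(biosource.get("name"), summary)
```

Flattening to leaf pairs drops the intermediate has_measurement/Measurement structure, which is probably acceptable for simple sample annotation but would lose information if the full MGED Ontology path matters.
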
>by the by, the MAGE-ML examples i've looked at from the NIH Neuroscience
>Microarray Consortium are not in a valid MAGE-ML.dtd format.  i'll send a
>follow-up e-mail dealing with the problems i see.  they are not far off
>but are invalid in a number of places.
>
>cheers,
>michael
>
>Michael Miller
>Lead Software Developer
>Rosetta Biosoftware Business Unit
>www.rosettabio.com
>
>[1] http://www.omg.org/spec/GENE/1.1/
>
>(sadly, the original links to the MAGEstk appear to be broken, this
>mirror site still has the MAGE related files built up over the years,
>here's my best guess as to the most helpful for the references)
>[2]
>http://www.mirrorservice.org/sites/download.sourceforge.net/pub/sourcefo
>rge/m/mg/mged/ 	
>	v1.0:
>http://www.mirrorservice.org/sites/download.sourceforge.net/pub/sourcefo
>rge/m/mg/mged/MAGE-2002-01-07.xmi.gz/MAGE-2002-01-07.xmi
>	v1.1:
>http://www.mirrorservice.org/sites/download.sourceforge.net/pub/sourcefo
>rge/m/mg/mged/MAGE.xmi.gz[peek]
>[3]
>http://www.mirrorservice.org/sites/download.sourceforge.net/pub/sourcefo
>rge/m/mg/mged/MAGE%20Java%20API/20010911/
>
>
>  
>
>>-----Original Message-----
>>From: public-semweb-lifesci-request@w3.org 
>>[mailto:public-semweb-lifesci-request@w3.org] On Behalf Of 
>>Helen Parkinson
>>Sent: Wednesday, July 22, 2009 2:55 AM
>>To: Kei Cheung
>>Cc: HCLS; James Malone
>>Subject: Re: BioRDF Telcon
>>
>>Responses in line.
>>
>>
>>    
>>
>>>>1. We have text mined much of the Affymetrix GEO data, curated it and
>>>>imported it into ArrayExpress - there is now much better sample
>>>>annotation than the native data in GEO. We also are running QC across
>>>>all the data files so we know which should be excluded for future
>>>>analyses.
>>>>
>>>I think it's the right thing to do both to enrich data annotation and
>>>to enhance data quality. This will help data integration a lot.
>>>
>>>Currently, we are exploring query federation in the neuroscience
>>>context. It'd be great if we can use the neuroscience use case(s) to
>>>help drive your ontology development for text mining and data
>>>visualization. In addition to the NIH neuroscience microarray
>>>consortium, it may be possible to collaborate with the Neuroscience
>>>Information Framework (NIF) to see if we can utilize some of its
>>>resources (e.g., neuron ontology).
>>>
>>Re-use of the neuron ontology is possible, but it depends on whether
>>there is available data to annotate either in ArrayExpress or GEO. If
>>you can get me a list of experiment accessions or pubmed ids I can see
>>if this is feasible
>>
>>>>3. We have summary level data of genes x conditions for ~30,000 hybs
>>>>worth of data in our gene expression atlas with p values indicating
>>>>relative under/over-expression. We are planning to export these as
>>>>triples as soon as we publish the atlas - these may be of interest.
>>>>www.ebi.ac.uk/gxa - there's an API at present, but it will be
>>>>improved in the next month or so.
>>>>
>>>It fits well with what we're currently exploring in terms of gene list
>>>representation and linking genes and samples to existing ontologies.
>>>It'd be great if we can download or fetch RDF triples from EBI atlas.
>>>
>>We have a student starting work on this in a month, if you can produce
>>concrete use cases for how you want to access these data we can do
>>something.
>>
>>>>4. If neuroscience data is of specific interest we could do a themed
>>>>atlas release where we add datasets for a given community or project
>>>>and make these available. These can be identified by ArrayExpress or
>>>>GEO accession or pubmed and we can re-annotate the genes vs
>>>>Uniprot/Ensembl, add GO terms, etc and curate the sample attributes
>>>>and experimental variables. These pipelines are already in place as
>>>>part of our production workflow.
>>>>
>>>I think it's a great idea to do a themed atlas (e.g., neuro-atlas). I
>>>just played with gxa a little bit. It's nice! For example, I could
>>>find genes that are over-expressed in the hippocampus brain region
>>>across different experiments. However, when I tried to do the same
>>>thing for neurons, there are only a few neuron types that I can
>>>select. It'd be nice if we can have more neuron types, for instance.
>>>
>>This is probably as we don't have data - here's a list of human
>>experiments with the term neuron - if any of these are useful, then I
>>can prioritise their curation and inclusion in an atlas release
>>
>> 
>>http://www.ebi.ac.uk/microarray-as/ae/browse.html?keywords=neuron&species=Homo+sapiens&array=&exptype=&pagesize=25&sortby=releasedate&sortorder=descending
>>
>>and brain
>>
>>http://www.ebi.ac.uk/microarray-as/ae/browse.html?keywords=brain&species=Homo+sapiens&array=&exptype=&pagesize=25&sortby=releasedate&sortorder=descending
>>
>>>>I'd be very happy to collaborate, and for this group to use our data,
>>>>we spend a lot of time adding semantic value to it, so please let me
>>>>know if this is of interest
>>>>
>>>We are also looking into the possibility of establishing collaboration
>>>with the scientific discourse task force based on the microarray use
>>>case. We're planning to have a microarray-related presentation and
>>>discussion on Aug. 31 (Monday, 11 am EDT/5 pm CET). Details will be
>>>announced later. It'd be great if you can join the BioRDF call to
>>>participate in the discussion.
>>>
>>>Cheers,
>>>
>>>-Kei
>>>
>>>>best regards
>>>>
>>>>Helen
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>Kei Cheung wrote:
>>>>        
>>>>
>>>>>The minutes for yesterday's BioRDF call are available at:
>>>>>
>>>>>
>>>>>http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2009-07-20_Conference_Call
>>>>>
>>>>>Thanks to Lena for scribing and Eric for retrieving the transcript
>>>>>from the IRC log.
>>>>>
>>>>>Cheers,
>>>>>
>>>>>-Kei
>>>>>
>>>>>Kei Cheung wrote:
>>>>>          
>>>>>
>>>>>>This is a reminder that the next BioRDF teleconf. will be held at
>>>>>>11 am EDT (5 pm CET) on Monday, July 20 (see details below).
>>>>>>
>>>>>>I created the following wiki page for discussing the microarray use
>>>>>>case:
>>>>>>
>>>>>>http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/QueryFederation2
>>>>>>
>>>>>>Cheers,
>>>>>>
>>>>>>-Kei
>>>>>>
>>>>>>== Conference Details ==
>>>>>>* Date of Call: Monday July 20, 2009
>>>>>>* Time of Call: 11:00 am Eastern Time
>>>>>>* Dial-In #: +1.617.761.6200 (Cambridge, MA)
>>>>>>* Dial-In #: +33.4.89.06.34.99 (Nice, France)
>>>>>>* Dial-In #: +44.117.370.6152 (Bristol, UK)
>>>>>>* Participant Access Code: 4257 ("HCLS")
>>>>>>* IRC Channel: irc.w3.org port 6665 channel #hcls (see
>>>>>>[http://www.w3.org/Project/IRC/ W3C IRC page] for details, or see
>>>>>>[http://cgi.w3.org/member-bin/irc/irc.cgi Web IRC])
>>>>>>* Duration: ~1 hour
>>>>>>* Frequency: bi-weekly
>>>>>>* Convener: Kei Cheung
>>>>>>
>>>>>>== Agenda ==
>>>>>>* Roll call and introduction (Kei)
>>>>>>* TCM data quick update (Jun, Kei)
>>>>>>* Query federation use case expansion (microarray) (All)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>            
>>>>>>
>>>>>          
>>>>>
>>-- 
>>Helen Parkinson, PhD
>>ArrayExpress Production Coordinator,
>>Microarray Informatics Team, 
>>EBI
>>
>>EBI 01223 494672
>>Skype: helen.parkinson.ebi
>>
>>
>>
>>    
>>
>
>  
>

Received on Friday, 24 July 2009 18:34:35 UTC