Re: Playing with sets in OWL...

From: kc28 <kei.cheung@yale.edu>
Date: Sat, 09 Sep 2006 22:22:30 -0400
To: "Miller, Michael D (Rosetta)" <Michael_Miller@Rosettabio.com>
Cc: William Bug <William.Bug@DrexelMed.edu>, Alan Ruttenberg <alanruttenberg@gmail.com>, Marco Brandizi <brandizi@ebi.ac.uk>, semantic-web@w3.org, public-semweb-lifesci@w3.org
Message-id: <450376E6.6050907@yale.edu>

Hi Michael et al,

The following tools, for example, are available for microarray gene annotation:

SOURCE -- http://nar.oxfordjournals.org/cgi/content/full/31/1/219
KARMA -- http://nar.oxfordjournals.org/cgi/content/full/32/suppl_2/W441
RESOURCERER -- http://pga.tigr.org/tigr-scripts/magic/r1.pl
DRAGON -- http://pevsnerlab.kennedykrieger.org/dragon.htm

These tools take a gene list of interest and return annotation collected 
from multiple sources (e.g., gene ontology, UniProt, and KEGG). It might 
be useful if these tools can be made semantic-web-aware.
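
As a minimal sketch of what "semantic-web-aware" output from such a tool might look like, the Python below takes a gene list and returns the collected annotation as subject/predicate/object triples. Everything in it is an assumption made for illustration: the two in-memory "sources", the predicate names hasGOTerm and inPathway, and the placeholder values are invented, and none of the tools listed above is claimed to work this way.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical stand-ins for annotation sources. The identifiers and values
# are invented placeholders, not real database records.
SOURCES: Dict[str, Dict[str, List[str]]] = {
    "hasGOTerm": {"g1": ["go_term_A"], "g2": ["go_term_A", "go_term_B"]},
    "inPathway": {"g2": ["pathway_X"], "g3": ["pathway_X"]},
}


def annotate(genes: List[str]) -> Set[Triple]:
    """Collect annotation for each gene from every source as triples."""
    triples: Set[Triple] = set()
    for predicate, table in SOURCES.items():
        for gene in genes:
            for value in table.get(gene, []):
                triples.add((gene, predicate, value))
    return triples


def shared_values(triples: Set[Triple], predicate: str, genes: List[str]) -> Set[str]:
    """Values of `predicate` that all genes in the list have in common."""
    per_gene = [
        {o for s, p, o in triples if s == g and p == predicate} for g in genes
    ]
    return set.intersection(*per_gene) if per_gene else set()


result = annotate(["g1", "g2", "g3"])
common_pathways = shared_values(result, "inPathway", ["g2", "g3"])
```

Returning triples rather than a formatted report is the design choice of interest here: the same output can be merged with triples from other tools and queried about the gene list as a set.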



Miller, Michael D (Rosetta) wrote:

> Hi Bill and Alan,
> You misunderstand my use case.
> My researcher doesn't much care that the world knows about his/her 
> microarray experiment yet--in fact he/she may very well be searching 
> for interesting information about the gene set to see whether it is 
> worth going further or whether the experiment was just retreading old 
> ground or whatever.  There's this new tool, the semantic web, so the 
> researcher is going to submit this set of genes and hopefully get 
> useful information on them as a set.
> Now this researcher probably assumes that as long as the naming 
> source of the genes is indicated, no further work is required.  This 
> naming source may very well be GenBank, which, of course, isn't likely 
> to be set up for easy access by pure semantic web tools for many 
> years, if ever, much as many people on this list would like that.  It 
> had better be supported by the semantic web anyway: for all the 
> faults of GenBank and the other current sequence databases, if the 
> semantic web can't garner information from them, I don't see much 
> hope for adoption by the common researcher.
> So perhaps the researcher gets back that the genes are part of a 
> particular pathway, that a few papers in PubMed mention them, and 
> that some microarray experiments in public repositories had them 
> significantly up- or down-regulated or affected by some drug, but the 
> conclusions for this experiment still appear to be worthy.  Then 
> hopefully the experiment will get annotated (with these semantic web 
> results as well), be made part of a submission, and be deposited in a 
> public database, where it is now accessible to others searching the 
> semantic web.
> "In translating the instance data into OWL, it should then be possible 
> to perform the sort of higher level sorting and re-analysis Alan 
> describes."
> Although this wasn't the use case I was talking about in this thread, 
> it is obviously a very interesting use case as well.  I believe I 
> talked about something similar in an earlier e-mail.  In general, one 
> will be unlikely to find out anything about the individuals in the 
> experiment (other than genes), because they will be truly unique 
> instances of things like samples, hybridizations, feature extractions 
> and data.  But if the researcher annotates these individuals from rich 
> ontologies, or from not-so-great sources that tools are developed to 
> compensate for, then I agree entirely with you and Alan that much 
> useful reasoning can be done on the semantic web.
> "The tendency when presenting these results in research articles - and 
> often when sharing the data - is to provide the analyzed/reduced view 
> of the data"
> Actually, I just heard Leroy Hood of ISB and other fame and Eric 
> Schadt of Rosetta Inpharmatics (our parent company) give excellent 
> talks at the MGED9 meeting where their research is opening up and 
> bringing in information and data from a vast number of resources and 
> tying it together into big pictures, all without the semantic web.  
> I'm sure they would love to have the kind of power envisioned by the 
> W3C for the semantic web but they won't touch it until it is 
> easy--they are busy doing their core jobs.
> So I really think that we need to:
> 1) make sure the semantic web allows people to poke at it, i.e., to 
> ask whether there is anything interesting about a particular object, 
> without having to say why they are interested
> 2) provide tools so that they can annotate their objects well so that 
> when they are submitted they can be incorporated into the web (moving 
> forward, this is one aim of the MAGEstk for gene expression experiments)
> 3) ensure that existing imperfect resources have semantic web tools 
> that can overcome those imperfections and deliver the usefulness 
> people are currently getting from them
> 4) most importantly, get a useful semantic web out there now (there's 
> plenty of information available), then make it better as time goes 
> along
> The resources that are already set up for easy integration into the 
> semantic web will come along for free.
> cheers,
> Michael
>     -----Original Message-----
>     *From:* William Bug [mailto:William.Bug@DrexelMed.edu]
>     *Sent:* Friday, September 08, 2006 8:39 PM
>     *To:* Alan Ruttenberg
>     *Cc:* Miller, Michael D (Rosetta); Marco Brandizi;
>     semantic-web@w3.org; public-semweb-lifesci@w3.org
>     *Subject:* Re: Playing with sets in OWL...
>     I think Alan is making a very important general point here.
>     MAGE-ML/MAGE-OM is perfectly tuned to the needs of:
>     a) transferring entire microarray data sets across systems
>     b) persisting microarray data sets (at least in certain scenarios)
>     c) providing a systematic, normative interface for writing code to
>     access specific elements and data collections one typically finds
>     in the description of a microarray data set
>     This is the sort of functionality data models are particularly
>     well suited to supporting.
>     MAGE-OM/MAGE-ML is also the result of a huge amount of
>     deliberation from dozens of experts in the informatics fields
>     involved in generating, storing, and manipulating microarray data.
>     When it comes to manipulating the information associated with a
>     microarray experiment - or collection of experiments - in a
>     semantically explicit manner, however, RDF is really the
>     preferred formalism providing the required explicit semantics,
>     while still providing the expressiveness needed to characterize
>     the inherent variety, complexity, and granularity in this
>     information.  When it comes to filling out the assertions to the
>     point of being able to reason on them - even simple reasoning such
>     as consistency checks - some dialect of OWL will be the formalism
>     of choice, I believe.
>     I think Alan gives a very clear example of how to use OWL in this
>     particular situation described by Marco.
>     I have just a few questions in followup:
>     1) The MAGE-ML XML Schema provides for a great deal of flexibility
>     via the use of optional fields.  Still, any given use in a
>     specific lab for a specific collection of microarray experiments
>     is likely to develop its own conventions for which fields to use
>     and which not to use - and how to populate the more "open"
>     elements.  With this in mind, it seems it should be possible under
>     those circumstances to create an XSLT to translate the individuals
>     contained in a MAGE-ML instance according to the elemental OWL
>     classes Alan described -
>     Expression_technology, Expression_technology_map, Spot_mapping,
>     Expression_profile_experiment, Spot_intensity,
>     Gene_expression_computation.  The latter can probably be
>     reconstituted from the MAGE-ML elements BioAssay, BioAssayData,
>     HigherLevelAnalysis, Measurement, and QuantitationType.  In
>     translating the instance data into OWL, it should then be possible
>     to perform the sort of higher level sorting and re-analysis Alan
>     describes.  The translation should probably take the "open world"
>     assumption into account, so the resulting OWL statements will
>     provide the intended semantic completeness, even if that isn't
>     represented in the MAGE-ML instances themselves.
>     2) I think the use of OWL Alan describes here is going to be
>     critical to performing broad field, large scale re-analysis of
>     complex data sets such as microarray experiments and various types
>     of neuro-images containing segmented geometric objects (in many
>     ways equivalent to the segmentation performed on microarray images
>     to determine the location and intensity of spots).  The tendency
>     when presenting these results in research articles - and often
>     when sharing the data - is to provide the analyzed/reduced view of
>     the data.  In the context of these complex experiments, many forms
>     of re-analysis will not be possible without access to the
>     originally collected data.  Think of how critical BLAST-based
>     meta-analysis was for GenBank through the 1990s (and still is).
>     There are several underlying assertions making it possible to
>     perform such analysis.  Primary among them is the acceptance that
>     each form of sequencing technology provides a reliable way of
>     determining the probability of finding a particular nucleotide at
>     a particular location.  Many sequences are submitted with the
>     simple assertion that at position N in sequence X there is a 100%
>     probability (or 95% confidence, to be more specific) of finding
>     nucleotide A|T|G|C.  To some extent, the statistical analysis
>     performed by BLAST (and other position-sensitive,
>     cross-correlative statistical algorithms) relied on these "ground
>     facts".  For the most part, it was safe to assume this level of
>     reduced data could be safely pooled with other such sequence
>     determinations regardless of the specific sequencing device,
>     underlying biochemical protocols, and specific lots of reagents
>     used.  These same assumptions cannot generally be safely made
>     for microarray experiments, segmented MRI images - and many other
>     types of images such as IHC or in situ based images.  As an
>     example, just look to the debates in the last year or two
>     regarding the sometimes problematic nature of replicating "gene
>     expression" level results with different arrays covering the
>     "same" genes.  If we are to support the same sort of meta-analysis
>     as was common with BLAST across GenBank sequences, then we will
>     often have to supply access to the low level data elements.  This
>     in fact was a major impetus behind providing the MAGE-OM (and
>     FuGE-OM).  As I state at the top of this email with points 'a',
>     'b', & 'c', MAGE-OM/MAGE-ML is extremely useful for several
>     critical tasks related to the handling of this detailed data. 
>     When it comes to supporting the semantically-grounded analytical
>     requirements of such complex, broad field, meta-analysis, however,
>     I think OWL (and sometimes RDF alone) is going to prove a critical
>     enabling technology.
>     3) Re: anonymous classes/individuals of the type Alan describes:
>     These are essentially "blank nodes" in the RDF sense - "unnamed"
>     nodes based on a collection of necessary restrictions, if I
>     understand things correctly.  Please pardon the naive question,
>     but aren't there some caveats in terms of processing very large
>     RDF and/or OWL graphs containing "blank" or "anonymous" nodes?
>     For many OWL ontologies, this might not be a concern, but if one
>     were to be tempted to express a large variety of such sets based
>     on different groupings of the sequence probes on a collection of
>     arrays - groupings relevant to specific types of analysis - I
>     could see how these anonymous entities - especially the anonymous
>     sets of individuals - could really proliferate.
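
On point 1 above, the translation step could be prototyped before committing to XSLT. The Python sketch below reads a lab-specific XML export and emits triples typed with three of the class names from the list (Expression_profile_experiment, Spot_mapping, Spot_intensity). The XML fragment is invented and heavily simplified; it is not the real MAGE-ML schema, and the property names (spotID, geneID, intensity, levels) and the `_map` / `_level` naming scheme are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Invented, heavily simplified fragment: NOT the real MAGE-ML schema.
LAB_XML = """
<experiment id="exp0">
  <spot id="s1" gene="g1" intensity="812.5"/>
  <spot id="s2" gene="g2" intensity="40.0"/>
</experiment>
"""


def translate(xml_text: str) -> List[Triple]:
    """Turn one lab-convention XML instance into typed individuals."""
    root = ET.fromstring(xml_text)
    exp = root.get("id")
    triples: List[Triple] = [(exp, "rdf:type", "Expression_profile_experiment")]
    for spot in root.findall("spot"):
        sid = spot.get("id")
        mapping, level = f"{sid}_map", f"{sid}_level"  # hypothetical naming scheme
        triples += [
            (mapping, "rdf:type", "Spot_mapping"),
            (mapping, "spotID", sid),
            (mapping, "geneID", spot.get("gene")),
            (level, "rdf:type", "Spot_intensity"),
            (level, "spotID", sid),
            (level, "intensity", spot.get("intensity")),
            (exp, "levels", level),
        ]
    return triples


triples = translate(LAB_XML)
```

Fields that a given lab leaves empty are simply not emitted, which is compatible with the open-world reading: absence of a triple says nothing either way.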
>     Many thanks for providing this very helpful exemplar, Alan.
>     Cheers,
>     Bill
>     On Sep 8, 2006, at 9:50 PM, Alan Ruttenberg wrote:
>>     Yes. However I don't think I would change anything I wrote.
>>     Because OWL works in the open world, we can say that all these
>>     things exist, but only supply the details that we need. But
>>     having the framework which explains the meaning of what is
>>     supplied is one of the points of using ontologies. In this case,
>>     if all we know is that there was some computation that led to
>>     this gene set we could use some arbitrary name for it
>>     (remembering that if we decided to represent it later/ merge it
>>     with the experimental run we can use owl:sameAs to merge our name
>>     with the actual name).
>>     So, with reference to this ontology (generated by Marco, or
>>     imported from some standard) he could simply state:
>>     Individual(c1 type(Computation)
>>        value(geneComputedAsExpressed g1)
>>        value(geneComputedAsExpressed g2)
>>        value(geneComputedAsExpressed g3)
>>      )
>>     If he wanted to state that the source was an array experiment
>>     (but he didn't know the details), he could add to c1
>>        value(fromExperiment Individual(type(ExpressionProfileExperiment)))
>>     which uses an anonymous individual (blank node) of the
>>     appropriate type. Now you know that the data originally came from
>>     an expression profile experiment,  though you haven't needed to
>>     add any other information other than that.
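
A minimal sketch of the anonymous-individual pattern just described, using plain Python tuples as triples rather than any RDF library. The `_:b1`-style labels follow the usual blank-node notation; the helper names `blank_node` and `computation` are hypothetical, and rdf:type stands in for the `type(...)` of the abstract syntax.

```python
import itertools
from typing import List, Tuple

Triple = Tuple[str, str, str]
_counter = itertools.count(1)


def blank_node() -> str:
    """Return a fresh label for an anonymous individual ('_:' prefix)."""
    return f"_:b{next(_counter)}"


def computation(name: str, genes: List[str],
                from_array_experiment: bool = False) -> List[Triple]:
    """Build the triples for a Computation individual and its gene values."""
    triples: List[Triple] = [(name, "rdf:type", "Computation")]
    triples += [(name, "geneComputedAsExpressed", g) for g in genes]
    if from_array_experiment:
        # We know an experiment of this type exists, but nothing else about
        # it, so it gets no name of its own.
        exp = blank_node()
        triples.append((exp, "rdf:type", "ExpressionProfileExperiment"))
        triples.append((name, "fromExperiment", exp))
    return triples


c1 = computation("c1", ["g1", "g2", "g3"], from_array_experiment=True)
```

If the experiment is later identified, the anonymous node can be replaced by (or declared the same as) the named individual without touching the gene assertions.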
>>     The pattern that Marco mentions that is closest to this is
>>>>>     set1 isA GeneSet
>>>>>     set1 hasMember g1, g2, g3
>>     in that we are using the property values on an instance to
>>     represent the set. But the point I wanted to make was that a gene
>>     set isn't some arbitrary set. It is a choice, chosen for a
>>     reason/purpose, and that the ontology should explicitly represent
>>     those reasons/purposes.
>>     If there are defined kinds of follow up, then he could define
>>     an instance to represent that process too.
>>     Finally, I wanted to make the technical point that he
>>     doesn't need to use constructs of the form:
>>>>>     set1 derivesFromUnionOf set2, set3
>>     OWL provides the ability to say these things, even when the "set"
>>     is the property values of an instance, for example, given
>>     Individual(c1 type(Computation)
>>        value(geneComputedAsExpressed g1)
>>     )
>>     Individual(c2 type(Computation)
>>        value(geneComputedAsExpressed g2)
>>        value(geneComputedAsExpressed g3)
>>     )
>>     supposing that he wanted to represent a followup list to be
>>     verified by RT PCR represented by the class RTPCRFollowup.
>>     Let's say that he wanted to call the property geneToFollowUp, with
>>     inverse geneFollowedUpIn
>>     Individual(RTPCRFollowup1  type(RTPCRFollowup))
>>     EquivalentClasses(
>>       unionOf(
>>         restriction(GeneExpressedAccordingTo hasValue(c1))
>>         restriction(GeneExpressedAccordingTo hasValue(c2)))
>>       restriction(geneFollowedUpIn hasValue(RTPCRFollowup1)))
>>     Now a reasoner, e.g. Pellet, will conclude that the values of the
>>     property geneToFollowUp of instance RTPCRFollowup1 are exactly
>>     g1, g2, g3.
>>     Of course that's not the only way to do it, but it does show that
>>     OWL reasoning can make it economical to represent and work with
>>     sets without having to go off and recapitulate set theory.
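
The conclusion above can be checked by hand with ordinary set operations. The Python below is only a closed-world imitation of what the axiom entails for the asserted data, not OWL reasoning and not how Pellet works internally; the function names are hypothetical.

```python
from typing import Dict, Set

# geneComputedAsExpressed assertions from the c1 / c2 example
gene_computed_as_expressed: Dict[str, Set[str]] = {
    "c1": {"g1"},
    "c2": {"g2", "g3"},
}


def genes_expressed_according_to(computation: str) -> Set[str]:
    """Extension of restriction(GeneExpressedAccordingTo hasValue(computation)),
    read off the inverse property geneComputedAsExpressed."""
    return set(gene_computed_as_expressed.get(computation, set()))


def follow_up_genes(*computations: str) -> Set[str]:
    """Union of the per-computation restrictions (the unionOf in the axiom)."""
    result: Set[str] = set()
    for c in computations:
        result |= genes_expressed_according_to(c)
    return result


# The EquivalentClasses axiom equates that union with the genes whose
# geneFollowedUpIn value is RTPCRFollowup1, i.e. its geneToFollowUp values.
gene_to_follow_up = {"RTPCRFollowup1": follow_up_genes("c1", "c2")}
```

Note that the follow-up list is never enumerated by hand: adding a gene to c1 or c2 changes it automatically, which is the economy the message is pointing at.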
>>     -Alan
>>     On Sep 8, 2006, at 7:41 PM, Miller, Michael D (Rosetta) wrote:
>>>     Hi Alan,
>>>     What you are describing is captured in MAGE-OM/MAGE-ML, a UML
>>>     model of the real world aspects of running a microarray
>>>     experiment.
>>>     Typically at the end of this process a set of genes is identified
>>>     as being interesting for some reason and one wants to know more
>>>     about this set of genes beyond the microarray experiment that has
>>>     been performed.
>>>     I might be wrong but I think that is where Marco is starting, at
>>>     the end of the experiment for follow-up.
>>>     cheers,
>>>     Michael
>>>>     -----Original Message-----
>>>>     From: public-semweb-lifesci-request@w3.org
>>>>     [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of
>>>>     Alan Ruttenberg
>>>>     Sent: Friday, September 08, 2006 3:07 PM
>>>>     To: Marco Brandizi
>>>>     Cc: semantic-web@w3.org; public-semweb-lifesci@w3.org
>>>>     Subject: Re: Playing with sets in OWL...
>>>>     Hi Marco,
>>>>     There are a number of ways to work with sets, but I don't think I'd
>>>>     approach this problem from that point of view.
>>>>     Rather, I would start by thinking about what my domain instances
>>>>     are, what their properties are, and what kinds of questions I
>>>>     want to be able to ask based on the representation. I'll sketch
>>>>     this out a bit, though the fact that I name an object or property
>>>>     doesn't mean that you have to supply it (remember OWL is
>>>>     open-world). Still, listing these makes your intentions clearer
>>>>     and the ontology easier for others to work with.
>>>>     The heading in each of these is a class, of which you would make
>>>>     one or more instances to represent your results.
>>>>     The indented names are properties on instances of that class.
>>>>     An expression technology:
>>>>         Vendor:
>>>>         Product: e.g. array name
>>>>         Name of spots on the array
>>>>         Mappings:  (maps of spot to gene - you might use e.g.
>>>>         Affymetrix, or you might compute your own)
>>>>     ExpressionTechnologyMap
>>>>        SpotMapping: (each value a spot mapping)
>>>>     Spot mapping:
>>>>        SpotID:
>>>>        GeneID:
>>>>     An expression profile experiment (call yours exp0)
>>>>         When done:
>>>>         Who did it:
>>>>         What technology was used: (an expression technology)
>>>>         Sample: (a sample)
>>>>         Treatment: ...
>>>>         Levels: A bunch of pairs of spot name, intensity
>>>>     Spot intensity
>>>>        SpotID:
>>>>        Intensity:
>>>>     A  computation of which spots/genes are "expressed" (call yours c1)
>>>>         Name of the method : e.g. mas5 above threshold
>>>>         Parameter of the method: e.g. the threshold
>>>>         Experiment: exp0
>>>>         Spot Expressed: spots that were over threshold
>>>>         Gene Computed As Expressed: genes that were over threshold
>>>>     And maybe:
>>>>     Conclusion
>>>>         What was concluded:
>>>>         By who:
>>>>         Based on: c1
>>>>     All of what you enter for your experiment are instances (so
>>>>     there are no issues of OWL Full).
>>>>     Now, the gene set you wanted can be expressed as a class:
>>>>     Let's define an inverse property of "GeneComputedAsExpressed",
>>>>     call it "GeneExpressedAccordingTo"
>>>>     Class(Set1 partial restriction(GeneExpressedAccordingTo hasValue(c1)))
>>>>     Instances of Set1 will be those genes. You may or may not want to
>>>>     actually define this class. However I don't think that you need
>>>>     to add any properties to it. Everything you would want to say
>>>>     probably wants to be said on one of the instances - the experiment,
>>>>     the computation, the conclusion, etc.
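
A small sketch of how Set1 falls out of the instance data instead of being listed by hand. One caveat on the syntax: in the OWL abstract syntax `partial` states only a necessary condition, while `complete` states a necessary and sufficient one; the Python below takes the `complete` reading, so that every gene with the right property value counts as a member. The helper names and the second computation c2 with gene g4 are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("c1", "GeneComputedAsExpressed", "g1"),
    ("c1", "GeneComputedAsExpressed", "g2"),
    ("c1", "GeneComputedAsExpressed", "g3"),
    ("c2", "GeneComputedAsExpressed", "g4"),  # hypothetical second computation
]


def with_inverse(data: List[Triple], prop: str, inverse: str) -> List[Triple]:
    """Add (o, inverse, s) for each (s, prop, o), as an inverse property entails."""
    return list(data) + [(o, inverse, s) for s, p, o in data if p == prop]


def has_value(data: List[Triple], prop: str, value: str) -> Set[str]:
    """Individuals with `value` for `prop`: the extension of a hasValue restriction."""
    return {s for s, p, o in data if p == prop and o == value}


closed = with_inverse(triples, "GeneComputedAsExpressed", "GeneExpressedAccordingTo")
set1 = has_value(closed, "GeneExpressedAccordingTo", "c1")
```

Properties such as author or method then stay on c1 (an individual), which is what keeps the set itself free of the class-versus-individual problem.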
>>>>     Let me know if this helps/hurts - glad to discuss this some more
>>>>     -Alan
>>>>     On Sep 8, 2006, at 11:58 AM, Marco Brandizi wrote:
>>>>>     Hi all,
>>>>>     sorry for the possible triviality of my questions, or the
>>>>>     messed-up mind I am possibly showing...
>>>>>     I am trying to model the grouping of individuals into sets. In my
>>>>>     application domain, gene expression, people put together, let's
>>>>>     say, genes, associating a meaning with the sets.
>>>>>     For instance:
>>>>>     Set1 := { gene1, gene2, gene3 }
>>>>>     is the set of genes that are expressed in experiment0
>>>>>     (genei and exp0 are OWL individuals)
>>>>>     I am understanding that this may be formalized in OWL by:
>>>>>     - declaring Set1 as rdfs:subClassOf Gene
>>>>>     - using oneOf to declare the membership of g1,2,3
>>>>>     (or simpler: (g1 type Set1), (g2 type Set1), etc. )
>>>>>     - using hasValue with expressed and exp0
>>>>>     (right?)
>>>>>     Now, I am trying to build an application which is like a semantic
>>>>>     wiki. Hence users have quite direct contact with the underlying
>>>>>     ontology, and they can write, with a simplified syntax, statements
>>>>>     about a subject they are describing (subject-centric approach).
>>>>>     Committing to the very formal formalism of OWL looks a bit too
>>>>>     much... formal... ;-) and hard to handle with a semantic
>>>>>     wiki-like application.
>>>>>     Another problem is that the set could have properties of its own,
>>>>>     for instance:
>>>>>     Set1 hasAuthor John
>>>>>     meaning that John is defining it. But hasAuthor is typically used
>>>>>     for individuals, and I wouldn't like to fall into OWL Full by
>>>>>     making an OWL reasoner interpret Set1 both as an individual and
>>>>>     as a class.
>>>>>     Aren't there more informal (although less precise) methods to
>>>>>     model sets, or lists of individuals?
>>>>>     An approach could be modeling some sort of set-theory over
>>>>>     individuals:
>>>>>     set1 isA GeneSet
>>>>>     set1 hasMember g1, g2, g3
>>>>>     ...
>>>>>     set1 derivesFromUnionOf set2, set3
>>>>>     ...
>>>>>     But I am not sure it would be a good approach, or if someone else
>>>>>     already tried that.
>>>>>     Any suggestion?
>>>>>     Thanks in advance for a reply.
>>>>>     Cheers.
>>>>>     -- 
>>>>>     ======================================================================
>>>>>     Marco Brandizi <brandizi@ebi.ac.uk>
>>>>>     http://gca.btbs.unimib.it/brandizi
>     Bill Bug
>     Senior Research Analyst/Ontological Engineer
>     Laboratory for Bioimaging  & Anatomical Informatics
>     www.neuroterrain.org
>     Department of Neurobiology & Anatomy
>     Drexel University College of Medicine
>     2900 Queen Lane
>     Philadelphia, PA    19129
>     215 991 8430 (ph)
>     610 457 0443 (mobile)
>     215 843 9367 (fax)
>     Please Note: I now have a new email - William.Bug@DrexelMed.edu
