Re: Playing with sets in OWL... from William Bug on 2006-09-09 (semantic-web@w3.org from September 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Fri, 8 Sep 2006 23:39:15 -0400
To: Alan Ruttenberg <alanruttenberg@gmail.com>
Cc: "Miller, Michael D (Rosetta)" <Michael_Miller@Rosettabio.com>, "Marco Brandizi" <brandizi@ebi.ac.uk>, semantic-web@w3.org, public-semweb-lifesci@w3.org
Message-Id: <E8DA7535-189F-4611-AA06-C655EB1EBADE@DrexelMed.edu>
I think Alan is making a very important general point here.

MAGE-ML/MAGE-OM is perfectly tuned to the needs of:
	a) transferring entire microarray data sets across systems
	b) persisting microarray data sets (at least in certain scenarios)
	c) providing a systematic, normative interface for writing code to  
access specific elements and data collections one typically finds in  
the description of a microarray data set
This is the sort of functionality data models are particularly well
suited to supporting.

MAGE-OM/MAGE-ML is also the result of a huge amount of deliberation  
from dozens of experts in the informatics fields involved in  
generating, storing, and manipulating microarray data.

When it comes to manipulating the information associated with a  
microarray experiment - or collection of experiments - in a  
semantically explicit manner, however, RDF is really the preferred
formalism providing the required explicit semantics, while still  
providing the expressiveness needed to characterize the inherent  
variety, complexity, and granularity in this information.  When it  
comes to filling out the assertions to the point of being able to  
reason on them - even simple reasoning such as consistency checks -  
some dialect of OWL will be the formalism of choice, I believe.

I think Alan gives a very clear example of how to use OWL in this  
particular situation described by Marco.
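
As a rough illustration of the inference in Alan's union example (quoted further down), here is a small Python sketch. It is only a closed-world stand-in for what an OWL reasoner does under open-world semantics. The identifiers c1, c2, g1, g2, g3 come from Alan's message; the dictionary layout and the function name followup_genes are illustrative assumptions.

```python
# Closed-world approximation of the union example: each computation
# individual maps to the genes it asserts via geneComputedAsExpressed.
gene_computed_as_expressed = {
    "c1": {"g1"},
    "c2": {"g2", "g3"},
}


def followup_genes(computations, sources):
    """Union of the genes expressed according to any of the source computations."""
    genes = set()
    for name in sources:
        genes |= computations[name]
    return genes


# Plays the role of the geneToFollowUp values of RTPCRFollowup1.
rtpcr_followup1 = followup_genes(gene_computed_as_expressed, ["c1", "c2"])
print(sorted(rtpcr_followup1))
```

Unlike this sketch, the OWL version never enumerates the union by hand: the class equivalence is stated once and the reasoner derives the membership.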

I have just a few questions in followup:
	1) The MAGE-ML XML Schema provides for a great deal of flexibility  
via the use of optional fields.  Still, any given use in a specific  
lab for a specific collection of microarray experiments is likely to  
develop its own conventions for which fields to use and which not to
use - and how to populate the more "open" elements.  With this in  
mind, it seems it should be possible under those circumstances to  
create an XSLT to translate the individuals contained in a MAGE-ML  
instance according to the elemental OWL classes Alan described -  
Expression_technology, Expression_technology_map, Spot_mapping,  
Expression_profile_experiment, Spot_intensity,  
Gene_expression_computation.  The latter can probably be  
reconstituted from the MAGE-ML elements BioAssay, BioAssayData,  
HigherLevelAnalysis, Measurement, and QuantitationType.  In  
translating the instance data into OWL, it should then be possible to  
perform the sort of higher level sorting and re-analysis Alan  
describes.  The translation should probably take the "open world"  
assumption into account, so the resulting OWL statements will provide  
the intended semantic completeness, even if that isn't represented in  
the MAGE-ML instances themselves.
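
A minimal sketch of the kind of translation meant in point 1, written in Python rather than XSLT purely for brevity. The XML fragment is a hypothetical, heavily simplified stand-in and does not follow the real MAGE-ML schema; the property names (spotID, intensity, levels) are assumptions loosely based on Alan's outline.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy instance. Real MAGE-ML documents are far richer
# and are structured differently.
TOY_INSTANCE = """
<Experiment id="exp0">
  <Spot id="s1" intensity="812.5"/>
  <Spot id="s2" intensity="40.0"/>
</Experiment>
"""


def to_triples(xml_text):
    """Turn the toy instance into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    exp_id = root.get("id")
    triples = [(exp_id, "rdf:type", "Expression_profile_experiment")]
    for index, spot in enumerate(root.findall("Spot")):
        node = f"_:level{index}"  # blank node for one spot/intensity pair
        triples.append((node, "rdf:type", "Spot_intensity"))
        triples.append((node, "spotID", spot.get("id")))
        triples.append((node, "intensity", float(spot.get("intensity"))))
        triples.append((exp_id, "levels", node))
    return triples


for triple in to_triples(TOY_INSTANCE):
    print(triple)
```

Only what the instance actually states is emitted; under the open-world assumption, anything a lab's conventions leave out is simply unknown rather than false.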

	2) I think the use of OWL Alan describes here is going to be  
critical to performing broad field, large scale re-analysis of  
complex data sets such as microarray experiments and various types of  
neuro-images containing segmented geometric objects (in many ways  
equivalent to the segmentation performed on microarray images to  
determine the location and intensity of spots).  The tendency when  
presenting these results in research articles - and often when  
sharing the data - is to provide the analyzed/reduced view of the  
data.  In the context of these complex experiments, many forms of
re-analysis will not be possible without access to the originally
collected data.  Think of how critical BLAST-based meta-analysis was
for GenBank through the 1990s (and still is).  There are several
underlying assertions making it possible to perform such analysis.   
Primary among them is the acceptance that each form of sequencing  
technology provides a reliable way of determining the probability of  
finding a particular nucleotide at a particular location.  Many  
sequences are submitted with the simple assertion that at position N  
in sequence X there is a 100% probability (or 95% confidence, to be  
more specific) of finding nucleotide A|T|G|C.  To some extent, the  
statistical analysis performed by BLAST (and other
position-sensitive, cross-correlative statistical algorithms) relied on these
"ground facts".  For the most part, it was safe to assume this level
of reduced data could be pooled with other such sequence
determinations regardless of the specific sequencing device,
underlying biochemical protocols, and specific lots of reagents
used.  The same cannot generally be safely assumed for
microarray experiments, segmented MRI images - and many other types  
of images such as IHC or in situ based images.  As an example, just  
look to the debates in the last year or two regarding the sometimes  
problematic nature of replicating "gene expression" level results  
with different arrays covering the "same" genes.  If we are to  
support the same sort of meta-analysis as was common with BLAST
across GenBank sequences, then we will often have to supply access to
the low level data elements.  This in fact was a major impetus behind  
providing the MAGE-OM (and FuGE-OM).  As I state at the top of this  
email with points 'a', 'b', & 'c', MAGE-OM/MAGE-ML is extremely  
useful for several critical tasks related to the handling of this  
detailed data.  When it comes to supporting the semantically-grounded  
analytical requirements of such complex, broad field, meta-analysis,  
however, I think OWL (and sometimes RDF alone) is going to prove a  
critical enabling technology.

	3) Re: anonymous classes/individuals of the type Alan describes:
These are essentially "blank nodes" in the RDF sense - "unnamed"
nodes based on a collection of necessary restrictions, if I
understand things correctly.  Please pardon the naive question, but
aren't there some caveats in terms of processing very large RDF
and/or OWL graphs containing "blank" or "anonymous" nodes?  For many OWL
ontologies, this might not be a concern, but if one were to be  
tempted to express a large variety of such sets based on different  
groupings of the sequence probes on a collection of arrays -  
groupings relevant to specific types of analysis - I could see how  
these anonymous entities - especially the anonymous sets of  
individuals - could really proliferate.
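
To put a rough number on that concern: under the standard RDF mapping of OWL, an anonymous unionOf over n anonymous hasValue restrictions costs one blank node for the class description, n for the cells of the RDF list, and n for the restrictions themselves. The sketch below only does that arithmetic; the function name and the scenario figures are made up for illustration.

```python
def blank_nodes_for_union(n_restrictions):
    """Blank nodes for an anonymous unionOf over n anonymous restrictions.

    Assumes the usual RDF serialization: 1 node for the union class,
    n rdf:List cells, and n owl:Restriction nodes.
    """
    return 1 + 2 * n_restrictions


# The union over c1 and c2 in Alan's example.
print(blank_nodes_for_union(2))

# Hypothetical scenario: 1,000 probe groupings, each a union over 50 computations.
print(1000 * blank_nodes_for_union(50))
```

The growth is linear in the number of groupings, so whether it matters depends mostly on how a given store or reasoner handles blank nodes.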

Many thanks for providing this very helpful exemplar, Alan.

Cheers,
Bill


On Sep 8, 2006, at 9:50 PM, Alan Ruttenberg wrote:

>
> Yes. However I don't think I would change anything I wrote. Because
> OWL works in the open world, we can say that all these things
> exist, but only supply the details that we need. But having the
> framework which explains the meaning of what is supplied is one of  
> the points of using ontologies. In this case, if all we know is  
> that there was some computation that led to this gene set we could  
> use some arbitrary name for it (remembering that if we decided to  
> represent it later/ merge it with the experimental run we can use  
> owl:sameAs to merge our name with the actual name).
>
> So, with reference to this ontology (generated by Marco, or
> imported from some standard) he could simply state:
>
> Individual(c1 type(Computation)
>    value(geneComputedAsExpressed g1)
>    value(geneComputedAsExpressed g2)
>    value(geneComputedAsExpressed g3)
>  )
>
> If he wanted to state that the source was an array experiment (but  
> he didn't know the details), he could add to c1
>
>    value(fromExperiment Individual(type(ExpressionProfileExperiment)))
>
> which uses an anonymous individual (blank node) of the appropriate  
> type. Now you know that the data originally came from an expression  
> profile experiment,  though you haven't needed to add any other  
> information other than that.
>
> The pattern that Marco mentions that is closest to this is
>
>>>> set1 isA GeneSet
>>>> set1 hasMember g1, g2, g3
>
> in that we are using the property values on an instance to  
> represent the set. But the point I wanted to make was that a gene  
> set isn't some arbitrary set. It is a choice, chosen for a reason/ 
> purpose, and that the ontology should explicitly represent those  
> reasons/purposes.
>
> If there are defined kinds of follow up, then he could define
> an instance to represent that process too.
>
> Finally, I wanted to make the technical point that he doesn't
> need to use constructs of the form:
>
>>>> set1 derivesFromUnionOf set2, set3
>
> OWL provides the ability to say these things, even when the "set"  
> is the property values of an instance, for example, given
>
> Individual(c1 type(Computation)
>    value(geneComputedAsExpressed g1)
>  )
>
>  Individual(c2 type(Computation)
>    value(geneComputedAsExpressed g2)
>    value(geneComputedAsExpressed g3)
> )
>
> supposing that he wanted to represent a followup list to be  
> verified by RT PCR represented by the class RTPCRFollowup.
> Let's say that he wanted to call the property geneToFollowUp, with
> inverse geneFollowedUpIn
>
> Individual(RTPCRFollowup1  type(RTPCRFollowup))
>
> EquivalentClasses(
>   unionOf(
>     restriction(GeneExpressedAccordingTo hasValue(c1))
>     restriction(GeneExpressedAccordingTo hasValue(c2)))
>   restriction(geneFollowedUpIn hasValue(RTPCRFollowup1)))
>
> Now a reasoner, e.g. Pellet, will conclude that the values of the property
> geneToFollowUp of instance RTPCRFollowup1 are exactly g1, g2, g3
>
> Of course that's not the only way to do it, but it does show that  
> OWL reasoning can make it economical to represent and work with  
> sets without having to go off and recapitulate set theory.
>
> -Alan
>
> On Sep 8, 2006, at 7:41 PM, Miller, Michael D (Rosetta) wrote:
>
>>
>> Hi Alan,
>>
>> What you are describing is described in MAGE-OM/MAGE-ML, as a UML  
>> model
>> to capture the real world aspects of running a microarray experiment.
>>
>> Typically at the end of this process a set of genes is identified as
>> being interesting for some reason and one wants to know more about  
>> this
>> set of genes beyond the microarray experiment that has been  
>> performed.
>>
>> I might be wrong but I think that is where Marco is starting, at  
>> the end
>> of the experiment for follow-up.
>>
>> cheers,
>> Michael
>>
>>> -----Original Message-----
>>> From: public-semweb-lifesci-request@w3.org
>>> [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of
>>> Alan Ruttenberg
>>> Sent: Friday, September 08, 2006 3:07 PM
>>> To: Marco Brandizi
>>> Cc: semantic-web@w3.org; public-semweb-lifesci@w3.org
>>> Subject: Re: Playing with sets in OWL...
>>>
>>>
>>>
>>> Hi Marco,
>>>
>>> There are a number of ways to work with sets, but I don't think I'd
>>> approach this problem from that point of view.
>>> Rather,  I would start by thinking about what my domain instances
>>> are, what their properties are, and what kinds of questions I
>>> want to
>>> be able to ask based on the representation. I'll sketch this out a
>>> bit, though the fact that I name an object or property doesn't mean
>>> that you have to supply it (remember OWL is open-world) - still,
>>> listing these makes your intentions clearer and
>>> the ontology easier for others to work with.
>>>
>>> The heading in each of these is a class, of which you would make one
>>> or more instances to represent your results.
>>> The indented names are properties on instances of that class.
>>>
>>> An expression technology:
>>>     Vendor:
>>>     Product: e.g. array name
>>>     Name of spots on the array
>>>     Mappings:  (maps of spot to gene - you might use e.g.
>>> affymetrix,
>>> or you might compute your own)
>>>
>>> ExpressionTechnologyMap
>>>    SpotMapping: (each value a spot mapping)
>>>
>>> Spot mapping:
>>>    SpotID:
>>>    GeneID:
>>>
>>> An expression profile experiment (call yours exp0)
>>>     When done:
>>>     Who did it:
>>>     What technology was used: (an expression technology)
>>>     Sample: (a sample)
>>>     Treatment: ...
>>>     Levels: A bunch of pairs of spot name, intensity
>>>
>>> Spot intensity
>>>    SpotID:
>>>    Intensity:
>>>
>>> A  computation of which spots/genes are "expressed" (call yours c1)
>>>     Name of the method : e.g. mas5 above threshold
>>>     Parameter of the method: e.g. the threshold
>>>     Experiment: exp0
>>>     Spot Expressed: spots that were over threshold
>>>     Gene Computed As Expressed: genes that were over threshold
>>>
>>> And maybe:
>>>
>>> Conclusion
>>>     What was concluded:
>>>     By who:
>>>     Based on: c1
>>>
>>> All of what you enter for your experiment are instances (so
>>> there are
>>> no issues of OWL Full)
>>>
>>> Now, The gene set you wanted can be expressed as a class:
>>>
>>> Let's define an inverse property of
>>> "GeneComputedAsExpressed", call
>>> it "GeneExpressedAccordingTo"
>>>
>>> Class(Set1 partial restriction(GeneExpressedAccordingTo hasValue(c1)))
>>>
>>> Instances of Set1 will be those genes. You may or may not want to
>>> actually define this class. However I don't think that you need
>>> to add any properties to it. Everything you would want to say
>>> probably wants to be said on one of the instances - the experiment,
>>> the computation, the conclusion, etc.
>>>
>>> Let me know if this helps/hurts - glad to discuss this some more
>>>
>>> -Alan
>>>
>>>
>>>
>>>
>>> 2)
>>>
>>> On Sep 8, 2006, at 11:58 AM, Marco Brandizi wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>> sorry for the possible triviality of my questions, or the
>>> messed-up
>>>> mind
>>>> I am possibly showing...
>>>>
>>>> I am trying to model the grouping of individuals into sets. In my
>>>> application domain, the gene expression, people put
>>> together, let's
>>>> say
>>>> genes, associating a meaning to the sets.
>>>>
>>>> For instance:
>>>>
>>>> Set1 := { gene1, gene2, gene3 }
>>>>
>>>> is the set of genes that are expressed in experiment0
>>>>
>>>> (genei and exp0 are OWL individuals)
>>>>
>>>>
>>>> I am understanding that this may be formalized in OWL by:
>>>>
>>>> - declaring Set1 as owl:subClassOf Gene
>>>> - using oneOf to declare the membership of g1,2,3
>>>> (or simpler: (g1 type Set1), (g2 type Set1), etc. )
>>>> - using hasValue with expressed and exp0
>>>>
>>>> (right?)
>>>>
>>>> Now, I am trying to build an application which is like a semantic
>>>> wiki.
>>>>
>>>> Hence users have a quite direct contact with the underlying
>>>> ontology, and
>>>> they can write, with a simplified syntax, statements about a  
>>>> subject
>>>> they are describing (subject-centric approach).
>>>>
>>>> Committing to the very formal formalism of OWL looks a bit
>>> too much...
>>>> formal... ;-) and hard to be handled with a semantic wiki-like
>>>> application.
>>>>
>>>> Another problem is that the set could have properties on
>>> its own, for
>>>> instance:
>>>>
>>>> Set1 hasAuthor John
>>>>
>>>> meaning that John is defining it. But hasAuthor is
>>> typically used for
>>>> individuals, and I wouldn't like to fall in OWL-Full, by
>>> making an OWL
>>>> reasoner to interpret Set1 both as an individual and a class.
>>>>
>>>> Aren't there more informal (although less precise) methods to model
>>>> sets, or list of individuals?
>>>>
>>>> An approach could be modeling some sort of set-theory over
>>>> individuals:
>>>>
>>>> set1 isA GeneSet
>>>> set1 hasMember g1, g2, g3
>>>> ...
>>>>
>>>> set1 derivesFromUnionOf set2, set3
>>>>
>>>> ...
>>>>
>>>> But I am not sure it would be a good approach, or if someone else
>>>> already tried that.
>>>>
>>>> Any suggestion?
>>>>
>>>>
>>>> Thanks in advance for a reply.
>>>>
>>>> Cheers.
>>>>
>>>> -- 
>>>>
>>>>
>>>> ===============================================================================
>>>> Marco Brandizi <brandizi@ebi.ac.uk>
>>>> http://gca.btbs.unimib.it/brandizi
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>

Bill Bug
Senior Research Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Saturday, 9 September 2006 03:39:45 UTC