RE: Playing with sets in OWL... from Miller, Michael D (Rosetta) on 2006-09-09 (semantic-web@w3.org from September 2006)

From: Miller, Michael D (Rosetta) <Michael_Miller@Rosettabio.com>
Date: Sat, 9 Sep 2006 13:52:16 -0700
To: "William Bug" <William.Bug@DrexelMed.edu>, "Alan Ruttenberg" <alanruttenberg@gmail.com>
cc: "Marco Brandizi" <brandizi@ebi.ac.uk>, semantic-web@w3.org, public-semweb-lifesci@w3.org
Message-ID: <E1GM9oX-00012M-Fy@aji.w3.org>
Hi Bill and Alan,
 
You misunderstand my use case.
 
My researcher doesn't much care that the world knows about his/her
microarray experiment yet--in fact he/she may very well be searching for
interesting information about the gene set to see whether it is worth
going further or whether the experiment was just retreading old ground
or whatever.  There's this new tool, the semantic web, so the researcher
is going to submit this set of genes and hopefully get useful
information on them as a set.
 
Now this researcher probably makes the assumption that as long as the
naming source of the genes is indicated, no further work is required.
This naming source may very well be GenBank, which, of course, isn't
likely to be set up for easy access by pure semantic web tools for many
years, if ever, in the way many people on this list would like.  But it
had better be supported by the semantic web because, for all its faults
and all the faults of the current sequence databases, if the semantic
web can't garner information from them I don't see much hope for
adoption by the common researcher.
 
So perhaps the researcher gets back that the genes are part of a
particular pathway, that a few papers in PubMed mention them, and that
some microarray experiments in public repositories had them
significantly up or down regulated or affected by some drug.  If the
conclusions for this experiment still appear to be worthy, then
hopefully the experiment will get annotated (with these semantic web
results as well), be made part of a submittal, and be deposited in a
public database, where it becomes accessible to others searching the
semantic web.
 
"In translating the instance data into OWL, it should then be possible
to perform the sort of higher level sorting and re-analysis Alan
describes."
 
Although this wasn't the use case I was talking about in this thread, it
is obviously a very interesting use case also.  I believe in an earlier
e-mail I talked about something similar, but one will be unlikely to
find out, in general, about the individuals in the experiment (outside
of genes), because they will be truly unique instances of things like
samples, hybridization and feature extraction and data but, if the
researcher annotates these individuals from rich ontologies or from not
so great sources that tools are developed to compensate for, then I
agree entirely with you and Alan that much useful reasoning can be done
on the semantic web.
 
"The tendency when presenting these results in research articles - and
often when sharing the data - is to provide the analyzed/reduced view of
the data"
 
Actually, I just heard Leroy Hood of ISB and other fame and Eric Schadt
of Rosetta Inpharmatics (our parent company) give excellent talks at
the MGED9 meeting where their research is opening up and bringing in
information and data from a vast number of resources and tying it
together into big pictures, all without the semantic web.  I'm sure they
would love to have the kind of power envisioned by the W3C for the
semantic web but they won't touch it until it is easy--they are busy
doing their core jobs.
 
So I really think that we need to:
 
1) make sure the semantic web allows people to poke at it, i.e. ask the
question "is there anything interesting about a particular object?"
without having to say why they are interested
 
2) provide tools so that they can annotate their objects well so that
when they are submitted they can be incorporated into the web (moving
forward, this is one aim of the MAGEstk for gene expression experiments)
 
3) make sure that existing imperfect resources have semantic web tools
that can overcome those imperfections and get the same usefulness from
them that people are currently getting
 
4) most importantly get a useful semantic web out there now, there's
plenty of information available, then make it better as time goes along
 
The resources that are already set up for easy integration into the
semantic web will come along for free.
 
cheers,
Michael
 

	-----Original Message-----
	From: William Bug [mailto:William.Bug@DrexelMed.edu] 
	Sent: Friday, September 08, 2006 8:39 PM
	To: Alan Ruttenberg
	Cc: Miller, Michael D (Rosetta); Marco Brandizi;
semantic-web@w3.org; public-semweb-lifesci@w3.org
	Subject: Re: Playing with sets in OWL...
	
	
	I think Alan is making a very important general point here. 

	MAGE-ML/MAGE-OM is perfectly tuned to the needs of:
	a) transferring entire microarray data sets across systems
	b) persisting microarray data sets (at least in certain
scenarios)
	c) providing a systematic, normative interface for writing code
to access specific elements and data collections one typically finds in
the description of a microarray data set
	This is the sort of functionality data models are particularly
well suited to supporting.  
	
	
	MAGE-OM/MAGE-ML is also the result of a huge amount of
deliberation from dozens of experts in the informatics fields involved
in generating, storing, and manipulating microarray data.

	When it comes to manipulating the information associated with a
microarray experiment - or collection of experiments - in a semantically
explicit manner, however, RDF is really the preferred formalism
providing the required explicit semantics, while still providing the
expressiveness needed to characterize the inherent variety, complexity,
and granularity in this information.  When it comes to filling out the
assertions to the point of being able to reason on them - even simple
reasoning such as consistency checks - some dialect of OWL will be the
formalism of choice, I believe.

	I think Alan gives a very clear example of how to use OWL in
this particular situation described by Marco.
	
	
	I have just a few questions in followup:
	1) The MAGE-ML XML Schema provides for a great deal of
flexibility via the use of optional fields.  Still, any given use in a
specific lab for a specific collection of microarray experiments is
likely to develop its own conventions for which fields to use and which
not to use - and how to populate the more "open" elements.  With this in
mind, it seems it should be possible under those circumstances to create
an XSLT to translate the individuals contained in a MAGE-ML instance
according to the elemental OWL classes Alan described -
Expression_technology, Expression_technology_map, Spot_mapping,
Expression_profile_experiment, Spot_intensity,
Gene_expression_computation.  The latter can probably be reconstituted
from the MAGE-ML elements BioAssay, BioAssayData, HigherLevelAnalysis,
Measurement, and QuantitationType.  In translating the instance data
into OWL, it should then be possible to perform the sort of higher level
sorting and re-analysis Alan describes.  The translation should probably
take the "open world" assumption into account, so the resulting OWL
statements will provide the intended semantic completeness, even if that
isn't represented in the MAGE-ML instances themselves.

	2) I think the use of OWL Alan describes here is going to be
critical to performing broad field, large scale re-analysis of complex
data sets such as microarray experiments and various types of
neuro-images containing segmented geometric objects (in many ways
equivalent to the segmentation performed on microarray images to
determine the location and intensity of spots).  The tendency when
presenting these results in research articles - and often when sharing
the data - is to provide the analyzed/reduced view of the data.  In the
context of these complex experiments, many forms of re-analysis will not
be possible without access to the originally collected data.  Think of
how critical BLAST-based meta-analysis was for GenBank through the
1990s (and still is).  There are several underlying assertions making it
possible to perform such analysis.  Primary among them is the acceptance
that each form of sequencing technology provides a reliable way of
determining the probability of finding a particular nucleotide at a
particular location.  Many sequences are submitted with the simple
assertion that at position N in sequence X there is a 100% probability
(or 95% confidence, to be more specific) of finding nucleotide A|T|G|C.
To some extent, the statistical analysis performed by BLAST (and other
position-sensitive, cross-correlative statistical algorithms) relied on
these "ground facts".  For the most part, it was safe to assume this
level of reduced data could be safely pooled with other such sequence
determinations regardless of the specific sequencing device, underlying
biochemical protocols, and specific lots of reagents used.  These same
assumptions cannot generally be safely made for microarray
experiments, segmented MRI images - and many other types of images such
as IHC or in situ based images.  As an example, just look to the debates
in the last year or two regarding the sometimes problematic nature of
replicating "gene expression" level results with different arrays
covering the "same" genes.  If we are to support the same sort of meta
analysis as was common with BLAST across GenBank sequences, then we will
often have to supply access to the low level data elements.  This in
fact was a major impetus behind providing the MAGE-OM (and FuGE-OM).  As
I state at the top of this email with points 'a', 'b', & 'c',
MAGE-OM/MAGE-ML is extremely useful for several critical tasks related
to the handling of this detailed data.  When it comes to supporting the
semantically-grounded analytical requirements of such complex, broad
field, meta-analysis, however, I think OWL (and sometimes RDF alone) is
going to prove a critical enabling technology.

	3) Re:anonymous classes/individuals of the type Alan describes:
These are essentially "blank nodes" in the RDF sense - "unnamed" nodes
based on a collection of necessary restrictions, if I understand things
correctly.  Please pardon the naive question, but aren't there some
caveats in terms of processing very large RDF and/or OWL graphs
containing "blank" or "anonymous" nodes?  For many OWL ontologies, this
might not be a concern, but if one were to be tempted to express a large
variety of such sets based on different groupings of the sequence probes
on a collection of arrays - groupings relevant to specific types of
analysis - I could see how these anonymous entities - especially the
anonymous sets of individuals - could really proliferate.
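
On the translation suggested in point 1 above: the following is only a rough sketch of the idea, written in Python rather than XSLT. The input is a made-up, heavily simplified fragment and is not valid MAGE-ML; the element names, the generated individual names, and the property names (levels, spotID, intensity, loosely borrowed from Alan's outline) are all assumptions chosen for illustration. A real translation would need a lab's actual conventions for which MAGE-ML elements are populated.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified input. This is NOT valid MAGE-ML; real
# documents are much more deeply nested and use different element names.
SAMPLE_XML = """
<Experiment identifier="exp0">
  <Spot identifier="spot1" intensity="812.5"/>
  <Spot identifier="spot2" intensity="40.0"/>
</Experiment>
"""


def to_triples(xml_text):
    """Translate the toy XML into (subject, property, object) triples."""
    root = ET.fromstring(xml_text)
    exp_id = root.get("identifier")
    triples = [(exp_id, "rdf:type", "Expression_profile_experiment")]
    for index, spot in enumerate(root.iter("Spot"), start=1):
        # One Spot_intensity individual per measured spot; the generated
        # name is a local convention, not something the source supplies.
        level = f"{exp_id}_level{index}"
        triples.append((level, "rdf:type", "Spot_intensity"))
        triples.append((exp_id, "levels", level))
        triples.append((level, "spotID", spot.get("identifier")))
        triples.append((level, "intensity", float(spot.get("intensity"))))
    return triples


triples = to_triples(SAMPLE_XML)
for triple in triples:
    print(triple)
```

Anything the input leaves out simply produces no triple, which fits the open-world reading: absence of a statement is not a statement of absence.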

	Many thanks for providing this very helpful exemplar, Alan.

	Cheers,
	Bill

	
	
	On Sep 8, 2006, at 9:50 PM, Alan Ruttenberg wrote:



		Yes. However I don't think I would change anything I
wrote. Because OWL works in the open world, we can say that all these
things exist, but only supply the details that we need. But having the
framework which explains the meaning of what is supplied is one of the
points of using ontologies. In this case, if all we know is that there
was some computation that led to this gene set we could use some
arbitrary name for it (remembering that if we decided to represent it
later/ merge it with the experimental run we can use owl:sameAs to merge
our name with the actual name).
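
As a rough illustration of the owl:sameAs remark, here is a plain-Python sketch (no reasoner involved) that treats sameAs pairs as "these two names denote one thing" and rewrites a set of triples to use one representative name per group. The names myComputation and run42analysis and the helper functions are hypothetical, and picking the alphabetically smaller name as representative is an arbitrary choice made here.

```python
def canonical_names(same_as_pairs):
    """Map each name to one representative of its owl:sameAs group (union-find)."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Arbitrary convention: the alphabetically smaller name wins.
            parent[max(root_left, root_right)] = min(root_left, root_right)
    return {name: find(name) for name in list(parent)}


def merge(triples, same_as_pairs):
    """Rewrite every triple so that merged names share one representative."""
    names = canonical_names(same_as_pairs)
    return {tuple(names.get(term, term) for term in triple) for triple in triples}


# An arbitrary name used first, and the "actual" name learned later.
triples = {
    ("myComputation", "geneComputedAsExpressed", "g1"),
    ("run42analysis", "fromExperiment", "exp0"),
}
merged = merge(triples, [("myComputation", "run42analysis")])
print(sorted(merged))
```

After the merge, both statements are about the same individual, which is the practical effect described above.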

		So, with reference to this ontology (generated by Marco,
or imported from some standard) he could simply state:

		Individual(c1 type(Computation)
		   value(geneComputedAsExpressed g1)
		   value(geneComputedAsExpressed g2)
		   value(geneComputedAsExpressed g3)
		 )

		If he wanted to state that the source was an array
experiment (but he didn't know the details), he could add to c1

		   value(fromExperiment Individual(type(ExpressionProfileExperiment)))

		which uses an anonymous individual (blank node) of the
appropriate type. Now you know that the data originally came from an
expression profile experiment,  though you haven't needed to add any
other information other than that.
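
To make the anonymous individual a little more concrete, here is a small plain-Python sketch. It assumes triples are stored as tuples and that a blank node is just a locally generated label of the form _:bN; the helper names are invented for this example and do not come from any OWL toolkit.

```python
import itertools

_blank_ids = itertools.count(1)


def new_blank_node():
    """Return a fresh label such as '_:b1'; it means nothing outside this graph."""
    return f"_:b{next(_blank_ids)}"


graph = {
    ("c1", "rdf:type", "Computation"),
    ("c1", "geneComputedAsExpressed", "g1"),
    ("c1", "geneComputedAsExpressed", "g2"),
    ("c1", "geneComputedAsExpressed", "g3"),
}

# "c1 came from some expression profile experiment", without naming it.
experiment = new_blank_node()
graph |= {
    ("c1", "fromExperiment", experiment),
    (experiment, "rdf:type", "ExpressionProfileExperiment"),
}


def source_types(triples, computation):
    """Types of whatever the computation came from, named or anonymous."""
    sources = {o for (s, p, o) in triples if s == computation and p == "fromExperiment"}
    return {o for (s, p, o) in triples if s in sources and p == "rdf:type"}


print(source_types(graph, "c1"))
```

The query never needs the experiment's name, only its type, which is all that was asserted.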

		The pattern that Marco mentions that is closest to this
is


				set1 isA GeneSet
				set1 hasMember g1, g2, g3


		in that we are using the property values on an instance
to represent the set. But the point I wanted to make was that a gene set
isn't some arbitrary set. It is a choice, chosen for a reason/purpose,
and that the ontology should explicitly represent those
reasons/purposes.

		If there are defined kinds of follow up, then he could
define an instance to represent that process too.

		Finally, I wanted to make the technical point that
he doesn't need to use constructs of the form:


				set1 derivesFromUnionOf set2, set3


		OWL provides the ability to say these things, even when
the "set" is the property values of an instance, for example, given

		Individual(c1 type(Computation)
		   value(geneComputedAsExpressed g1)
		 )
		
		
		 Individual(c2 type(Computation)
		   value(geneComputedAsExpressed g2)
		   value(geneComputedAsExpressed g3)
		)

		supposing that he wanted to represent a followup list to
be verified by RT PCR represented by the class RTPCRFollowup.
		Let's say that he wanted to call the property
geneToFollowUp, with inverse geneFollowedUpIn

		Individual(RTPCRFollowup1  type(RTPCRFollowup))

		EquivalentClasses(
		  unionOf(
		    restriction(GeneExpressedAccordingTo hasValue(c1))
		    restriction(GeneExpressedAccordingTo hasValue(c2)))
		  restriction(geneFollowedUpIn hasValue(RTPCRFollowup1)))
		
		
		Now a reasoner, e.g. Pellet, will conclude that the values of the
property geneToFollowUp of instance RTPCRFollowup1 are exactly g1, g2, g3.

		Of course that's not the only way to do it, but it does
show that OWL reasoning can make it economical to represent and work
with sets without having to go off and recapitulate set theory.
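
For readers without a reasoner to hand, here is a minimal sketch of what that one inference amounts to, in plain Python. It assumes the inverse-property declarations mentioned in the thread (GeneExpressedAccordingTo as the inverse of geneComputedAsExpressed, and geneToFollowUp as the inverse of geneFollowedUpIn). The function names are made up, and the code hand-rolls this single case; it does not implement OWL semantics in general.

```python
# Asserted facts from the two Individual(...) declarations above,
# stored as (subject, property, object) triples.
ASSERTED = {
    ("c1", "geneComputedAsExpressed", "g1"),
    ("c2", "geneComputedAsExpressed", "g2"),
    ("c2", "geneComputedAsExpressed", "g3"),
}


def has_value_class(triples, prop, value):
    """Members of restriction(prop hasValue(value)): subjects s with (s, prop, value)."""
    return {s for (s, p, o) in triples if p == prop and o == value}


def with_inverse(triples, prop, inverse_prop):
    """Add (o, inverse_prop, s) for each (s, prop, o), as an inverse axiom would."""
    return triples | {(o, inverse_prop, s) for (s, p, o) in triples if p == prop}


def genes_to_follow_up(triples, computations, followup):
    """Hand-rolled reading of the EquivalentClasses axiom, for this case only."""
    facts = with_inverse(triples, "geneComputedAsExpressed", "GeneExpressedAccordingTo")
    union = set()
    for computation in computations:
        union |= has_value_class(facts, "GeneExpressedAccordingTo", computation)
    # Equivalence: every gene in the union is geneFollowedUpIn the followup...
    facts |= {(gene, "geneFollowedUpIn", followup) for gene in union}
    # ...and the inverse property turns that into followup geneToFollowUp gene.
    facts = with_inverse(facts, "geneFollowedUpIn", "geneToFollowUp")
    return {o for (s, p, o) in facts if s == followup and p == "geneToFollowUp"}


followup_genes = genes_to_follow_up(ASSERTED, ["c1", "c2"], "RTPCRFollowup1")
print(sorted(followup_genes))  # ['g1', 'g2', 'g3']
```

The point of the OWL version is that none of this procedural code has to be written: the axiom states the relationship once and the reasoner derives the members.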

		-Alan

		On Sep 8, 2006, at 7:41 PM, Miller, Michael D (Rosetta)
wrote:



			Hi Alan,

			What you are describing is described in
MAGE-OM/MAGE-ML, as a UML model
			to capture the real world aspects of running a
microarray experiment.
			
			
			Typically at the end of this process a set of
genes is identified as
			being interesting for some reason and one wants
to know more about this
			set of genes beyond the microarray experiment
that has been performed.

			I might be wrong but I think that is where Marco
is starting, at the end
			of the experiment for follow-up.

			cheers,
			Michael


				-----Original Message-----
				From:
public-semweb-lifesci-request@w3.org
	
[mailto:public-semweb-lifesci-request@w3.org] On Behalf Of
				Alan Ruttenberg
				Sent: Friday, September 08, 2006 3:07 PM
				To: Marco Brandizi
				Cc: semantic-web@w3.org;
public-semweb-lifesci@w3.org
				Subject: Re: Playing with sets in OWL...



				Hi Marco,

				There are a number of ways to work with
sets, but I don't think I'd
				approach this problem from that point of
view.
				Rather,  I would start by thinking about
what my domain instances
				are, what their properties are, and what
kinds of questions I
				want to
				be able to ask based on the
representation. I'll sketch this out a
				bit, though the fact that I name an
object or property doesn't mean
				that you have to supply it (remember OWL
is open-world) - still
				listing these makes your intentions clearer and
				the ontology easier for others to work with.

				The heading in each of these is a class,
of which you would make one
				or more instances to represent your
results.
				The indented names are properties on
instances of that class.

				An expression technology:
				    Vendor:
				    Product: e.g. array name
				    Name of spots on the array
				    Mappings:  (maps of spot to gene -
you might use e.g.
				affymetrix,
				or you might compute your own)

				ExpressionTechnologyMap
				   SpotMapping: (each value a spot
mapping)

				Spot mapping:
				   SpotID:
				   GeneID:

				An expression profile experiment (call
yours exp0)
				    When done:
				    Who did it:
				    What technology was used: (an
expression technology)
				    Sample: (a sample)
				    Treatment: ...
				    Levels: A bunch of pairs of spot
name, intensity

				Spot intensity
				   SpotID:
				   Intensity:

				A  computation of which spots/genes are
"expressed" (call yours c1)
				    Name of the method : e.g. mas5 above
threshold
				    Parameter of the method: e.g. the
threshold
				    Experiment: exp0
				    Spot Expressed: spots that were over
threshold
				    Gene Computed As Expressed: genes
that were over threshold

				And maybe:

				Conclusion
				    What was concluded:
				    By who:
				    Based on: c1
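
One way to read part of the outline above is as a small data model. The sketch below, in plain Python, covers only the "computation" entry and assumes the simplest possible method (intensity strictly above a threshold). The field names, the sample intensities, and the spot-to-gene map are invented for illustration and are not taken from any real array.

```python
from dataclasses import dataclass, field


@dataclass
class Computation:
    """The 'computation of which spots/genes are expressed' from the outline."""
    name: str
    method: str
    threshold: float
    experiment: str
    spots_expressed: set = field(default_factory=set)
    genes_computed_as_expressed: set = field(default_factory=set)


def compute_expressed(name, experiment, levels, spot_to_gene, threshold):
    """Keep spots whose intensity is over the threshold, then map them to genes."""
    spots = {spot for spot, intensity in levels.items() if intensity > threshold}
    genes = {spot_to_gene[spot] for spot in spots if spot in spot_to_gene}
    return Computation(name, "above threshold", threshold, experiment, spots, genes)


# Illustrative values only: three spots from exp0 and a spot-to-gene map.
levels = {"spot1": 812.5, "spot2": 40.0, "spot3": 300.0}
spot_to_gene = {"spot1": "g1", "spot2": "g2", "spot3": "g3"}

c1 = compute_expressed("c1", "exp0", levels, spot_to_gene, threshold=100.0)
print(sorted(c1.genes_computed_as_expressed))  # ['g1', 'g3']
```

Keeping the method, the threshold, and the source experiment on the computation is what lets a later reader see why a gene ended up in the set.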
				
				
				All of what you enter for your
experiment are instances (so
				there are
				no issues of OWL Full)

				Now, The gene set you wanted can be
expressed as a class:

				Let's define an inverse property of
				"GeneComputedAsExpressed", call
				it "GeneExpressedAccordingTo"

				Class(Set1 partial restriction(GeneExpressedAccordingTo hasValue(c1)))

				Instances of Set1 will be those genes.
You may or may not want to
				actually define this class. However I
don't think that you need
				to add any properties to it. Everything
you would want to say
				probably wants to be said on one of the
instances - the experiment,
				the computation, the conclusion, etc.

				Let me know if this helps/hurts - glad
to discuss this some more

				-Alan





				On Sep 8, 2006, at 11:58 AM, Marco
Brandizi wrote:



				Hi all,

				sorry for the possible triviality of my questions, or the
				messed-up mind I am possibly showing...

				I am trying to model the grouping of individuals into sets. In my
				application domain, gene expression, people put together, let's
				say, genes, associating a meaning with the sets.
				
				
				For instance:

				Set1 := { gene1, gene2, gene3 }

				is the set of genes that are expressed
in experiment0

				(genei and exp0 are OWL individuals)


				I understand that this may be formalized in OWL by:

				- declaring Set1 as rdfs:subClassOf Gene
				- using oneOf to declare the membership of g1, g2, g3
				(or simpler: (g1 type Set1), (g2 type Set1), etc. )
				- using hasValue with expressed and exp0
				(right?)

				Now, I am trying to build an application
which is like a semantic
				wiki.

				Hence users have quite direct contact with the underlying
				ontology, and they can write, with a simplified syntax,
				statements about a subject they are describing
				(subject-centric approach).

				Committing to the very formal formalism of OWL looks a bit
				too much... formal... ;-) and hard to handle with a semantic
				wiki-like application.

				Another problem is that the set could have properties of its
				own, for instance:

				Set1 hasAuthor John

				meaning that John is defining it. But hasAuthor is typically
				used for individuals, and I wouldn't like to fall into OWL Full
				by making an OWL reasoner interpret Set1 as both an individual
				and a class.

				Aren't there more informal (although
less precise) methods to model
				sets, or list of individuals?

				An approach could be modeling some sort
of set-theory over
				individuals:

				set1 isA GeneSet
				set1 hasMember g1, g2, g3
				...

				set1 derivesFromUnionOf set2, set3

				...

				But I am not sure it would be a good
approach, or if someone else
				already tried that.

				Any suggestion?


				Thanks in advance for a reply.

				Cheers.

				-- 



	
				=======================================================================
				Marco Brandizi <brandizi@ebi.ac.uk>
				http://gca.btbs.unimib.it/brandizi












	
	Bill Bug
	Senior Research Analyst/Ontological Engineer

	Laboratory for Bioimaging  & Anatomical Informatics
	www.neuroterrain.org
	Department of Neurobiology & Anatomy
	Drexel University College of Medicine
	2900 Queen Lane
	Philadelphia, PA    19129
	215 991 8430 (ph)
	610 457 0443 (mobile)
	215 843 9367 (fax)


	Please Note: I now have a new email - William.Bug@DrexelMed.edu




	
Received on Saturday, 9 September 2006 20:52:51 UTC