RE: A precedent suggesting a compromise for the SWHCLS IG Best Practices (ARK) from Miller, Michael D (Rosetta) on 2006-08-10 (www-tag@w3.org from August 2006)

From: Miller, Michael D (Rosetta) <Michael_Miller@Rosettabio.com>
Date: Thu, 10 Aug 2006 15:50:35 -0700
To: "Alan Ruttenberg" <alanruttenberg@gmail.com>, "Mark Wilkinson" <markw@illuminae.com>
cc: public-semweb-lifesci@w3.org, www-tag@w3.org
Message-ID: <E1GBJMR-0000G1-M0@lisa.w3.org>
Hi All,

> I frequently see genes, transcripts, DNA and mRNA and their
> sequences, proteins, protein sequences, transcripts, and peptides
> all confusedly identified by overlapping identifiers. I don't see how
> any identifier scheme, in itself, LSIDs included, currently fixes
> this problem. It is this problem that I personally want to see
> progress on.

You're correct here, but that is the state of the art.  Interestingly
enough, I've found that biologists and investigators are, in general,
not all that bothered by this confusion and seem to make their way
through it regardless.

cheers,
Michael

Michael Miller
Lead Software Developer
Rosetta Biosoftware Business Unit
www.rosettabio.com


> -----Original Message-----
> From: public-semweb-lifesci-request@w3.org 
> [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of 
> Alan Ruttenberg
> Sent: Sunday, July 30, 2006 7:08 PM
> To: Mark Wilkinson
> Cc: Alan Ruttenberg; public-semweb-lifesci@w3.org; 
> noah_mendelsohn@us.ibm.com; Sean Martin; Henry S. Thompson; 
> Phillip Lord; www-tag@w3.org; Dan Connolly
> Subject: Re: A precedent suggesting a compromise for the 
> SWHCLS IG Best Practices (ARK)
> 
> 
> 
> Excellent response! I 95% heartedly agree (all but the "I stand by
> LSIDs" part :)
> 
> I will note, however, that whenever there are versions of something,
> there tends to be some concept of the thing that they are versions of.
> So even though there are versions of the sequence, there ought to
> still be some thing which represents the thing that all the versions
> are of.
> 
> Back to your point, is there anyone out there who has minted LSIDs  
> for genes and for the sequences distinctly and related them? Do the  
> gene LSIDs ever get versions? Do the sequence LSIDs ever not have  
> versions? When there are different authorities for the genes and  
> sequences, what are the relations that people use to relate them?  
> Let's put these examples on the table.
> 
> If anyone has done this in the context of NCBI databases in
> particular I think it would be helpful to share the specifics of how
> these ids were used and conceptualized.
> 
> My experience has been that there is routine confusion of the sort  
> that you describe throughout the life sciences community and that  
> this bleeds into the discussion of identifiers (as it just did,  
> though I have to admit I was baiting for exactly this discussion :)
> 
> I frequently see genes, transcripts, DNA and mRNA and their
> sequences, proteins, protein sequences, transcripts, and peptides
> all confusedly identified by overlapping identifiers. I don't see how
> any identifier scheme, in itself, LSIDs included, currently fixes
> this problem. It is this problem that I personally want to see
> progress on.
> 
> LSID's contract seems more to do with persistence, mutability,
> cacheability, and discoverability of byte sequences - not around
> issues of the identifiers and their relations making ontological
> sense.
> 
> While I understand that in some contexts the issues around data
> management are central, they aren't in all contexts. Because I think
> that optimization of the data management issues, while in some ways
> elegantly handled by the LSID protocol, isn't central to the issue
> of representation in the life sciences, and because I don't see LSID
> addressing the representation issues, I worry that imposing the use
> of the LSID protocol puts a burden on all, for the benefit of
> relatively few.  And for those relatively few who are going to go out
> of their way to have internal copies of data and the like, I don't
> see why a custom system that circumvents HTTP for efficiency
> reasons is too much of a burden.
> 
> How do you see things otherwise?
> 
> -Alan
> 
> (Being deliberately provocative here - my assigned role in this  
> debate :)
> 
> On Jul 30, 2006, at 9:06 PM, Mark Wilkinson wrote:
> 
> > On Sun, 30 Jul 2006 16:46:21 -0700, Alan Ruttenberg  
> > <alanruttenberg@gmail.com> wrote:
> >
> > I may be speaking out-of-turn here, and should probably let Sean  
> > answer this one since he may have (no doubt) thought-through it  
> > more deeply than I have; however I think you may be mixing up  
> > several different entities here (as so often happens in a URL  
> > world ;-) )
> >
> > In the case you cite above you are likely talking about a "gene",
> > not a "sequence".  A "gene" will have its own LSID, and it is (even
> > by the strict genetic definition) a conceptual entity defined by
> > complementation.  A "gene" and its "sequence" are not the same
> > thing!  So... I don't see a problem.  When you need to refer to the
> > gene in the abstract, you can refer to the gene's LSID.  When you
> > need to talk about a concrete sequence, you refer to *its* LSID.
> > The metadata of the gene will (in a sensible world) include triples
> > that describe its possible sequences, and these will have versions.
> >
> > Genes have many many many properties, so we cannot munge them all  
> > into "sequence".  Certainly, this is how we are modelling our data  
> > locally...
> >
> > I stand by LSID's :-)
> >
> > Mark
> >
> 
> 
> 
>
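
For readers of the archive, the gene/sequence split that Mark describes
can be sketched roughly as follows. This is only an illustration under
stated assumptions, not anyone's actual data model: the authority
"example.org", the namespaces, the object ids and the "hasSequence"
predicate are all made up, and only the general
urn:lsid:authority:namespace:object[:revision] shape comes from the
LSID scheme itself. The unversioned() helper is one possible way to
name "the thing that all the versions are of" that Alan asks about.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Lsid:
    """An LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)

    def unversioned(self) -> "Lsid":
        # The same identifier with the revision dropped: a stand-in for
        # "the thing that all the versions are versions of".
        return Lsid(self.authority, self.namespace, self.object_id)


Triple = Tuple[str, str, str]

# Hypothetical predicate name; no real vocabulary is implied.
HAS_SEQUENCE = "hasSequence"


def gene_metadata(gene: Lsid, sequences: List[Lsid]) -> List[Triple]:
    """Build (subject, predicate, object) triples linking a gene to its sequences."""
    return [(str(gene), HAS_SEQUENCE, str(seq)) for seq in sequences]


# Hypothetical identifiers: the gene is an abstract entity with no revision,
# while each concrete sequence carries one.
gene = Lsid("example.org", "genes", "G0001")
seq_v1 = Lsid("example.org", "sequences", "S0001", "1")
seq_v2 = Lsid("example.org", "sequences", "S0001", "2")

triples = gene_metadata(gene, [seq_v1, seq_v2])
for triple in triples:
    print(triple)
```

Note that nothing in the identifiers themselves says which one is the
gene and which is the sequence; that distinction lives entirely in the
namespaces and triples chosen by whoever mints them, which is the gap
Alan points at.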
Received on Thursday, 10 August 2006 22:50:47 UTC