Re: A precedent suggesting a compromise for the SWHCLS IG Best Practices (ARK) from Alan Ruttenberg on 2006-07-31 (www-tag@w3.org from July 2006)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Sun, 30 Jul 2006 22:07:30 -0400
To: "Mark Wilkinson" <markw@illuminae.com>
Cc: Alan Ruttenberg <alanruttenberg@gmail.com>, public-semweb-lifesci@w3.org, noah_mendelsohn@us.ibm.com, "Sean Martin" <sjmm@us.ibm.com>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, "Phillip Lord" <phillip.lord@newcastle.ac.uk>, www-tag@w3.org, "Dan Connolly" <connolly@w3.org>
Message-Id: <3CC0BB02-E115-489F-B609-5CAE51BD2860@gmail.com>

Excellent response! I 95% heartedly agree (all but the "I stand by
LSID's" part :)

I will note, however, that whenever there are versions of something,
there tends to be some concept of the thing that they are versions of.
So even though there are versions of the sequence, there ought
still to be something that represents what all the versions
are versions of.
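
To make that concrete, here is a minimal sketch, assuming the usual
urn:lsid:authority:namespace:object[:revision] layout. The authority
example.org, the "sequence" namespace and the object id ABC123 are
made up for illustration, and the Lsid class is hypothetical, not taken
from any LSID library. It only shows that two versioned ids can share
one unversioned id that stands for the thing they are versions of.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lsid:
    """Hypothetical holder for the parts of an LSID (illustration only)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "Lsid":
        # Assumed layout: urn:lsid:authority:namespace:object[:revision]
        parts = text.split(":")
        if (len(parts) not in (5, 6)
                or parts[0].lower() != "urn"
                or parts[1].lower() != "lsid"):
            raise ValueError(f"not an LSID: {text!r}")
        revision = parts[5] if len(parts) == 6 else None
        return cls(parts[2], parts[3], parts[4], revision)

    def unversioned(self) -> "Lsid":
        # The id for "the thing that all the versions are versions of".
        return Lsid(self.authority, self.namespace, self.object_id)

    def __str__(self) -> str:
        base = f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"
        return base if self.revision is None else f"{base}:{self.revision}"


# Two made-up versions of the same made-up sequence record.
v1 = Lsid.parse("urn:lsid:example.org:sequence:ABC123:1")
v2 = Lsid.parse("urn:lsid:example.org:sequence:ABC123:2")
parent = v1.unversioned()
```

Here v1 and v2 are distinct identifiers, but both reduce to the same
parent id. Whether that parent id should name the sequence in the
abstract, the gene, or something else is exactly the modelling
question that the identifier syntax alone does not answer.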

Back to your point, is there anyone out there who has minted LSIDs  
for genes and for the sequences distinctly and related them? Do the  
gene LSIDs ever get versions? Do the sequence LSIDs ever not have  
versions? When there are different authorities for the genes and  
sequences, what are the relations that people use to relate them?  
Let's put these examples on the table.

If anyone has done this in the context of NCBI databases in
particular, I think it would be helpful to share the specifics of how
these ids were used and conceptualized.

My experience has been that there is routine confusion of the sort  
that you describe throughout the life sciences community and that  
this bleeds into the discussion of identifiers (as it just did,  
though I have to admit I was baiting for exactly this discussion :)

I frequently see genes, transcripts, DNA and mRNA and their
sequences, proteins, protein sequences, and peptides
all confusedly identified by overlapping identifiers. I don't see how
any identifier scheme, in itself, LSIDs included, currently fixes
this problem. It is this problem that I personally want to see
progress on.

LSID's contract seems to have more to do with persistence, mutability,
cacheability, and discoverability of byte sequences, and less to do
with whether the identifiers and their relations make ontological sense.

While I understand that in some contexts the issues around data
management are central, they aren't in all contexts. Because I think
that optimization of the data management issues, while in some ways
elegantly handled by the LSID protocol, isn't central to the issue
of representation in the life sciences, and because I don't see LSID
addressing the representation issues, I worry that imposing the use
of the LSID protocol puts a burden on all, for the benefit of
relatively few. And for those relatively few who are going to go out
of their way to have internal copies of data and the like, I don't
see why a custom system that circumvents http for efficiency
reasons is too much of a burden.

How do you see things otherwise?

-Alan

(Being deliberately provocative here - my assigned role in this  
debate :)

On Jul 30, 2006, at 9:06 PM, Mark Wilkinson wrote:

> On Sun, 30 Jul 2006 16:46:21 -0700, Alan Ruttenberg  
> <alanruttenberg@gmail.com> wrote:
>
> I may be speaking out-of-turn here, and should probably let Sean  
> answer this one since he may have (no doubt) thought-through it  
> more deeply than I have; however I think you may be mixing up  
> several different entities here (as so often happens in a URL  
> world ;-) )
>
> In the case you cite above you are likely talking about a "gene",  
> not a "sequence".  A "gene" will have its own LSID, and it is (even  
> by the strict genetic definition) a conceptual entity defined by  
> complementation.  A "gene" and its "sequence" are not the same  
> thing!  So... I don't see a problem.  When you need to refer to the  
> gene in the abstract, you can refer to the gene's LSID.  When you  
> need to talk about a concrete sequence, you refer to *its* LSID.  
> The metadata of the gene will (in a sensible world) include triples  
> that describe its possible sequences, and these will have versions.
>
> Genes have many many many properties, so we cannot munge them all  
> into "sequence".  Certainly, this is how we are modelling our data  
> locally...
>
> I stand by LSID's :-)
>
> Mark
>

Received on Monday, 31 July 2006 02:07:51 UTC