Re: URL +1, LSID -1

Alan Ruttenberg wrote:
>> Some resources are quite simple and straightforward to understand, 
>> e.g. represents a 
>> specific amino acid sequence,
> The instances are sequences of letters? Qualities of a class of 
> molecules? The molecules themselves?

I guess you could say it represents the class of molecules with the 
specific sequence of amino acids. Of course the next complication is what 
is an amino acid: for example, selenocysteine is often treated as cysteine.

>> The resources in the namespace are a 
>> bit more complicated, basically it's annotation for a sequence in an 
>> organism:

> Are there sequences in organisms? Or are there polypeptides? Which do 
> the records represent? If the proteins, then in all states - unfolded, 
> folded, misfolded, phosphorylated, glycosylated etc?

The descriptions of phosphorylations and glycosylations and whatnot are 
all associated with the same protein resource. They don't have any stable 
identifiers of their own, at the moment, though this could change easily.

Information about different folding states etc is mostly in free text form, 
e.g. for O15354 you can read that "When overexpressed in cells, tends to 
become insoluble and unfolded. Accumulation of the unfolded protein may 
lead to dopaminergic neuronal death in juvenile Parkinson disease (PDJ)." 
The general trend is to move from free text to explicit, structured 
representations of the data, but this isn't always easy or practical.

> Does the set of sequences/proteins include common (in the organism's 
> population) non-function-changing mutants?

Yes, e.g. in, check 
the "amino acid modifications", "natural variations" and the "experimental 
info" subsections. For some of the variants we have stable identifiers, 
e.g. (PURL doesn't work yet).

>> (Human)
>> (same sequence, but Dog)
> What is the same about them?

Well, the sequence (or whatever you want to call it)...
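
To make "the same sequence" a bit more concrete, here is a minimal sketch 
of how two organism-specific records could share one sequence resource. 
Everything in it is an assumption for illustration only: the class names, 
the record identifiers and the residue string are made up and are not real 
UniProt accessions or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Sequence:
    """A sequence resource, identified purely by its amino acid string."""
    residues: str


@dataclass
class ProteinRecord:
    """Annotation for a sequence in one organism (hypothetical shape)."""
    record_id: str      # made-up identifier, not a real accession
    organism: str
    sequence: Sequence
    # Modifications hang off the record and have no identifiers of their own.
    modifications: List[str] = field(default_factory=list)


shared = Sequence("MKTAYIAK")  # toy residues, not a real protein
human = ProteinRecord("EX_HUMAN", "Human", shared, ["phosphorylation"])
dog = ProteinRecord("EX_DOG", "Dog", shared)

# The records differ, yet both point at one and the same sequence resource.
same_sequence = human.sequence == dog.sequence
same_record = human == dog
```

In this sketch the only thing the Human and Dog records have in common is 
the sequence value; organism and annotation stay with each record.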

>> ...but these resources may also include annotation for related 
>> sequences produced e.g. by alternative splicing:
>> (Human, 3 sequences)
>> ...provided the functions of the resulting sequences are not so 
>> different that they warrant resources of their own...
> How different do they have to be?

I'm afraid there is no single, clear-cut set of rules...

> These might seem to be silly questions "everyone knows what they mean", 
> but I don't think so. Would you use these identifiers to uniquely enough 
> identify a protein if your life depended on it? I think that this is the 
> standard that we should be aiming for - after all, people's lives 
> do/will depend on it.

Maybe people's lives depend on us not diverting too much time and energy 
into trying to be more accurate and consistent than is helpful...

> What I'm trying to point out with these questions is that the uniprot 
> records are not trivially interpretable as "concepts", and that it might 
> be better to not even try in the first place. Rather leave them be 
> database records, and separately create an ontology of proteins that 
> might use the records, or aspects of the records in part of the formal 
> definitions of those proteins.

You are assuming there are ideal, abstract definitions of proteins that are 
logically consistent while being practical enough for everyone to use, and 
moreover that you can get people to agree upon... Good luck with that :-)

Received on Wednesday, 11 July 2007 12:37:16 UTC