Re: Fw: Use of LSIDs in RDF (fwd) from Sean Martin on 2004-05-04 (public-semweb-lifesci@w3.org from May 2004)

From: Sean Martin <sjmm@us.ibm.com>
Date: Tue, 4 May 2004 08:41:20 -0400
To: SLetovsky@aol.com
Cc: public-semweb-lifesci@w3.org
Message-ID: <OF7725F40B.A857AF2F-ON85256E8A.000872D9-85256E8A.0045B451@us.ibm.com>
Hi Stan,
The LSID standard was created to start solving many of the problems you 
outline. How well LSIDs manage this will depend on how data providers and 
end users implement the spec (and heed its social/technical contracts). 
No doubt much is missing, but my feeling is that it is a reasonable start.

SLetovsky@aol.com wrote on 05/03/2004 08:05:38 PM:

> All,
>  
>     Dumb question for this LSID thread, since there seem to be 
> people on it who understand the goals of LSID well. In my experience
> a critical problem in bioinformatics is a diversity of identifiers 
> for the same thing (typically a gene or gene product sequence). 

In general, the intent is that an LSID wraps an existing identifier in a 
prefix that uniquely identifies/disambiguates each object and says exactly 
where it comes from, e.g. urn:lsid:pdb.org:pdb:1AFT for a protein instead 
of the bare ASCII string 1AFT, or urn:lsid:ncbi.nih.gov:pubmed:12344 
instead of PMID12344. It is hoped that retrofitting LSIDs to most existing 
data sources/identifier schemes will be fairly obvious and easy. It should 
not require any great changes on the part of the data provider, as the 
LSID resolver software is usually layered above whatever data access 
mechanism and database schemas exist for that data today. Software will be 
able to identify and source objects by their unique names. Third party 
applications can use these LSID names to associate additional metadata 
with those objects (i.e. annotate them accurately/unambiguously). If RDF 
(with widely shared vocabularies) is used to do this, the metadata can 
include the semantics between LSIDs as well as literal data.
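
A minimal sketch of the prefixing described above, in Python. It assumes 
the usual colon-separated layout urn:lsid:authority:namespace:object with 
an optional trailing revision field; the names LSID, parse_lsid and 
format_lsid are made up for illustration and do not come from any LSID 
toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Fields of an LSID (hypothetical helper type, not a library class)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID field in: {text!r}")
    # Treat a missing or empty revision field as "no revision".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


def format_lsid(authority: str, namespace: str, object_id: str,
                revision: Optional[str] = None) -> str:
    """Wrap an existing identifier (e.g. '1AFT') in an LSID prefix."""
    fields = ["urn", "lsid", authority, namespace, object_id]
    if revision:
        fields.append(revision)
    return ":".join(fields)


pdb_entry = parse_lsid("urn:lsid:pdb.org:pdb:1AFT")
pubmed_lsid = format_lsid("ncbi.nih.gov", "pubmed", "12344")
```

Here the bare string 1AFT only becomes unambiguous once the authority 
(pdb.org) and namespace (pdb) travel with it, which is the point of the 
prefix.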

> These identifiers
> typically come from different namespaces or from biological 
> nomenclatures. A frequent and time-consuming problem is unifying 
> datasets from different sources which refer to these
> objects using different symbols, necessitating a synonym-aware 
> relational join. This sounds
> simpler than it is in practice; the synonym relations display all 
> manner of cardinality other than the hoped-for one-to-one. 

It is hoped that describing the relationships between objects named by 
LSID with standardized RDF predicates (themselves named by LSIDs or some 
other URI) will define those relationships more accurately and uniformly, 
and in a form that machines can read automatically. Would this start to 
solve the problem you detail? 
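
To make the idea concrete, here is a rough sketch of a synonym-aware join 
driven by such relationship statements. Plain Python tuples stand in for 
RDF triples; the example.org authorities, the MAPS_TO predicate and the 
sample values are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical predicate name; a real vocabulary would define its own.
MAPS_TO = "urn:lsid:example.org:predicates:mapsTo"

G1 = "urn:lsid:a.example.org:genes:G1"
G2 = "urn:lsid:a.example.org:genes:G2"
L100 = "urn:lsid:b.example.org:loci:100"
L101 = "urn:lsid:b.example.org:loci:101"
L200 = "urn:lsid:b.example.org:loci:200"

# (subject, predicate, object) statements; note G1 maps to two loci,
# so the relation is one-to-many rather than one-to-one.
triples = [
    (G1, MAPS_TO, L100),
    (G1, MAPS_TO, L101),
    (G2, MAPS_TO, L200),
]

expression_by_gene = {G1: 2.5, G2: 0.1}                     # dataset A
annotation_by_locus = {L100: "kinase", L101: "kinase-like"}  # dataset B


def index_objects(statements, predicate):
    """Map each subject to the set of objects linked by `predicate`."""
    index = defaultdict(set)
    for subject, pred, obj in statements:
        if pred == predicate:
            index[subject].add(obj)
    return index


def join_via_predicate(left, right, statements, predicate):
    """Join two keyed datasets through explicit relationship statements."""
    index = index_objects(statements, predicate)
    rows = []
    for left_id, left_value in left.items():
        for right_id in sorted(index.get(left_id, ())):
            if right_id in right:
                rows.append((left_id, right_id, left_value, right[right_id]))
    return rows


joined = join_via_predicate(expression_by_gene, annotation_by_locus,
                            triples, MAPS_TO)
```

The join yields one row per stated relationship, so a one-to-many mapping 
shows up as several rows instead of being silently collapsed.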

>To 
> further complicate matters, all namespaces evolve, adding, retiring,
> splitting and merging previously allocated identifiers, a process that
> reflects ongoing refinement of the underlying biology. There is no 
> systematic versioning
> of namespaces or datasets.

The LSID "contract" allows an LSID to name only one object, ever. If a 
newer version of the object is created, then a new LSID must be created, 
usually with the "version" element of the LSID incremented in some 
fashion. An LSID that refers to a concept (e.g. a protein name or a gene 
name) for which multiple data sets and/or versions of that data exist will 
generally not name a single object (i.e. it will have 0 bytes directly 
named by it). Instead it will have extensive metadata associated with it 
that points to other LSIDs naming "concrete" versions of objects, which do 
name actual bytes. The metadata would contain RDF predicates indicating, 
for example, that a related LSID is the latest version of the concept 
expressed, or that it is derived from data named by another LSID. 
Exploring the metadata web around a set of LSIDs would reveal the 
relationships needed to understand what data is there and how each piece 
relates to the rest. 
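
A small sketch of that arrangement, assuming an in-memory store: a concept 
LSID names zero bytes, and a hypothetical latestVersion predicate in its 
metadata leads to a concrete, versioned LSID. The predicate names, the 
example.org LSIDs and the sample sequences are illustrative only.

```python
# Hypothetical predicate names, not taken from any published vocabulary.
LATEST_VERSION = "urn:lsid:example.org:predicates:latestVersion"
DERIVED_FROM = "urn:lsid:example.org:predicates:derivedFrom"

CONCEPT = "urn:lsid:example.org:proteins:P1"   # concept, no bytes
V1 = "urn:lsid:example.org:proteins:P1:1"      # concrete version 1
V2 = "urn:lsid:example.org:proteins:P1:2"      # concrete version 2

# Each concrete LSID names exactly one, never-changing byte sequence.
data_store = {V1: b"MKV", V2: b"MKVL"}

# Metadata as (subject, predicate, object) statements.
metadata = [
    (CONCEPT, LATEST_VERSION, V2),
    (V2, DERIVED_FROM, V1),
]


def get_data(lsid):
    """Return the bytes named by an LSID; concepts name zero bytes."""
    return data_store.get(lsid, b"")


def latest_concrete(lsid):
    """Follow latestVersion links until an LSID has no newer version."""
    seen = set()
    while lsid not in seen:
        seen.add(lsid)
        targets = [o for s, p, o in metadata
                   if s == lsid and p == LATEST_VERSION]
        if not targets:
            return lsid
        lsid = targets[0]
    raise ValueError("cycle in latestVersion metadata")


current = latest_concrete(CONCEPT)
```

Because a new version gets a new LSID rather than overwriting the old one, 
anything that recorded V1 still resolves to the same bytes after V2 
appears.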


>  
> Clearly stable gene identifiers maintained by authoritative sources 
> and used by all
> producers of data would be a big help, but despite the best efforts 
> of the MODs (model organism databases), groups such as NCBI, EBI, 
> UCSC, etc., the problem of resolving

In an ideal world everyone would use the same LSIDs to identify the same 
concepts. It may well take time to get there, if we ever do. However, by 
using RDF and shared vocabularies now, we at least have a means of 
expressing the semantics between objects and concepts from different 
sources in a fashion that is machine (and then human) readable. Because 
objects are named unambiguously, the original data providers or perhaps 
third party databases (like LocusLink or OMIM) can accurately draw links 
between objects that are "sameAs" or "derivedFrom" or "relatedTo" etc.
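
As an illustration of what software could do with such links, the sketch 
below groups LSIDs connected by "sameAs" statements into equivalence 
classes using a small union-find. It assumes "sameAs" is meant to be 
symmetric and transitive; the example.org LSIDs and the function name are 
invented.

```python
def same_as_groups(links):
    """Group identifiers connected by sameAs links (union-find)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk to the root
        # with path halving to keep later lookups short.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sorted output so the result does not depend on link order.
    return sorted(sorted(group) for group in groups.values())


A = "urn:lsid:a.example.org:genes:G1"
B = "urn:lsid:b.example.org:loci:100"
C = "urn:lsid:c.example.org:entries:X9"
D = "urn:lsid:a.example.org:genes:G2"
E = "urn:lsid:b.example.org:loci:200"

# sameAs statements as they might arrive from different providers.
same_as_links = [(A, B), (C, B), (D, E)]
clusters = same_as_groups(same_as_links)
```

Links contributed independently by different sources merge into one 
cluster here only because every party names the objects with the same 
unambiguous identifiers.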


> identifier references still routinely crops up in day-to-day 
> bioinformatics work, and
> generates a lot of frustration and wasted time.
>  

My understanding is that this is why various folks got together to try to 
do something about it with LSIDs. The EBI was a co-sponsor of the 
standard.


> My question is, do LSIDs address this issue? Once the problem has 
> been translated from
> bioinformatics-speak to w3c-speak I can no longer tell. My 
> impression is that LSIDs are concerned more with hostname 
> independence rather than semantic equivalence.

Not at all.. or perhaps not only :-)

Kindest regards, Sean
Received on Tuesday, 4 May 2004 08:42:09 UTC