- From: Sean Martin <sjmm@us.ibm.com>
- Date: Tue, 4 May 2004 08:41:20 -0400
- To: SLetovsky@aol.com
- Cc: public-semweb-lifesci@w3.org
- Message-ID: <OF7725F40B.A857AF2F-ON85256E8A.000872D9-85256E8A.0045B451@us.ibm.com>
Hi Stan,

The LSID standard was created to help start solving many of the problems you outline. How data providers and end users go about implementing the spec (and heeding its social/technical contracts) will determine how successfully LSIDs will be able to do this. No doubt much is missing, but my feeling is that it is a reasonable start.

SLetovsky@aol.com wrote on 05/03/2004 08:05:38 PM:

> All,
>
> Dumb question for this LSID thread, since there seem to be
> people on it who understand the goals of LSID well. In my experience
> a critical problem in bioinformatics is a diversity of identifiers
> for the same thing (typically a gene or gene product sequence).

In general, it was intended that the LSID provide a prefix to many of the existing identifier schemes, to uniquely identify/disambiguate each object and exactly where it comes from: e.g. for a protein, urn:lsid:pdb.org:pdb:1AFT instead of the ASCII string 1AFT, or urn:lsid:ncbi.nih.gov:pubmed:12344 instead of PMID12344, etc. It is hoped that retrofitting LSIDs to most existing data sources/identifier schemes will be fairly obvious and easy. It should not require any great changes on the part of the data provider, as the LSID resolver software is usually layered above whatever data access mechanisms and database schemas exist for that data today.

Software will be able to identify and source objects by their unique names. These LSID names can be used by third party applications to associate additional metadata with those objects (i.e. annotate them accurately/unambiguously). If RDF (using widely shared vocabularies) is used to do this, this metadata can include the semantics between LSIDs as well as literal data.

> These identifiers
> typically come from different namespaces or from biological
> nomenclatures. A frequent and time-consuming problem is unifying
> datasets from different sources which refer to these
> objects using different symbols, necessitating a synonym-aware
> relational join.
> This sounds simpler than it is in practice; the synonym relations
> display all manner of cardinality other than the hoped-for one-to-one.

It is hoped that using standardized RDF predicates (themselves named by LSIDs or some other URI) in the metadata describing the relationships between objects named by LSIDs will define these relationships more accurately, more uniformly, and in a form that machines can read automatically. Would this start to solve the problem you detail?

> To further complicate matters, all namespaces evolve, adding, retiring,
> splitting and merging previously allocated identifiers, a process that
> reflects ongoing refinement of the underlying biology. There is no
> systematic versioning of namespaces or datasets.

The LSID "contract" allows for an LSID to name only one object, ever. If a newer version of the object is created, then a new LSID must be created, usually with the "version" element of the LSID incremented in some fashion.

If one is referring to an LSID which is a concept (e.g. a protein name or a gene name) for which multiple data sets and/or versions of that data exist, that LSID will generally not name a single object (i.e. it will have 0 bytes directly named by it). Instead, it will have extensive metadata associated with it that points to other LSIDs naming "concrete" versions of objects, which do name actual bytes. The metadata would contain RDF predicates indicating, for example, that a related LSID is the latest version of the concept expressed, or that it is derived from data named by another LSID, etc. Exploring the metadata web around a set of LSIDs would reveal all the relationships required to understand what data is there and how each item relates to the rest.
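To make the two ideas above a little more concrete (the layout of an LSID, and a concept LSID whose metadata points at concrete versions), here is a minimal Python sketch. It is only an illustration under stated assumptions: the example.org LSIDs, the predicate names latestVersion and derivedFrom, the in-memory triple list, and the helper names are all hypothetical, and the optional sixth colon-separated field is treated as the "version" element. A real deployment would fetch RDF metadata through an LSID resolver rather than from a hard-coded list.

```python
from typing import List, NamedTuple, Optional, Tuple


class LSID(NamedTuple):
    """Fields of urn:lsid:<authority>:<namespace>:<object>[:<version>]."""
    authority: str
    namespace: str
    object_id: str
    version: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its fields, rejecting anything malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID field in: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, version)


# Hypothetical metadata: (subject LSID, predicate, object LSID) triples.
# The first subject is a "concept" LSID that names no bytes itself.
Triple = Tuple[str, str, str]
TRIPLES: List[Triple] = [
    ("urn:lsid:example.org:genes:abc1", "latestVersion",
     "urn:lsid:example.org:genes:abc1:2"),
    ("urn:lsid:example.org:genes:abc1:2", "derivedFrom",
     "urn:lsid:example.org:genes:abc1:1"),
]


def objects_of(subject: str, predicate: str, triples: List[Triple]) -> List[str]:
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def latest_concrete(concept: str, triples: List[Triple]) -> Optional[str]:
    """Follow the (hypothetical) latestVersion link from a concept LSID."""
    found = objects_of(concept, "latestVersion", triples)
    return found[0] if found else None


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:pdb.org:pdb:1AFT"))
    print(latest_concrete("urn:lsid:example.org:genes:abc1", TRIPLES))
```

The point of the sketch is only that an unambiguous name gives software a stable key to hang relationships on; the vocabulary used for those relationships is whatever the community agrees to share.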
>
> Clearly stable gene identifiers maintained by authoritative sources
> and used by all producers of data would be a big help, but despite
> the best efforts of the MODs (model organism databases), groups such
> as NCBI, EBI, UCSC, etc., the problem of resolving

In an ideal world everyone would use the same LSIDs to identify the same concepts. It may well take time to get there, if we ever do. However, by using RDF and shared vocabularies now, we at least have a means of expressing the semantics between objects and concepts from different sources in a fashion that is machine (and then human) readable. By unambiguously naming objects, it is possible for the original data providers, or perhaps third party databases (like LocusLink or OMIM), to accurately draw links between objects that are "sameAs" or "derivedFrom" or "relatedTo" etc.

> identifier references still routinely crops up in day-to-day
> bioinformatics work, and generates a lot of frustration and
> wasted time.
>

My understanding is that this is why various folks got together to try and do something about it with LSIDs. The EBI was a co-sponsor of the standard.

> My question is, do LSIDs address this issue? Once the problem has
> been translated from bioinformatics-speak to w3c-speak I can no
> longer tell. My impression is that LSIDs are concerned more with
> hostname independence rather than semantic equivalence.

Not at all... or perhaps not only :-)

Kindest regards,
Sean
Received on Tuesday, 4 May 2004 08:42:09 UTC