Position Paper on the Life Sciences Identifiers (LSID)

Harold Solbrig

Mayo Clinic

solbrig@mayo.edu

This paper touches on four issues with the current LSID specification. The specific version of the specification that is referenced in this document is the OMG Life Sciences DTC document 04/05/01.

Topics:

The Semantics of Revisions
Incorporation of ISO OID's and DCE UUID's into the LSID namespace
Formalizing the metadata branch
Extending LSID's to represent biomedical ontologies, classification schemes and terminologies

1. The Semantics of Revisions

Section 8 of the OMG LSID specification indicates that "LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to.", yet the notion of "revision" seems to imply some sort of object identity that spans multiple revisions.

The intent of the specification isn't clear regarding the semantics of the revision identifier. With the exception of assignLSIDForNewRevision, revisions appear to be opaque. If the revision identifiers are removed from two LSID's, can anything be inferred if the resulting LSID's are lexically identical? Does the authority doing the LSID identifier assignment have any obligation to issue the same LSID authority / namespace / object name for two objects that are incremental versions of each other? Is the authority precluded from issuing an LSID that varies only in the revision component if the targets aren't revisions of the same object?

Because of the difficulty involved in answering the above questions, as well as issues that may be introduced when using different types of identifiers (see: ISO OID/DCE UUID discussion below), we propose the following changes:

State that the revision component is strictly used for forming an identifier, and that no semantics may be assigned to it. There can be no significance assigned to the fact that two LSID's are (or aren't) identical when the revision component is removed.
Remove any assertions or implications in the LSID documentation regarding how revision identifiers can be ordered. While the specification itself is fairly neutral, some of the accompanying documentation asserts that revision order may correlate with the sort order of the identifier itself.
Create a new method for the LSID resolution service:
```
    // Possible outcomes of comparing two LSIDs
    enum Compare_response {
        IDENTICAL_LSID_OBJECT,
        LSID1_VERSION_EARLIER_THAN_LSID2,
        LSID1_VERSION_LATER_THAN_LSID2,
        DIFFERENT_LSID_OBJECTS,
        UNABLE_TO_DETERMINE
    };

    // Proposed addition to the LSID resolution service
    Compare_response compare(LSID lsid1, LSID lsid2);
```
Where IDENTICAL_LSID_OBJECT asserts that both lsid1 and lsid2 would return identical data from the resolver service. Note that this could still be true if the two identifiers were not lexically identical. LSID1_VERSION_EARLIER_THAN_LSID2 and LSID1_VERSION_LATER_THAN_LSID2 would assert that the two LSID's reference different revisions of the same object. The semantics of "same" will be the responsibility of the resolution service and will not be addressed further in the specification. DIFFERENT_LSID_OBJECTS would assert that the LSID's do not represent the same object, however "same" is determined. UNABLE_TO_DETERMINE indicates that the service doesn't have sufficient knowledge about one or both of the objects to be able to make any assertions at all about sameness.
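As an illustration only, the behavior described above could be sketched in Python as follows. The ToyResolver class, its register method, the object keys, the revision ranks and the example.org LSID's are all hypothetical; the proposal deliberately leaves the determination of "same" to each resolution service.

```python
from enum import Enum, auto


class CompareResponse(Enum):
    """Mirrors the values of the proposed Compare_response enumeration."""
    IDENTICAL_LSID_OBJECT = auto()
    LSID1_VERSION_EARLIER_THAN_LSID2 = auto()
    LSID1_VERSION_LATER_THAN_LSID2 = auto()
    DIFFERENT_LSID_OBJECTS = auto()
    UNABLE_TO_DETERMINE = auto()


class ToyResolver:
    """Hypothetical resolver: it alone knows which object an LSID names
    and where that revision falls in the object's history."""

    def __init__(self):
        # LSID string -> (object key, revision rank). Both values are private
        # bookkeeping and are never derived from the text of the LSID.
        self._known = {}

    def register(self, lsid, object_key, rank):
        self._known[lsid] = (object_key, rank)

    def compare(self, lsid1, lsid2):
        first = self._known.get(lsid1)
        second = self._known.get(lsid2)
        if first is None or second is None:
            return CompareResponse.UNABLE_TO_DETERMINE
        if first[0] != second[0]:
            return CompareResponse.DIFFERENT_LSID_OBJECTS
        if first[1] == second[1]:
            return CompareResponse.IDENTICAL_LSID_OBJECT
        if first[1] < second[1]:
            return CompareResponse.LSID1_VERSION_EARLIER_THAN_LSID2
        return CompareResponse.LSID1_VERSION_LATER_THAN_LSID2


resolver = ToyResolver()
# Made-up LSID's: the revision strings do not sort in revision order, and the
# two revisions of "sequence-1" differ in more than the revision component.
resolver.register("urn:lsid:example.org:seq:abc:zeta", "sequence-1", 1)
resolver.register("urn:lsid:example.org:seq:xyz:alpha", "sequence-1", 2)
resolver.register("urn:lsid:example.org:seq:abc:beta", "sequence-2", 1)

print(resolver.compare("urn:lsid:example.org:seq:abc:zeta",
                       "urn:lsid:example.org:seq:xyz:alpha"))
```

In this sketch the client learns about revision order only by asking the service, which is consistent with treating the revision component as carrying no semantics of its own.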

2. Incorporation of ISO OID's and DCE UUID's into the LSID namespace

Questions have been raised regarding possible relationships between other forms of unique identifiers and the LSID. One preliminary question that needs to be resolved first is whether there are any existing situations where ISO OID's or DCE UUID's are used to name biomedical entities. ISO OID's are currently used by the ANSI-accredited standards organization, Health Level Seven (HL7), in their version 3 messaging structure, but it isn't obvious that this use overlaps with the scope of the LSID. The remainder of this section assumes that legitimate use cases exist.

Background

There are three major identifier assignment schemes in use today:

DNS - the Domain Name System - a hierarchical assignment scheme used in the internet. While popular, the DNS is the most fragile of the assignment schemes because of (a) its close coupling and resulting confusion with URL's and (b) the impermanence of DNS identifier assignment. A sample DNS identifier is "informatics.mayo.edu".

ISO OID - ISO object identifiers as specified in section 31 of ITU-T X.680 (07/2002) Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation. ISO object identifiers consist of a sequence of dot-separated integers that represent a hierarchy of namespace assignment authorities. As an example, the ISO identifier 2.16.840.1.113883 can be broken down as:

            2   - joint-iso-itu-t
           16   - country assignments
          840   - United States
            1   - US company
       113883   - Health Level Seven, Inc (HL7)

which represents the ANSI-accredited healthcare standards organization, Health Level Seven (HL7). HL7 can allocate its own identifiers under its base node - 2.16.840.1.113883.6.66, for example, represents the Unified Medical Language System (UMLS) - a reference terminology for healthcare.

It should be noted that the ASN.1 specification also defines a human readable form (name form) of the object identifier, but the use of this form has been deprecated in recent releases of the specification in favor of using the number form exclusively. ISO identifiers tend to be more resilient because of the permanence of the assignment tree. They occur less frequently, however, due to readability issues and the fact that ISO charges a stiff fee to formally register these identifiers.

DCE UUID - The Universal Unique Identifier (UUID), as specified by the OSF Distributed Computing Environment (DCE) - a 128-bit opaque identifier that is generated using a combination of a network MAC address, a timestamp and a random number. UUID's can be represented in either binary or hexadecimal notation. The hexadecimal notation takes the form "hhhhhhhh-hhhh-hhhh-hhhh-hhhhhhhhhhhh" where 'h' is a hexadecimal digit. As an example, the UUID 5d1cb710-1c4b-11d4-bed5-005004b1f42f represents a component in the TortiseCVS system. DCE UUID's are pervasive throughout the Microsoft Windows operating system (although Microsoft calls them "GUID's"). Anyone with a computer can create an almost limitless supply of UUID's on demand. Like ISO OID's, UUID's are not human-readable. Unlike OID's and DNS identifiers, however, UUID's do not require any consultation with a central authority, which often makes it more difficult to determine whether similar or identical objects already exist.

All three of these identifiers have assigned URN schemes:

Scheme	URN prefix	Example
DNS	urn:dns:	urn:dns:informatics.mayo.edu
ISO	urn:oid:	urn:oid:2.16.840.1.113883.6.56
DCE	urn:uuid:	urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
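To make the contrast between the hierarchical OID and the opaque UUID concrete, here is a small Python sketch. The helper names oid_arcs and is_under are hypothetical; the uuid module is part of the Python standard library.

```python
import uuid


def oid_arcs(oid):
    """Split a dotted OID into its integer arcs, e.g. '2.16.840' -> (2, 16, 840)."""
    return tuple(int(arc) for arc in oid.split("."))


def is_under(oid, ancestor):
    """True when `oid` sits at or below `ancestor` in the assignment tree."""
    arcs, root = oid_arcs(oid), oid_arcs(ancestor)
    return arcs[:len(root)] == root


HL7_ROOT = "2.16.840.1.113883"        # HL7 base node from the breakdown above
UMLS_OID = "2.16.840.1.113883.6.66"   # allocated by HL7 under its own node

# The assigning authority of an OID can be read off its prefix.
print(is_under(UMLS_OID, HL7_ROOT))

# A UUID can be parsed and validated, but it carries no authority information.
sample = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
print(sample.urn)

# New UUID's can be minted locally without consulting anyone.
print(uuid.uuid4().urn)
```

The prefix test is only possible for the OID, which illustrates why knowledge of the assigning authority has to be kept outside a UUID.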

The DNS scheme is already used in the LSID and won't be discussed further.

Identifiers and LSID's

There are two different situations that need to be considered when discussing ISO OIDs and DCE UUIDs with respect to the LSID. The first situation is where the OID or UUID occurs in its "pure" form (e.g. urn:iso:2.16.840.1.113883.6.56 or urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6). The question that needs to be answered here is whether there is anything to be gained by switching from the "uuid" or "iso" scope to "lsid", or whether it would make as much sense to leave the identifiers as they stand. We would argue in favor of the switch, because it allows the use of the additional semantics (e.g. resolution, assignment) embodied in the "lsid" prefix. To incorporate "pure" identifiers, we propose that the LSID specification reserve the following authority identifiers:

Authority identifier	Identification scheme	Example
org.dns	DNS	urn:lsid:org.dns::informatics.mayo.edu:
org.iso	ISO OID	urn:lsid:org.iso::2.16.840.1.113883.6.56:
org.dce	DCE UUID	urn:lsid:org.dce::f81d4fae-7dec-11d0-a765-00a0c91e6bf6:

An alternative approach would be to expand the definition of the resolution service to include non-LSID identifiers, but this would still leave open the question of when to invoke the service.

The second situation is more common - the situation where OID's or UUID's are already intermingled with other namespaces, object identifiers or revision information. This is the situation within HL7, where ISO OID's are intermingled with external tokens. The HL7 identifier for the UMLS concept "malignant neoplasm", for example, is urn:iso:2.16.840.1.113883.6.66#C0006826 and the identifier for the ICD-9-CM classification concept "Asthma, unspecified" is urn:iso:2.16.840.1.113883.6.2#493.9.

Possibilities:

ISO Identifiers - Two options might be considered when integrating ISO identifiers and LSIDs:
1. Consider the ISO identifier entirely opaque - a namespace managed by the owning authority. Using this approach, an HL7 identifier for the UMLS might take one of the forms:
```
                
                    urn:lsid:v3.hl7.org:2.16.840.1.113883:6.66
                            -or-
                    urn:lsid:v3.hl7.org:2.16.840.1.113883.6:66
                            -or-
                    urn:lsid:v3.hl7.org:2.16.840.1.113883.6.66:
```
  It would be the responsibility of the authority, which is HL7 in the above example, to decide how the identifier is assigned. NOTE: the third form above is not a legal LSID as it exists today - see the proposal at the end of this section.
2. Treat the ISO identifier as a different authority. Using this form, the LSID for the UMLS would be:
```
                    urn:lsid:2.16.840.1.113883.6.66::
```
  And the concept code for "malignant neoplasm" would be represented as:
```
                    urn:lsid:2.16.840.1.113883.6.66:C0006826
```
  A potential drawback of this approach is that it may not generalize, as there is no guarantee that the namespaces of OID's, DNS names, etc. will always be distinct.
DCE identifiers - Unlike DNS and ISO identifiers, DCE UUIDs are totally opaque. As there is no embedded information about the source of the assignment, knowledge of domain and assignment authorities has to be external to the UUID itself. While the LSID is "semantically opaque", we believe that the partitioning of the identifier into authority, namespace and object (not revision - see Issue 1 above) renders it considerably more approachable from the assignment perspective. We would propose that DCE UUID encoding be managed by an authority. The identifier for the TortiseCVS component, for example, might be represented as:
```
                urn:lsid:www.tortisecvs.org:software:FD09CEFE-3502-47d6-908D-F92428A27F64
```
  Note that the "software" namespace is somewhat contrived, and perhaps should be omitted.
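As an illustration of the two ISO options discussed above, the following Python sketch derives the candidate LSID forms from an HL7-style identifier. The function names and the regular expression are hypothetical; the authority "v3.hl7.org" and the identifiers are the ones used in the examples in this section.

```python
import re

# HL7-style identifier of the form urn:iso:<oid>#<concept code>
HL7_ID = re.compile(r"^urn:iso:(?P<oid>\d+(?:\.\d+)*)#(?P<code>.+)$", re.IGNORECASE)


def split_hl7_identifier(identifier):
    """Return (oid, concept code) or raise ValueError for other inputs."""
    match = HL7_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an HL7-style identifier: {identifier!r}")
    return match.group("oid"), match.group("code")


def opaque_forms(authority, root_oid, oid):
    """Option 1: the authority decides where the OID is cut into
    namespace and object; every cut at or below its root is listed."""
    root = root_oid.split(".")
    arcs = oid.split(".")
    if arcs[:len(root)] != root:
        raise ValueError(f"{oid} is not under {root_oid}")
    forms = []
    for cut in range(len(root), len(arcs) + 1):
        namespace = ".".join(arcs[:cut])
        obj = ".".join(arcs[cut:])   # empty when the whole OID is the namespace
        forms.append(f"urn:lsid:{authority}:{namespace}:{obj}")
    return forms


def oid_as_authority(identifier):
    """Option 2: the OID itself takes the place of the authority."""
    oid, code = split_hl7_identifier(identifier)
    return f"urn:lsid:{oid}:{code}"


for form in opaque_forms("v3.hl7.org", "2.16.840.1.113883", "2.16.840.1.113883.6.66"):
    print(form)
print(oid_as_authority("urn:iso:2.16.840.1.113883.6.66#C0006826"))
```

The last form produced under option 1 has an empty object component, which is the case that requires the syntax change proposed below.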

Proposal

1) Change the LSID syntax to allow an empty namespace or object identifier (but not both). This would allow pure identifiers to be represented as:

          urn:lsid:org.iso::2.16.840.1.113883.6.56
                    or
          urn:lsid:org.iso:2.16.840.1.113883.6.56:

                    and

          urn:lsid:org.dce:FD09CEFE-3502-47d6-908D-F92428A27F64:
                    or
          urn:lsid:org.dce::FD09CEFE-3502-47d6-908D-F92428A27F64

This would allow other mixtures as described above. The decision as to whether the identifier represented a namespace or an object would be up to the original assigner.

2) State that the mapping between ISO OID's and DCE UUID's is the responsibility of the authority that is ultimately responsible for the objects represented by the identifiers.
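A minimal Python sketch of the relaxed syntax in proposal (1) follows. It assumes the usual urn:lsid:authority:namespace:object layout with an optional trailing revision component; the LSID tuple and the parse_relaxed_lsid function are hypothetical names, not part of the specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str             # may be empty under the proposal
    object_id: str             # may be empty under the proposal
    revision: Optional[str]    # None when absent or empty


def parse_relaxed_lsid(text):
    """Parse an LSID, allowing an empty namespace or an empty object
    identifier, but rejecting an LSID in which both are empty."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"wrong number of components: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = (parts[5] or None) if len(parts) == 6 else None
    if not authority:
        raise ValueError("the authority may not be empty")
    if not namespace and not object_id:
        raise ValueError("namespace and object may not both be empty")
    return LSID(authority, namespace, object_id, revision)


print(parse_relaxed_lsid("urn:lsid:org.iso::2.16.840.1.113883.6.56"))
print(parse_relaxed_lsid("urn:lsid:org.dce:FD09CEFE-3502-47d6-908D-F92428A27F64:"))
```

Whether the OID or UUID lands in the namespace or the object position is left to the assigner, so the parser accepts both placements.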

3. Formalizing the metadata branch

The LSID document makes some reference to what might be included in the metadata branch. Page 11 states, "The Metadata_document is (usually) a string containing the metadata itself. It is considered out of the scope of this specification to restrict the number of formats the metadata can be returned in. The most popular and expected formats are, however, RDF and XMI." Page 10 asserts, "This method is used to return a document containing the metadata associated with a particular LSID at this particular data retrieval service. Note that this means that calling getMetadata on two different data services may yield different metadata since each service may contain different metadata about the same lsid."

"Metadata" covers a broad spectrum of meaning. There are at least three classes of metadata that would be useful to be able to retrieve along with an LSID:

Provenance - who created the information, why, how, where and when it was created along with review, revision, etc
Structural - how is the information represented? What is the format, and what versions of schemas, tables, specifications, etc. need to be used to understand the information content? Structural information is dependent upon the resource. ASN.1 BER encodings have a radically different structure than, say, an HTML web page. It is important, however, that the LSID Resolution Service be able to supply enough information that a client will be able to render the information in a known format.
Definitional - what is this information about? What organism(s) does it represent, what components, disease state, process, etc. is associated with it. Today's biological data is beginning to come pre-annotated with references into the Gene Ontology (GO), the NCI Metathesaurus or other ontologies, classification schemes and terminologies. This information needs to be viewed as an integral part of the data itself and efforts should be made to keep these integrated.

We believe that it is necessary to begin to include at least baseline representations, if not requirements, for this information within an extended LSID specification. There is little value in being able to retrieve a stream of bytes associated with a specific identifier ten years in the future if the information about how the data has been encoded has been lost. Similarly, there is little use in being able to retrieve a sequence of nucleotides if one doesn't know what organism and locus they were derived from.

Recommendations:

Provenance - The LSID specification should select a set of elements from the Dublin Core. The specification should then (a) recommend that, whenever possible, information sources include this information in the metadata branch of the LSID response and (b) require that all implementations be able to return this same information when asked. Note that implementations can still add information, but they cannot remove information as supplied by the originator.
Structural - We do not believe that the LSID can mandate a specific representational format. We do believe, however, that the specification should define the following metadata elements:
1. A mandatory element that describes the format of the data represented by the LSID
2. An optional element that will return the appropriate structural definition (if any) for the supplied format. As an example, if the format of the data was "text/xml", the structural definition would be expected to take the form of a corresponding XML DTD (?) or Schema.
3. An optional element that can identify a resource that contains rendering tools. As an example, a GenBank resource encoded in ASN.1 might contain the URL where the appropriate decoding software can be downloaded.
4. An optional element that allows the service to list other available formats for the information represented by the LSID and the corresponding identifiers for each of these formats.
We see item (4) above as a simpler solution than trying to incorporate multiple renderings for the same LSID.
Definitional - Define an optional set of metadata elements that can uniquely identify a classification scheme (ontology, terminology), the version of the scheme and a list of concept codes (class, term) drawn from these schemes. The specification should (a) recommend that the original information be accompanied by this information where appropriate and (b) require that *all* implementations return the same set of metadata elements that were provided by the original source.
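One possible shape for these recommendations is sketched below in Python. Every class, field and function name here is hypothetical, as is the choice of the Dublin Core elements "creator", "date" and "publisher" and all of the sample values; the sketch only shows how the mandatory format element, the optional structural elements and the "may add but not remove" rule could fit together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ConceptReference:
    scheme: str                # identifies the ontology, classification or terminology
    scheme_version: str
    concept_codes: List[str]


@dataclass
class LsidMetadata:
    data_format: str                                   # structural item 1 (mandatory)
    provenance: Dict[str, str] = field(default_factory=dict)   # Dublin Core elements
    structural_definition: Optional[str] = None        # structural item 2, e.g. a schema
    rendering_tools: Optional[str] = None              # structural item 3, e.g. a URL
    alternate_formats: Dict[str, str] = field(default_factory=dict)  # item 4: format -> LSID
    concepts: List[ConceptReference] = field(default_factory=list)   # definitional


def preserves_originator_metadata(original, relayed):
    """True when a relaying service kept every provenance element and
    every concept reference that the originator supplied."""
    provenance_kept = all(
        relayed.provenance.get(name) == value
        for name, value in original.provenance.items()
    )
    concepts_kept = all(ref in relayed.concepts for ref in original.concepts)
    return provenance_kept and concepts_kept


original = LsidMetadata(
    data_format="text/xml",
    provenance={"creator": "Example Lab", "date": "2004-05-01"},
    concepts=[ConceptReference("urn:iso:2.16.840.1.113883.6.66",
                               "example-version", ["C0006826"])],
)
enriched = LsidMetadata(
    data_format="text/xml",
    provenance={**original.provenance, "publisher": "Example Mirror"},
    concepts=list(original.concepts),
)
stripped = LsidMetadata(data_format="text/xml")

print(preserves_originator_metadata(original, enriched))
print(preserves_originator_metadata(original, stripped))
```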

4. Extending LSID's to represent biomedical ontologies, classification schemes and terminologies (OCT's)

Background

Recommendation 3 above discusses the ability to unambiguously identify concept codes, classes or terms from the corresponding classification scheme, ontology or terminology. At the moment there is no universally accepted approach to doing so. Existing approaches include:

The NamingAuthority section of the OMG Lexicon Query Services (LQS) specification: While this specification preceded the development of URN's, it is quite similar. A concept identifier takes the format: <registration authority>:<naming entity>:<local name>, where <registration authority> was one of "DNS", "DCE", "ISO", "IR" or "Other", which indicated that the naming entity was assigned by the domain name system, a DCE UUID, an ISO OID, an OMG Interface Repository identifier or none of the above. <naming entity> and <local name> were assigned respectively.
The Unified Medical Language System (UMLS): The UMLS is managed by the National Library of Medicine and represents a collection of "2.8 million concept names from more than 100 controlled vocabularies". All 2.8 million names are mapped into a single namespace. Using this approach, there is only one namespace, the UMLS.
RDF Schema / OWL namespaces: The W3C uses namespaces and namespace identifiers to name ontologies and classification systems. As an example, the "wine ontology" that is used throughout the W3C examples has been assigned the URI "http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#". This URI is then used within RDF Schema and OWL documents by:
1. assigning an arbitrary namespace identifier (e.g. xmlns:vin = "http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#")
2. assigning an "ENTITY" definition for attributes (e.g. <!ENTITY vin "http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#" >)
And then using them throughout the document (e.g. <owl:allValuesFrom rdf:resource="&vin;Winery" /> and <vin:Winery rdf:ID="Foxen" />)
Health Level Seven (HL7) Version 3: HL7 assigns an ISO OID to each code system if one isn't already supplied. Concept identifiers are represented in the form: URN:ISO:<oid>#<concept code>

None of the approaches described above are totally adequate. The LQS specification pre-dates the URN syntax and there is ambiguity about what constitutes the naming entity and what constitutes the local name. The UMLS approach requires a well-funded central authority which doesn't scale well in a distributed environment. The RDF Schema approach is XML-centric and says nothing about how namespace identifiers are assigned. None of the specifications address revisions in a meaningful way.

The LSID syntax and functional behavior seem to fit many of the baseline requirements of classification ids. The LSID specification mentions classes on page 8 - "An LSID usually represents a piece of data, but it is allowed to have LSIDs representing an(sic) abstract entities or concepts. ... If an LSID represents an abstract entity the LSID resolution service must always resolve an empty result". While this leaves the door open, we need to define a service that delivers more than an empty result and optional metadata.

Recommendation

The LSID syntax is a good fit with the requirements of biomedical OCTs. The resolution, resolution discovery and assigning services also provide a good fit if you treat "object" as synonymous with "class" or "concept". The advantage of including OCTs in the LSID scope statement would be a reduction in the number of interfaces and tools that would need to be produced. There are several disadvantages, including the risk of scope creep for the core LSID project and the fact that OCTs will need further specification when it comes to format, classification, and other services. We suggest that these issues be opened for discussion and further refinement.