This paper touches on four issues with the current LSID specification. The specific version of the specification that is referenced in this document is the OMG Life Sciences DTC document 04/05/01.
Section 8 of the OMG LSID specification indicates that "LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to.", yet the notion of "revision" seems to imply some sort of object identity that spans multiple revisions.
The intent of the specification isn't clear regarding the semantics of the revision identifier. With the exception of assignLSIDForNewRevision, revisions appear to be opaque. If the revision identifiers are removed from two LSID's, can anything be inferred if the resulting LSID's are lexically identical? Does the authority doing the LSID assignment have any obligation to issue the same LSID authority / namespace / object name for two objects that are incremental versions of each other? Is the authority precluded from issuing an LSID that varies only in the revision component if the targets aren't revisions of the same object?
Because of the difficulty involved in answering the above questions, as well as issues that may be introduced when using different types of identifiers (see the ISO OID/DCE UUID discussion below), we propose the following changes:
    Compare_response compare(LSID lsid1, LSID lsid2)

    enum Compare_response {
        IDENTICAL_LSID_OBJECT;
        LSID1_VERSION_EARLIER_THAN_LSID2;
        LSID1_VERSION_LATER_THAN_LSID2;
        DIFFERENT_LSID_OBJECTS;
        UNABLE_TO_DETERMINE;
    };

Where IDENTICAL_LSID_OBJECT asserts that both lsid1 and lsid2 would return identical data from the resolver service. Note that this could still be true if the two identifiers were not lexically identical. LSID1_VERSION_EARLIER_THAN_LSID2 and LSID1_VERSION_LATER_THAN_LSID2 would assert that the two LSID's reference different revisions of the same object. The semantics of "same" will be the responsibility of the resolution service and will not be addressed further in the specification. DIFFERENT_LSID_OBJECTS would assert that the LSID's do not represent the same object, however "same" is determined. UNABLE_TO_DETERMINE indicates that the service doesn't have sufficient knowledge about one or both of the objects to be able to make any assertions at all about sameness.
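As a rough illustration of the proposed compare operation, the following Python sketch shows how a resolution service might answer it. It is not part of the LSID specification: the in-memory registry, the `(object key, revision ordinal)` pair it stores, and the example LSID's are all assumptions made for the example, standing in for whatever notion of "same" a real service would apply.

```python
from enum import Enum
from typing import Dict, Tuple


class CompareResponse(Enum):
    IDENTICAL_LSID_OBJECT = 1
    LSID1_VERSION_EARLIER_THAN_LSID2 = 2
    LSID1_VERSION_LATER_THAN_LSID2 = 3
    DIFFERENT_LSID_OBJECTS = 4
    UNABLE_TO_DETERMINE = 5


# Hypothetical service-side knowledge: LSID -> (object key, revision ordinal).
# The object key encodes the service's own definition of "same object".
Registry = Dict[str, Tuple[str, int]]

EXAMPLE_REGISTRY: Registry = {
    "urn:lsid:example.org:genes:abc:1": ("gene-abc", 1),
    "urn:lsid:example.org:genes:abc:2": ("gene-abc", 2),
    # Lexically different LSID that the service knows to be the same data.
    "urn:lsid:mirror.example.org:genes:abc:1": ("gene-abc", 1),
    "urn:lsid:example.org:genes:xyz:1": ("gene-xyz", 1),
}


def compare(registry: Registry, lsid1: str, lsid2: str) -> CompareResponse:
    """Answer the proposed compare() using only what the service knows."""
    entry1 = registry.get(lsid1)
    entry2 = registry.get(lsid2)
    if entry1 is None or entry2 is None:
        # The service cannot say anything about an LSID it does not know.
        return CompareResponse.UNABLE_TO_DETERMINE
    (object1, revision1), (object2, revision2) = entry1, entry2
    if object1 != object2:
        return CompareResponse.DIFFERENT_LSID_OBJECTS
    if revision1 == revision2:
        return CompareResponse.IDENTICAL_LSID_OBJECT
    if revision1 < revision2:
        return CompareResponse.LSID1_VERSION_EARLIER_THAN_LSID2
    return CompareResponse.LSID1_VERSION_LATER_THAN_LSID2


if __name__ == "__main__":
    print(compare(EXAMPLE_REGISTRY,
                  "urn:lsid:example.org:genes:abc:1",
                  "urn:lsid:example.org:genes:abc:2"))
```

The point of the sketch is that the answer comes from the service's knowledge rather than from string comparison of the identifiers, which is what keeps the revision component opaque to clients.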
Questions have been raised regarding possible relationships between other forms of unique identifiers and the LSID. One preliminary question that needs to be resolved first is whether there are any existing situations where ISO OID's or DCE UUID's are used to name biomedical entities. ISO OID's are currently used by the ANSI standards organization, Health Level Seven (HL7), in their version 3 messaging structure, but it isn't obvious that this use overlaps with the scope of the LSID. The remainder of this section assumes that legitimate use cases exist.
DNS - the Domain Name System
- a hierarchical assignment scheme used on the Internet. While popular, the DNS is the most fragile of the assignment schemes because of (a) its close coupling and resulting confusion with URL's and (b) the impermanence of DNS identifier assignment. A sample DNS identifier is "informatics.mayo.edu".
ISO OID - ISO object identifiers as specified in section 31 of ITU-T X.680 (07/2002), Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation
- ISO object identifiers consist of a sequence of dot-separated integers that represent a hierarchy of namespace assignment authorities. As an example, the ISO identifier 2.16.840.1.113883 can be broken down as:

    2 - joint-iso-itu-t
    16 - country assignments
    840 - United States
    1 - US company
    113883 - Health Level Seven, Inc (HL7)

which represents the ANSI healthcare standards organization, Health Level Seven (HL7). HL7 can allocate its own identifiers under its base node: 2.16.840.1.113883.6.66, for example, represents the Unified Medical Language System (UMLS), a reference terminology for healthcare.
It should be noted that the ASN.1 specification also defines a human-readable form (the name form) of the object identifier, but the use of this form has been deprecated in recent releases of the specification in favor of using the number form exclusively. ISO identifiers tend to be more resilient because of the permanence of the assignment tree. They occur less frequently, however, due to readability issues and the fact that ISO charges a stiff fee to formally register these identifiers.
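The hierarchical structure of the number form can be made concrete with a short Python sketch. The helper names `parse_oid` and `is_under` are hypothetical and only illustrate how the dot-separated arcs express delegation from one assigning authority to the next; the OID values are the ones quoted above.

```python
from typing import Tuple


def parse_oid(oid: str) -> Tuple[int, ...]:
    """Split a number-form OID such as '2.16.840.1.113883' into integer arcs."""
    arcs = oid.split(".")
    if len(arcs) < 2 or not all(a.isascii() and a.isdigit() for a in arcs):
        raise ValueError(f"not a number-form OID: {oid!r}")
    return tuple(int(a) for a in arcs)


def is_under(child: str, parent: str) -> bool:
    """True if `child` was allocated somewhere beneath the `parent` node."""
    child_arcs, parent_arcs = parse_oid(child), parse_oid(parent)
    return (len(child_arcs) > len(parent_arcs)
            and child_arcs[:len(parent_arcs)] == parent_arcs)


HL7_ROOT = "2.16.840.1.113883"        # HL7 base node, from the text above
UMLS_OID = "2.16.840.1.113883.6.66"   # allocated by HL7 for UMLS

if __name__ == "__main__":
    print(parse_oid(HL7_ROOT))
    print(is_under(UMLS_OID, HL7_ROOT))
```

Comparing arcs as integers rather than comparing strings avoids false matches such as treating 2.16.840.1.1138830 as a child of 2.16.840.1.113883.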
DCE UUID - The Universal Unique Identifier (UUID), as specified by the OSF Distributed Computing Environment (DCE)
- a 128-bit opaque identifier that is generated using a combination of a network MAC address, a timestamp and a random number. UUID's can be represented in either binary or hexadecimal notation. The hexadecimal notation takes the form "hhhhhhhh-hhhh-hhhh-hhhh-hhhhhhhhhhhh" where 'h' is a hexadecimal digit. As an example, the UUID 5d1cb710-1c4b-11d4-bed5-005004b1f42f represents a component in the TortoiseCVS system.
DCE UUID's are pervasive throughout the Microsoft Windows operating system (although Microsoft calls them "GUID's"). Anyone with a computer can create an almost limitless supply of UUID's on demand. Like ISO OID's, UUID's are not human readable. Unlike OID's and DNS identifiers, however, UUID's do not require any consultation with a central authority, which often makes it more difficult to determine whether similar or identical objects already exist.
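The layout described above can be inspected with Python's standard `uuid` module. This is a minimal sketch: the module follows RFC 4122, which is assumed here to be compatible with the DCE layout, and the helper name `describe_uuid` is made up for the example.

```python
import uuid
from typing import Dict


def describe_uuid(text: str) -> Dict[str, object]:
    """Parse the hexadecimal notation and expose the fields discussed above."""
    u = uuid.UUID(text)
    return {
        "version": u.version,            # 1 means time-and-node based
        "node": format(u.node, "012x"),  # 48-bit node field (MAC address in version 1)
        "binary_length": len(u.bytes),   # 16 bytes, i.e. 128 bits
        "urn": u.urn,                    # the urn:uuid: form of the same identifier
    }


if __name__ == "__main__":
    # The example UUID quoted in the text.
    print(describe_uuid("5d1cb710-1c4b-11d4-bed5-005004b1f42f"))
    # No central authority is consulted when minting a new identifier.
    print(uuid.uuid4())
```

Because identifiers are minted locally, nothing in the value itself says who assigned it or what it names, which is the discovery problem noted above.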
All three of these identifiers have assigned URN schemes:
Scheme | URN prefix | Example |
---|---|---|
DNS | urn:dns | urn:dns:informatics.mayo.edu |
ISO | urn:oid: | urn:oid:2.16.840.1.113883.6.56 |
DCE | urn:uuid: | urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6 |
The DNS scheme is already used in the LSID and won't be discussed further.
There are two different situations that need to be considered when discussing ISO OIDs and DCE UUIDs with respect to the LSID. The first situation is where the OID or UUID occurs in its "pure" form (e.g. urn:iso:2.16.840.1.113883.6.56 or urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6). The question that needs to be answered here is whether there is anything to be gained by switching from the "uuid" or "iso" scope to "lsid", or whether it would make as much sense to leave the identifiers as they stand. We would argue in favor of the switch, because it allows the use of the additional semantics (e.g. resolution, assignment) embodied in the "lsid" prefix. To incorporate "pure" identifiers, we propose that the LSID specification reserve the following authority identifiers:
Authority identifier | Identification scheme | Example |
---|---|---|
org.dns | DNS | urn:lsid:org.dns::informatics.mayo.edu: |
org.iso | ISO OID | urn:lsid:org.iso::2.16.840.1.113883.6.56: |
org.dce | DCE UUID | urn:lsid:org.dce::f81d4fae-7dec-11d0-a765-00a0c91e6bf6: |
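The wrapping proposed in the table can be sketched as a small Python function. It assumes the pattern shown in the examples (reserved authority, empty namespace, the original identifier as the object, empty revision); the function name and the scheme-to-authority mapping are illustrative, and both `oid` and `iso` are accepted as scheme names because this paper uses both spellings.

```python
# Reserved authority identifiers proposed above, keyed by URN scheme.
RESERVED_AUTHORITIES = {
    "dns": "org.dns",
    "oid": "org.iso",
    "iso": "org.iso",
    "uuid": "org.dce",
}


def pure_urn_to_lsid(urn: str) -> str:
    """Wrap a 'pure' urn:<scheme>:<identifier> in the proposed LSID form."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[2]:
        raise ValueError(f"not a urn:<scheme>:<identifier> string: {urn!r}")
    scheme, identifier = parts[1].lower(), parts[2]
    try:
        authority = RESERVED_AUTHORITIES[scheme]
    except KeyError:
        raise ValueError(f"no reserved authority for scheme {scheme!r}") from None
    # Empty namespace and empty revision, as in the table's examples.
    return f"urn:lsid:{authority}::{identifier}:"


if __name__ == "__main__":
    print(pure_urn_to_lsid("urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6"))
    print(pure_urn_to_lsid("urn:iso:2.16.840.1.113883.6.56"))
```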
The second situation is more common - the situation where OID's or UUID's are already intermingled with other namespaces, object identifiers or revision information. This is the situation within HL7, where ISO OID's are intermingled with external tokens. The HL7 identifier for the UMLS concept "malignant neoplasm", for example, is urn:iso:2.16.840.1.113883.6.66#C0006826 and the ICD-9-CM classification concept "Asthma, unspecified" is urn:iso:2.16.840.1.113883.6.2#493.9.
Possibilities:
urn:lsid:v3.hl7.org:2.16.840.1.113883:6.66
-or-
urn:lsid:v3.hl7.org:2.16.840.1.113883.6:66
-or-
urn:lsid:v3.hl7.org:2.16.840.1.113883.6.66:

It would be the responsibility of the authority, which is HL7 in the above example, to decide how the identifier is assigned. NOTE: the third form above is not a legal LSID as it exists today - see the proposal at the end of this section.
A potential drawback of this approach is that it may not generalize, as there is no guarantee that the namespaces of OID's, DN's, etc. will always be distinct.
urn:lsid:2.16.840.1.113883.6.66::
And the concept code for "malignant neoplasm" would be represented as
urn:lsid:2.16.840.1.113883.6.66:C0006826
urn:lsid:www.tortisecvs.org:software:FD09CEFE-3502-47d6-908D-F92428A27F64
Note that the "software" namespace is somewhat contrived, and perhaps should be omitted.
urn:lsid:org.iso::2.16.840.1.113883.6.56
or urn:lsid:org.iso:2.16.840.1.113883.6.56:
and urn:lsid:org.dce:FD09CEFE-3502-47d6-908D-F92428A27F64:
or urn:lsid:org.dce::FD09CEFE-3502-47d6-908D-F92428A27F64
This would allow other mixtures as described above. The decision as to whether the identifier represented a namespace or an object would be up to the original assigner.
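To see how the alternative forms above differ, it helps to split them into their components. The following Python sketch is a deliberately lenient parser written for illustration only: it assumes the `urn:lsid:authority:namespace:object[:revision]` layout and, unlike the current LSID specification, accepts an empty namespace or an empty object so that every form discussed here can be examined. The `Lsid` type and `parse_lsid` name are hypothetical.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    namespace: str   # may be empty in the forms proposed here
    object_id: str   # may be empty in the forms proposed here
    revision: str    # empty when no revision is given


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into components without judging which parts may be empty."""
    fields = text.split(":")
    if (len(fields) not in (5, 6)
            or fields[0].lower() != "urn"
            or fields[1].lower() != "lsid"
            or not fields[2]):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = fields[2:5]
    revision = fields[5] if len(fields) == 6 else ""
    return Lsid(authority, namespace, object_id, revision)


if __name__ == "__main__":
    for candidate in (
        "urn:lsid:org.iso::2.16.840.1.113883.6.56",   # OID as the object
        "urn:lsid:org.iso:2.16.840.1.113883.6.56:",   # OID as the namespace
        "urn:lsid:www.tortisecvs.org:software:FD09CEFE-3502-47d6-908D-F92428A27F64",
    ):
        print(parse_lsid(candidate))
```

The first two candidates carry the same OID but place it in different components, which is exactly the namespace-or-object decision left to the original assigner.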
2) State that the mapping between ISO OID's and DCE UUID's is the responsibility of the authority that is ultimately responsible for the objects represented by the identifiers.
The LSID document makes some reference to what might be included in the metadata branch. Page 11 states, "The Metadata_document is (usually) a string containing the metadata itself. It is considered out of the scope of this specification to restrict the number of formats the metadata can be returned in. The most popular and expected formats are, however, RDF and XMI." Page 10 asserts, "This method is used to return a document containing the metadata associated with a particular LSID at this particular data retrieval service. Note that this means that calling getMetadata on two different data services may yield different metadata since each service may contain different metadata about the same lsid."
"Metadata" covers a broad spectrum of meaning. There are at least three classes of metadata that would be useful to be able to retrieve along with an LSID:
We believe that it is necessary to begin to include at least baseline representations, if not requirements, for this information within an extended LSID specification. There is little value in being able to retrieve a stream of bytes associated with a specific identifier ten years in the future if the information about how the data has been encoded has been lost. Similarly, there is little use in being able to retrieve a sequence of nucleotides if one doesn't know what organism and locus it was derived from.
Recommendation 3 above discusses the ability to unambiguously identify concept codes, classes or terms from the corresponding classification scheme, ontology or terminology. At the moment there is no universally accepted approach to doing so. Existing approaches include:
None of the approaches described above are totally adequate. The LQS specification pre-dates the URN syntax and there is ambiguity about what constitutes the naming entity and what constitutes the local name. The UMLS approach requires a well-funded central authority which doesn't scale well in a distributed environment. The RDF Schema approach is XML-centric and says nothing about how namespace identifiers are assigned. None of the specifications address revisions in a meaningful way.
The LSID syntax and functional behavior seem to fit many of the baseline requirements of classification ids. The LSID specification mentions classes on page 8 - "An LSID usually represents a piece of data, but it is allowed to have LSIDs representing an(sic) abstract entities or concepts. ... If an LSID represents an abstract entity the LSID resolution service must always resolve an empty result". While this leaves the door open, we need to define a service that delivers more than an empty result and optional metadata.
The LSID syntax is a good fit with the requirements of biomedical OCTs. The resolution, resolution discovery and assigning services also provide a good fit if you treat "object" as synonymous with "class" or "concept". The advantage of including OCTs in the LSID scope statement would be a reduction in the number of interfaces and tools that would need to be produced. There are several disadvantages, including the risk of scope creep for the core LSID project and the fact that OCTs will need further specification when it comes to format, classification, and other services. We suggest that these issues be opened for discussion and further refinement.