Re: Unique ID options from Chimezie Ogbuji on 2007-01-29 (public-semweb-lifesci@w3.org from January 2007)

From: Chimezie Ogbuji <ogbujic@bio.ri.ccf.org>
Date: Mon, 29 Jan 2007 10:20:09 -0500 (EST)
To: samwald@gmx.at
cc: "Forsberg, Kerstin L" <Kerstin.L.Forsberg@astrazeneca.com>, public-semweb-lifesci@w3.org
Message-ID: <Pine.GSO.4.60.0701290916120.29468@joplin.bio.ri.ccf.org>

On Mon, 29 Jan 2007 samwald@gmx.at wrote:
>> How to uniquely identify such information resources, i.e. the
>> recordings of clinical acts of observations ?
>> Spontaneously we assigned concatenated identifiers. e.g.
>> http://clinic.com/study/T2271/subject/S83221/observation/O6561
>> Is this current best practice for unique identification schemas in
>> the HCLS community?
>
> It is a common practice, and surely not a bad one, but I don't think it could be called a 'best practice' either. Some purists discourage it, but it does have many practical advantages. While the URIs of RDF graphs are not intended to be read by humans, they are still often visible to the user, and a readable URI can be helpful on some occasions. It also makes it much easier to develop and debug Semantic Web applications or datasets this way.

Well, I'm inclined to ask why URIs should be explicitly assigned at all.
Isn't the intention to say something to the effect of:

   Some recording of a clinical observation was made and here is the information about it

As much trouble as BNodes (existential variables) cause the RDF
applications that use them, I think they cover this scenario pretty well.
I think explicit URIs should only be used if there is an authoritative /
centralized identification scheme and there are good reasons (beyond
maintaining uniqueness) to maintain them over the life of the record.  If
there is no real precedent or reason to explicitly generate URIs (other
than collision avoidance), then it is best left to a skolem function [1]
to enforce uniqueness and to let an existential variable capture the intent.
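
To make the skolem function idea concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: the
namespace URI, the function name `skolemize` and the idea of keying on
(graph name, BNode label) are all hypothetical and not taken from any
particular RDF store.

```python
import uuid

# Hypothetical namespace; any stable URI controlled by the publisher would do.
SKOLEM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/skolem/")


def skolemize(graph_name: str, bnode_label: str) -> str:
    """Map a BNode, scoped to the graph it occurs in, to a constant URI.

    The result is a deterministic function of its arguments, so the same
    BNode in the same graph always gets the same URI, while the same
    label in a different graph gets a different one.
    """
    key = f"{graph_name}#{bnode_label}"
    return uuid.uuid5(SKOLEM_NAMESPACE, key).urn


# Assumed example inputs: two graphs that both happen to use the label "b0".
a = skolemize("http://example.org/records/1", "b0")
b = skolemize("http://example.org/records/1", "b0")
c = skolemize("http://example.org/records/2", "b0")
```

Here `a` and `b` are the same URI and `c` is a different one, which is
all the uniqueness guarantee amounts to; nobody had to design an
identifier scheme by hand.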

Consider that most current medical / patient records are primarily
time-oriented and rely on a time scale for disambiguation, so there
wouldn't be any need to say:

    [ a cpr:patient-record ]
      dol:part <urn:uuid:medical-record-db-20/26655>.
      <urn:uuid:medical-record-db-20/26655> a cpr:clinical-description.

When you can say

    [ a cpr:patient-record ]
      dol:part [ a cpr:clinical-description ].

And allow the underlying RDF store to manage the (temporary) identifiers
that you shouldn't need to manage yourself.
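
As a rough sketch of what "letting the store manage the identifiers"
means, the toy Python class below hands out temporary BNode labels
itself. `TinyStore`, `new_bnode` and `add` are hypothetical names made up
for this illustration; a real RDF store would do the equivalent
internally when it parses the snippet above.

```python
import itertools


class TinyStore:
    """Toy in-memory triple store that allocates its own BNode labels."""

    def __init__(self):
        self._counter = itertools.count()
        self.triples = []

    def new_bnode(self) -> str:
        # The label is local to this store and carries no meaning outside it.
        return f"_:b{next(self._counter)}"

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.append((subject, predicate, obj))


store = TinyStore()
record = store.new_bnode()
description = store.new_bnode()

# Same statements as the BNode version above; no URI was chosen by the author.
store.add(record, "rdf:type", "cpr:patient-record")
store.add(record, "dol:part", description)
store.add(description, "rdf:type", "cpr:clinical-description")
```

The author of the data only states that a record exists and has a
clinical description as a part; the labels are an implementation detail
of the store.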

[1] http://en.wikipedia.org/wiki/Skolemization

>
> Personally, so far I have mainly tried to use an algorithm to generate new URIs based on information in the datasource, like in the example you gave. However, depending on the nature of the original data, this can easily lead to conflicts and non-unique URIs. It is hard to circumvent this problem with some datasets.
> When you are converting a datasource that already has an identifier (e.g. PMIDs in the case of Pubmed abstracts), these are of course a better choice.
> Personally, I do not think that URIs for non-information-resources need to be resolvable through HTTP or a similar mechanism, so I circumvent many problems associated with that.
>
> Another approach that I have never tried, but that I think is worth thinking about, is to simply rely on large random numbers. Of course, there are use cases where even a minuscule possibility of non-uniqueness is unacceptable, like in clinical patient records, but in a lot of use cases it is acceptable.

At this point aren't you essentially taking on the burden of skolemization?
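
For a sense of scale on the "large random numbers" option, here is a
small Python sketch using the standard birthday-bound approximation. The
figures are assumptions chosen for illustration: 122 random bits (what a
version 4 UUID carries) and one billion generated identifiers.

```python
import math
import uuid


def collision_probability(n_ids: int, random_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    space = 2.0 ** random_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


# Assumed workload: one billion identifiers with 122 random bits each.
p = collision_probability(10**9, 122)

# A random identifier of that kind, minted by hand instead of by the store.
observation_uri = uuid.uuid4().urn
```

Under these assumptions `p` is on the order of 1e-19. Whether that is
acceptable is a policy question, but either way the code minting
`observation_uri` is doing the skolem function's job by hand.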

Chimezie Ogbuji
Lead Systems Analyst
Thoracic and Cardiovascular Surgery
Cleveland Clinic Foundation
9500 Euclid Avenue/ W26
Cleveland, Ohio 44195
Office: (216)444-8593
ogbujic@ccf.org

Received on Monday, 29 January 2007 15:31:21 UTC