Re: LSID: What's still needed to make it work within the semantic web? from Sean Martin on 2005-03-23 (public-semweb-lifesci@w3.org from March 2005)

From: Sean Martin <sjmm@us.ibm.com>
Date: Wed, 23 Mar 2005 10:03:05 -0500
To: <public-semweb-lifesci@w3.org>
Message-ID: <OF621DC19B.291F2549-ON85256FCD.0050891D-85256FCD.0052AE26@us.ibm.com>
Apologies for my tardy follow up to this thread but I have been out on 
vacation for a few weeks. In this reply I would like to address Eric 
Neumann's (EN>) original message of 14 March and at the same time include 
commentary on points raised by those who have already replied to it (Eric 
Jain = EJ>; Jim Myers=JM>; Xiaoshu Wang = XW>). 

EN>We  had some very productive discussions on the value of 
EN>the LSID specification at  the workshop in October, 
EN>and many of us would like to see it reach a  functional 
EN>conclusion. Much of the discussion was around what still 

I certainly agree and hope we can make something of the momentum that 
began to generate there.


EN>needs to be  done with the specification, so that LSID's 
EN>become a beneficial and practical  element of the life 
EN>science community. I would like to suggest those 
EN>interested  in seeing the LSID specification come to 
EN>completion, to participate in this  thread, and try and 
EN>define some critical next steps for its success in being 
EN>adopted by most data sources. 

In my view there are a number of items missing from the current LSID spec 
that need to be addressed and taken forward for standardization if its 
usefulness is to be fully realized. Eric has listed some of these and I 
hope to add a couple more in this reply.


EN>To  quickly review, LSID offers both a unique identifier 
EN>model for  authoritative life science data, and a 
EN>mechanism by which they can be  resolved to actual 
EN> (immutable) data bytes and meta-data (mutable). Some 
EN>lingering questions include: 

EN>What metadata accessible through LSID should be 
EN>standardized; this may be more about general info- 
EN>descriptive semantics like  Dublin Core and RSS, than 
EN>biological or chem semantics. A precise way to handle 
EN>versioning, derivation, some other relationship types 
EN>for provenance.

As Eric points out, the current specification provides a mechanism for the 
discovery and retrieval of metadata associated with data named by an LSID 
URI (URN) or metadata associated with an LSID that is conceptual (has 
nothing but metadata). However the spec. says nothing about what format 
that metadata should be in, let alone what semantics a program accessing 
it might expect to discover in the retrieved metadata. 

Certainly my group (who provide an open source implementation of the OMG's 
LSID standard) have happily settled on RDF as the format we are using for 
our own work and the tools and code we provide make this assumption, but 
it is not standard and there are other reasonable contenders like XMI and 
some of the ISO standards that need to be considered and probably 
accommodated.

We are also creating our own non-standard predicates and ontologies 
describing the relationships between objects listed in the metadata and 
their literal values.  Eric lists a few for areas like versioning, 
derivation and provenance. 

Off the top of my head I could add others to this list, for automated 
functionality like navigation, human readable display (hints to a semantic 
web browser and other software that must traverse the metadata) and useful 
information for the data transport system (like the size and MD5 hash of 
the object to which this metadata applies). Another vital area is the 
relationships describing the various formats and contexts available (this 
is related to versioning). For example information may be held in PDF, 
HTML and ASCII (available formats) or an image may be available in JPEG 
and TIFF (formats) or in different resolutions of JPEG (contexts) or even 
expressed/rendered using different image rendering algorithms (context). 

Without additional standards, it is impossible to write general software 
to automatically aid the user or programs in finding, displaying or 
reasoning on information in any but the crudest forms. 

We would be happy to work with any interested parties to prototype and 
develop these standards and we could start by offering up what we have 
already had to invent to get the ball rolling.


EN>Are  URN-aware resolvers an acceptable means for data 
EN>retrieval for all members of  the life science 
EN>community? Are there any alternatives that are simpler? 

EN>Guidelines for encoding data for common bioinformatics 
EN>data types in LSID; are we all clear what is data and 
EN>what  is metadata?

It looks like the NCI's caBIG movement is likely to adopt the LSID for 
providing data identity amongst the participating cancer community. They 
have provided some excellent use cases and will be looking to extend 
future versions of the LSID standard. One thing that they would really 
like to see in future versions of the LSID specification is the addition 
of immutable metadata (as well as mutable). This strikes me as a very good 
idea as it solves a number of problems with implementation and will help 
implementers to more easily decide what data is and what metadata is. 
Incidentally, they also believe that there needs to be a starting set of 
standards for what one might expect in metadata and will be pushing hard 
to see this achieved. 
 

EN>Would this include all kinds of RDF graphs that relate 
EN>to the original data item? 

EN>Do we need best practices on utilizing common 
EN>ontologies such as GO within a data entry? 

Yes, this seems to me to be the logical next step. What can we do to get 
this process under way? 
Is this something that the W3C could lead?

EN>How  to specify Dynamic data (latest version) 
EN>effectively (minimal http calls of  LSIDs) 

A tough one unfortunately, as one of the things that many find useful 
about LSIDs is the fact that the name often represents immutable, byte 
identical data. At the moment those that want to provide dynamic data have 
two options. One is to code the changeable portion of the data as RDF and 
provide it as metadata. The other is to provide an LSID without data which 
represents the changing data. Metadata associated with this LSID lets the 
client know the names of LSIDs for the latest versions of the changing 
data -- these LSIDs can be generated on the fly as the metadata is served 
up (perhaps using a timestamp or a version number to differentiate the 
LSID which has directly associated data from the abstract LSID that only 
has metadata). 
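
To make the second option concrete, here is a minimal sketch in Python. It 
assumes the usual LSID layout urn:lsid:authority:namespace:object with an 
optional trailing revision. The authority example.org, the namespace, the 
object id, the helper names and the dictionary keys are all invented for 
illustration, and the dictionary merely stands in for real metadata (RDF 
in our case). It is not taken from any existing LSID software.

```python
from datetime import datetime, timezone


def make_lsid(authority, namespace, obj, revision=None):
    """Build an LSID string; the revision component is optional."""
    parts = ["urn", "lsid", authority, namespace, obj]
    if revision is not None:
        parts.append(str(revision))
    return ":".join(parts)


def latest_version_metadata(authority, namespace, obj, now=None):
    """Return stand-in metadata for an abstract (data-less) LSID.

    The abstract LSID names the changing record. The concrete LSID,
    minted on the fly with a timestamp as its revision, names one
    immutable snapshot of that record.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    abstract = make_lsid(authority, namespace, obj)
    concrete = make_lsid(authority, namespace, obj, revision=stamp)
    # Hypothetical keys; a real service would express this as metadata
    # predicates that still need to be standardized.
    return {"lsid": abstract, "has_data": False, "latest_version": concrete}


meta = latest_version_metadata(
    "example.org", "records", "P12345",
    now=datetime(2005, 3, 23, 15, 3, 5, tzinfo=timezone.utc),
)
print(meta["lsid"])
print(meta["latest_version"])
```

Minting the concrete name costs one string operation, whether or not 
anyone ever asks for the bytes behind it.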

Some think this last solution has a problem as it causes an "explosion" of 
issued LSIDs, but I disagree. There never is a guarantee that a 
dereferenced LSID will be able to provide the data it names on demand 
either now or in the future. Using an LSID as a name guarantees only that 
the name will be unique (i.e. never reused for naming any other bytes). It 
does _not_ guarantee persistence of the data named nor that the authority 
that issued the LSID will always provide a copy of that data on demand. If 
they can and want to do so that is great, but in many cases it will not be 
practical. So my response to people worrying about this "explosion" is "so 
what -- it's just a name and manipulating a few bytes in a string is more or 
less free." 
 

EN>I hope  other members of the LSID specification are able 
EN>to participate on this thread,  to help clarify the 
EN>issues, and identify where most value can be gained. 

Ditto :-)

EJ>The web service stuff that is part of the current 
EJ>specification adds a lot of complexity. This is not to 
EJ>say there wouldn't be any use for the web services 
EJ>approach, but in my opinion it shouldn't be part of the 
EJ>core specification. The availability of a simple, 
EJ>RESTful solution (based on HTTP redirection) would 
EJ>almost certainly improve adoption.

As one of those involved in putting together the specification, I 
respectfully disagree with Eric on this. The solution devised had to take 
in a great many requests for functionality and it was clear quite early on 
that simple http redirection would not be enough. Things considered 
included the provision of data and metadata for a single LSID name; that 
the data and metadata potentially be available from multiple sources; that 
multiple protocols be offered to the clients -- together with the 
possibility of retrieving only sections of large data blobs, etc. Once you 
start adding this list up it begins to get complex as one looks for a 
simple way to communicate this information to a requesting client. 

Certainly we could have perhaps invented a whole list of extensions to 
HTTP redirect to cover all these cases, but then we would be inventing a 
great deal out of thin air. At the same time we could see how the 
WSDL standard would cover all the cases we were worried about and the 
software for this was already written. Just to put this all in 
perspective, remember that the retrieval of the WSDL describing the end 
points for retrieving the data and metadata is the only place that the web 
service stuff is actually necessary. The end points listed  there are 
usually plain old HTTP or FTP URLs and then we are back to the plain old 
web. 
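
The two-step shape of this can be sketched in Python without touching the 
network. Everything here is a hypothetical stand-in: the Endpoint type, 
the lookup table playing the role of the WSDL, and the example.org URLs 
are assumptions made for illustration, not the actual resolver interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    kind: str       # "data" or "metadata"
    protocol: str   # e.g. "http" or "ftp"
    url: str


# Hypothetical stand-in for what an authority's WSDL might advertise
# for one LSID: several sources and protocols for data and metadata.
SERVICE_DESCRIPTIONS = {
    "urn:lsid:example.org:records:P12345": [
        Endpoint("data", "http", "http://data.example.org/P12345"),
        Endpoint("data", "ftp", "ftp://mirror.example.org/P12345"),
        Endpoint("metadata", "http", "http://meta.example.org/P12345"),
    ],
}


def describe(lsid):
    """Step 1: the only step that needs the web-service machinery."""
    return SERVICE_DESCRIPTIONS[lsid]


def choose_endpoint(endpoints, kind, preferred=("http", "ftp")):
    """Step 2: pick a plain URL of the wanted kind, by protocol preference."""
    for protocol in preferred:
        for ep in endpoints:
            if ep.kind == kind and ep.protocol == protocol:
                return ep
    raise LookupError(f"no {kind} endpoint over {preferred}")


endpoints = describe("urn:lsid:example.org:records:P12345")
print(choose_endpoint(endpoints, "data").url)
print(choose_endpoint(endpoints, "data", preferred=("ftp",)).url)
print(choose_endpoint(endpoints, "metadata").url)
```

After the second step the client holds an ordinary URL and fetches it 
with whatever HTTP or FTP library it already uses.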

As to your point on adoption, I agree that at first sight it is not as 
simple as one might hope, but consider that providing only the simplest 
functionality leaves us with something that is not too useful either, 
removing the incentive to adoption. I think we are far enough along now 
that the server and client software stacks remove most of the complexity 
for implementers. With luck the balance between usefulness and complexity 
is reasonable. Certainly compared to software stacks & protocols I see 
being invented these days, the one used for the LSID resolution is a very 
modest accumulation of technologies that preceded it.


XW>In MHO, what differs LSID from a simple URN is its 
XW>coupling of name with a protocol.  This sort 
XW>of "resolve" the issues  of Identity crisis in RDF 
XW>because we can ask if a resource is available and in 
XW>what dimensionality.  For instance, if a LSID is used to 
XW>represent a gel,  should it be presented in image (what 
XW>format though) or XML, RDF  etc?

People in caBIG have a number of ideas about doing transforms that take 
LSIDs as one parameter, but this thinking is still in the early stages and 
I am not sure that I understand it yet. If you are interested in this I 
suggest you keep an eye on their identifiers mailing list.

One thing to remember is that an individual LSID always represents/names 
metadata OR metadata AND a static binary object OR just a static binary 
object. To do what you want today, one would just code the potentially 
available dimensionality you talk about into an LSID which has just 
metadata. This metadata would contain LSIDs to the various available 
dimensions and of course these would be names of the actual data blobs 
that can be retrieved. Client software would be able to take the first LSID 
and use the metadata to present user software with the available 
dimensions from which the correct one can be picked.
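
As a rough illustration in Python, assuming a plain dictionary in place 
of real metadata: the LSIDs, the "representations" key and the helper 
names below are invented for the example, and the format labels are 
ordinary MIME types chosen for convenience, not anything the spec 
mandates.

```python
# Hypothetical metadata for an abstract LSID naming a gel. The LSID
# itself has no bytes; its metadata lists the LSIDs of the concrete,
# immutable representations that can actually be retrieved.
GEL_METADATA = {
    "lsid": "urn:lsid:example.org:gels:42",
    "representations": [
        {"lsid": "urn:lsid:example.org:gels:42-tiff", "format": "image/tiff"},
        {"lsid": "urn:lsid:example.org:gels:42-jpeg", "format": "image/jpeg"},
        {"lsid": "urn:lsid:example.org:gels:42-xml", "format": "application/xml"},
    ],
}


def available_formats(metadata):
    """List the formats a client could offer to the user."""
    return [rep["format"] for rep in metadata["representations"]]


def pick_representation(metadata, wanted_format):
    """Return the LSID of the representation in the wanted format, if any."""
    for rep in metadata["representations"]:
        if rep["format"] == wanted_format:
            return rep["lsid"]
    return None


print(available_formats(GEL_METADATA))
print(pick_representation(GEL_METADATA, "image/jpeg"))
print(pick_representation(GEL_METADATA, "application/pdf"))
```

The client resolves only the LSID it picked, so the abstract name never 
has to serve any bytes itself.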


XW>2) The distinction of metadata and data is always 
XW>arbitrary.  I think it needs to be clarified. I would 
XW>like to make a suggestion  here. Metadata is the data 
XW>presented  in RDF and Data is  otherwise.
 
It is my view that many will get to this pretty quickly. As I mentioned 
earlier, the caBIG folks want to see a means for distinction between 
dynamic and static metadata added to the standard. Adding this removes the 
only wrinkle that I can currently see preventing this now -- that is for 
those (like us) that have decided to standardize on RDF as the metadata 
format.


XW>3) By the way, what is the status of LSID.  I  think it 
XW>is a great idea and would like to contribute.  But if I 
XW>google "LSID", it leads to I3C. From I3C, it links 
XW>to OMG.  What  worries me is that first, I3C has too 
XW>many broken and that worries  me.  And second if I go to 
XW>OMG, and search for LSID, the result is empty.  :-(

The I3C is defunct as far as I can tell. The LSID standard was established 
by the Life Sciences group at the OMG and the first version is now an 
available standard. You can read it here 
http://www.omg.org/docs/dtc/04-05-01.pdf  There is quite a lot of interest 
in a revision already both to clean up one or two things and to work on 
extensions that incorporate additional features. I will post here when 
discussions start on the mail list about this.

 
JM>The "LSID" name: 
JM>Are life science identifiers  different enough that they 
JM>need to be treated separately? Do we then need a 
JM>physical science identifier, a computer science 
JM>identifier, etc.?

There is always the temptation to go for the one size fits all approach 
and this was discussed quite often by I3C participants, since it was clear 
that the LSID was a fairly general mechanism. However in the end it was 
decided wiser to limit the scope to the domains understood by the people 
present since it was impossible to say what use cases from other domains 
might add and what features or social contracts their identifiers might 
need. In the same way that it was decided that the URL or HTTP URI did not 
do it for Life Sciences, it was felt that nobody involved could know that 
LSIDs would do it for everyone else either.
 

JM>LSID as a protocol as well  as a name: 
JM>Similar issue, but one that  can also be described as 
JM>death-by-plugins - if everyone who wants to control a 
JM>namespace for identifiers makes a new protocol requiring 
JM>a plug-in... 

More likely in my opinion is that to start with we may get a small number 
of conventions, standards and accompanying software. However this will 
shake out till we are left with the very few that cover enough cases 
between them and are well supported enough by software that every 
community is happy to reuse them. Don't underestimate the difficulty any 
group has in establishing any kind of identifier. In my own experience it 
is hard work and takes a great deal of time and resources to get adoption - 
this alone will likely restrict the numbers.
 
JM>Persistence policy as part  of the name/protocol: 
JM>Is persistence such a unique  and overriding piece of 
JM>metadata that it should be part of the name and/or 
JM>require a separate protocol? Does the name of data 
JM>change when a researcher  decides it is valid and should 
JM>be kept forever? There seem to be problems  analogous to 
JM>the 'don't encode location in the name because it might 
JM>move'  issue.

It is worth noting again here that the issuing of an LSID does not 
necessarily denote persistence of the object. It does offer a guarantee 
that no other object will ever share the same LSID name which helps to 
promote persistence if a data provider thinks this useful or if a third 
party takes a copy of the object and undertakes to persist it for a 
period.


JM>the issues above could  limit growth and lead to 
JM>fragmentation of the community as it raises awareness 
JM>of what globally unique IDs can do and encourages other 
JM>"my community's ID"  protocols, and/or modifications 
JM>that attempt to get around the issues noted  above. Will 
JM>chemists all adopt LSID simply because some of the 
JM>molecules they  work on are related to biology rather 
JM>than materials science? Will a  pharmaceutical company 
JM>adopt LSID for data with retention schedules? 

There is no doubt that the lack of an agreed globally unique identifier 
was standing in the way of progress in the Life Sciences field. I guess 
the choice was to stand by and wait for some neutral cross industry group 
to eventually come up with something that fit the bill for all groups well 
enough to use, or to start an effort to make something that would work 
immediately. Perhaps in time we will get something global that will work 
for everyone, but in the meantime some amount of progress can be made 
on the things we are really all here to do.

 
JM>3) The non-http URI  approach requires an extra level of 
JM>infrastructure for resolving objects. For  use in 
JM>browsers this requires an additional plug-in. There seem 
JM>to be very few  available; and then only on certain 
JM>browsers. Further I don't think many  realize that 
JM>browsers are perhaps 1/10th of the applications that 
JM>follow links  (e.g. robots, etc. and this is a different 
JM>issue completely. One the DOI /  publishers are 
JM>unfortunately finding out at this very  moment).
JM>A Handle-style proxy  mechanism helps a bit here, but it 
JM>is certainly not as clean/clear as  specifying HTTP 
JM>redirect as *the* resolution mechanism.

HTTP to LSID resolution protocol proxies are starting to exist already. 
For example the one at http://lsid.biopathways.org/resolver/ and I believe 
the code for it is available as open source.
 

JM>5) The LSID community has  socially agreed that the use 
JM>of LSID will point to an immutable resource -  the 
JM>thing one points at will be the same 5, 10, n years 
JM>later.  How can this be enforced socially or 
JM>technically? What's the penalty for reusing an LSID? If 
JM>the LSID, bits to  persist, and the hash are all owned 
JM>by one organization, the bits and hash  could be changed 
JM>together. 

Apart from the social convention (there was always an ambiguity regarding 
URL as URI and persistence that made it easy to 'abuse') being clearly 
defined, one aspect that might help will be in the adoption of metadata 
standards that amongst many other things can describe hashes of named 
objects and prove integrity. Software used to serve and process LSID named 
data will use this information to validate data on transfer and in third 
party caches. A byproduct of this will be an enforcement of the social 
convention that named objects never change since any changes would risk 
seeing the changed data objects marked as invalid by the network layers. 
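
A minimal sketch in Python of the kind of check I have in mind. The 
metadata keys "size" and "md5" and the function name are assumptions 
made for illustration, since no such metadata standard exists yet. MD5 
is used only because it is the hash mentioned earlier in this message; 
a stronger digest could be swapped in the same way.

```python
import hashlib


def verify(data, metadata):
    """Return True if the bytes match the size and MD5 recorded in metadata."""
    if len(data) != metadata["size"]:
        return False
    return hashlib.md5(data).hexdigest() == metadata["md5"]


# Metadata published when the LSID was issued (hypothetical keys).
original = b"ACGTACGT"
metadata = {"size": len(original), "md5": hashlib.md5(original).hexdigest()}

print(verify(original, metadata))      # True: bytes are unchanged
print(verify(b"ACGTACGA", metadata))   # False: same size, different bytes
print(verify(b"ACGT", metadata))       # False: size differs
```

A cache or client that ran this check on every transfer would reject 
bytes that were altered after the name was issued, provided it still 
holds the hash that was published first.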


JM>need to name/expose both the individual versions and the 
JM>'latest' version,  whatever number that currently is, 
JM>which means bit-level persistence will  probably not 
JM>meet all life-science needs, which may lead to 'abuse' 
JM>of LSIDs  with 0-byte data to refer to things with 
JM>dynamics.

It is my opinion that this does not actually qualify as abuse since it 
provides a solution that is both useful and meets the standard in every 
respect. What is required to make it workable are the associated metadata 
ontologies. I would be most interested to read contrary opinion and 
reasoning.
 
 

Kindest regards, Sean

--
Sean Martin
IBM Corp.
Received on Wednesday, 23 March 2005 15:03:39 UTC