- From: Sean Martin <sjmm@us.ibm.com>
- Date: Wed, 23 Mar 2005 10:03:05 -0500
- To: <public-semweb-lifesci@w3.org>
- Message-ID: <OF621DC19B.291F2549-ON85256FCD.0050891D-85256FCD.0052AE26@us.ibm.com>
Apologies for my tardy follow-up to this thread, but I have been out on vacation for a few weeks. In this reply I would like to address Eric Neumann's (EN>) original message of 14 March and at the same time include commentary on points raised by those who have already replied to it (Eric Jain = EJ>; Jim Myers = JM>; Xiaoshu Wang = XW>).

EN>We had some very productive discussions on the value of
EN>the LSID specification at the workshop in October,
EN>and many of us would like to see it reach a functional
EN>conclusion. Much of the discussion was around what still

I certainly agree and hope we can make something of the momentum that began to build there.

EN>needs to be done with the specification, so that LSID's
EN>become a beneficial and practical element of the life
EN>science community. I would like to suggest those
EN>interested in seeing the LSID specification come to
EN>completion, to participate in this thread, and try and
EN>define some critical next steps for its success in being
EN>adopted by most data sources.

In my view there are a number of items missing from the current LSID spec that need to be addressed and taken forward for standardization if its usefulness is to be fully realized. Eric has listed some of these and I hope to add a couple more in this reply.

EN>To quickly review, LSID offers both a unique identifier
EN>model for authoritative life science data, and a
EN>mechanism by which they can be resolved to actual
EN>(immutable) data bytes and meta-data (mutable). Some
EN>lingering questions include:
EN>What metadata accessible through LSID should be
EN>standardized; this may be more about general info-
EN>descriptive semantics like Dublin Core and RSS, than
EN>biological or chem semantics. A precise way to handle
EN>versioning, derivation, some other relationship types
EN>for provenance.
As Eric points out, the current specification provides a mechanism for the discovery and retrieval of metadata associated with data named by an LSID URI (URN), or metadata associated with an LSID that is conceptual (has nothing but metadata). However, the spec says nothing about what format that metadata should be in, let alone what semantics a program accessing it might expect to discover in the retrieved metadata. Certainly my group (who provide an open source implementation of the OMG's LSID standard) have happily settled on RDF as the format we are using for our own work, and the tools and code we provide make this assumption, but it is not standard, and there are other reasonable contenders like XMI and some of the ISO standards that need to be considered and probably accommodated.

We are also creating our own non-standard predicates and ontologies describing the relationships between objects listed in the metadata and their literal values. Eric lists a few for areas like versioning, derivation and provenance. Off the top of my head I could add others to this list: for automated functionality like navigation, for human readable display (hints to a semantic web browser and other software that must traverse the metadata), and useful information for the data transport system (like the size and MD5 hash of the object to which this metadata applies).

Another vital area is the relationships describing the various formats and contexts available (this is related to versioning). For example, information may be held in PDF, HTML and ASCII (available formats), or an image may be available in JPEG and TIFF (formats), or in different resolutions of JPEG (contexts), or even expressed/rendered using different image rendering algorithms (context). Without additional standards, it is impossible to write general software to automatically aid the user or programs in finding, displaying or reasoning on information in any but the crudest forms.
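To make the formats/contexts point concrete, here is a minimal Python sketch of what a client could do if metadata carried agreed fields for format, context, size and MD5 hash. Everything in it is an assumption made for illustration: the field names, the `example.org` authority, the LSID values and the placeholder bytes are hypothetical, and nothing here comes from the LSID specification, which defines no such metadata vocabulary today.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Representation:
    """One concrete rendering of a conceptual object (hypothetical schema)."""
    lsid: str          # LSID naming the immutable bytes
    media_format: str  # e.g. "image/jpeg" (the "format")
    context: str       # e.g. a resolution or rendering algorithm (the "context")
    size_bytes: int    # transport hint
    md5_hex: str       # transport hint


def describe(lsid: str, media_format: str, context: str, data: bytes) -> Representation:
    """Build a metadata entry, deriving size and MD5 from the bytes themselves."""
    return Representation(lsid, media_format, context,
                          len(data), hashlib.md5(data).hexdigest())


def pick_representation(reps: Iterable[Representation],
                        wanted_format: str,
                        wanted_context: Optional[str] = None) -> Optional[Representation]:
    """Return the first entry matching the format (and context, if given)."""
    for rep in reps:
        if rep.media_format != wanted_format:
            continue
        if wanted_context is None or rep.context == wanted_context:
            return rep
    return None


# Placeholder bytes stand in for real image data.
REPRESENTATIONS = [
    describe("urn:lsid:example.org:gel:42-jpeg-72dpi", "image/jpeg", "72dpi", b"low-res jpeg"),
    describe("urn:lsid:example.org:gel:42-jpeg-300dpi", "image/jpeg", "300dpi", b"high-res jpeg"),
    describe("urn:lsid:example.org:gel:42-tiff", "image/tiff", "300dpi", b"tiff bytes"),
]

if __name__ == "__main__":
    chosen = pick_representation(REPRESENTATIONS, "image/jpeg", "300dpi")
    print(chosen.lsid if chosen else "no matching representation")
```

Without shared predicates for fields like these, each client would need provider-specific code to make the same choice, which is the gap described above.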
We would be happy to work with any interested parties to prototype and develop these standards, and to get the ball rolling we could start by offering up what we have already had to invent.

EN>Are URN-aware resolvers an acceptable means for data
EN>retrieval for all members of the life science
EN>community? Are there any alternatives that are simpler?
EN>Guidelines for encoding data for common bioinformatics
EN>data types in LSID; are we all clear what is data and
EN>what is metadata?

It looks like the NCI's caBIG movement is likely to adopt the LSID for providing data identity amongst the participating cancer community. They have provided some excellent use cases and will be looking to extend future versions of the LSID standard. One thing that they would really like to see in future versions of the LSID specification is the addition of immutable metadata (as well as mutable). This strikes me as a very good idea, as it solves a number of problems with implementation and will help implementers to decide more easily what is data and what is metadata. Incidentally, they also believe that there needs to be a starting set of standards for what one might expect in metadata, and they will be pushing hard to see this achieved.

EN>Would this include all kinds of RDF graphs that relate
EN>to the original data item?
EN>Do we need best practices on utilizing common
EN>ontologies such as GO within a data entry?

Yes, this seems to me to be the logical next step. What can we do to get this process under way? Is this something that the W3C could lead?

EN>How to specify Dynamic data (latest version)
EN>effectively (minimal http calls of LSIDs)

A tough one unfortunately, as one of the things that many find useful about LSIDs is the fact that the name often represents immutable, byte-identical data. At the moment those that want to provide dynamic data have two options. One is to code the changeable portion of the data as RDF and provide it as metadata.
The other is to provide an LSID without data which represents the changing data. Metadata associated with this LSID lets the client know the names of LSIDs for the latest versions of the changing data. These LSIDs can be generated on the fly as the metadata is served up (perhaps using a timestamp or a version number to differentiate the LSID which has directly associated data from the abstract LSID that only has metadata).

Some think this last solution has a problem as it causes an "explosion" of issued LSIDs, but I disagree. There never is a guarantee that a dereferenced LSID will be able to provide the data it names on demand, either now or in the future. Using an LSID as a name guarantees only that the name will be unique (i.e. never reused for naming any other bytes). It does _not_ guarantee persistence of the data named, nor that the authority that issued the LSID will always provide a copy of that data on demand. If they can and want to do so that is great, but in many cases it will not be practical. So my response to people worrying about this "explosion" is "so what - it's just a name, and manipulating a few bytes in a string is more or less free."

EN>I hope other members of the LSID specification are able
EN>to participate on this thread, to help clarify the
EN>issues, and identify where most value can be gained.

Ditto :-)

EJ>The web service stuff that is part of the current
EJ>specification adds a lot of complexity. This is not to
EJ>say there wouldn't be any use for the web services
EJ>approach, but in my opinion it shouldn't be part of the
EJ>core specification. The availability of a simple,
EJ>RESTful solution (based on HTTP redirection) would
EJ>almost certainly improve adoption.

As one of those involved in putting together the specification, I respectfully disagree with Eric on this. The solution devised had to take in a great many requests for functionality, and it was clear quite early on that simple HTTP redirection would not be enough.
Things considered included the provision of both data and metadata for a single LSID name; that the data and metadata potentially be available from multiple sources; that multiple protocols be offered to the clients; and the possibility of retrieving only sections of large data blobs. Once you start adding this list up it begins to get complex, as one looks for a simple way to communicate this information to a requesting client. Certainly we could perhaps have invented a whole list of extensions to HTTP redirect to cover all these cases, but then we would be inventing a great deal out of thin air. At the same time we could see how the WSDL standard would cover all the cases we were worried about, and the software for this was already written.

Just to put this all in perspective, remember that the retrieval of the WSDL describing the end points for retrieving the data and metadata is the only place that the web service stuff is actually necessary. The end points listed there are usually plain old HTTP or FTP URLs, and then we are back to the plain old web.

As to your point on adoption, I agree that at first sight it is not as simple as one might hope, but consider that providing only the simplest functionality leaves us with something that is not too useful either, removing the incentive to adopt. I think we are far enough along now that the server and client software stacks remove most of the complexity for implementers. With luck the balance between usefulness and complexity is reasonable. Certainly, compared to the software stacks and protocols I see being invented these days, the one used for LSID resolution is a very modest accumulation of technologies that preceded it.

XW>In MHO, what differs LSID from a simple URN is its
XW>coupling of name with a protocol. This sort
XW>of "resolve" the issues of Identity crisis in RDF
XW>because we can ask if a resource is available and in
XW>what dimensionality.
XW>For instance, if an LSID is used to
XW>represent a gel, should it be presented in image (what
XW>format though) or XML, RDF etc?

People in caBIG have a number of ideas about doing transforms that take LSIDs as one parameter, but this thinking is still in the early stages and I am not sure that I understand it yet. If you are interested in this I suggest you keep an eye on their identifiers mailing list.

One thing to remember is that an individual LSID always represents/names metadata OR metadata AND a static binary object OR just a static binary object. To do what you want today, one would just code the potentially available dimensionality you talk about into an LSID which has just metadata. This metadata would contain LSIDs for the various available dimensions, and of course these would be the names of the actual data blobs that can be retrieved. Client software would be able to take the first LSID and use the metadata to present user software with the available dimensions, from which the correct one can be picked.

XW>2) The distinction of metadata and data is always
XW>arbitrary. I think it needs to be clarified. I would
XW>like to make a suggestion here. Metadata is the data
XW>presented in RDF and Data is otherwise.

It is my view that many will get to this pretty quickly. As I mentioned earlier, the caBIG folks want to see a means of distinguishing between dynamic and static metadata added to the standard. Adding this removes the only wrinkle that I can currently see preventing this now -- that is, for those (like us) that have decided to standardize on RDF as the metadata format.

XW>3) By the way, what is the status of LSID. I think it
XW>is a great idea and would like to contribute. But if I
XW>google "LSID", it leads to I3C. From I3C, it links
XW>to OMG. What worries me is that first, I3C has too
XW>many broken and that worries me. And second if I go to
XW>OMG, and search for LSID, the result is empty. :-(

The I3C is defunct as far as I can tell.
The LSID standard was established by the Life Sciences group at the OMG and the first version is now an available standard. You can read it here:

http://www.omg.org/docs/dtc/04-05-01.pdf

There is quite a lot of interest in a revision already, both to clean up one or two things and to work on extensions that incorporate additional features. I will post here when discussions about this start on the mailing list.

JM>The "LSID" name:
JM>Are life science identifiers different enough that they
JM>need to be treated separately? Do we then need a
JM>physical science identifier, a computer science
JM>identifier, etc.?

There is always the temptation to go for the one-size-fits-all approach, and this was discussed quite often by I3C participants, since it was clear that the LSID was a fairly general mechanism. However, in the end it was decided wiser to limit the scope to the domains understood by the people present, since it was impossible to say what use cases from other domains might add and what features or social contracts their identifiers might need. In the same way that it was decided that the URL or HTTP URI did not do it for Life Sciences, it was felt that nobody involved could know that LSIDs would do it for everyone else either.

JM>LSID as a protocol as well as a name:
JM>Similar issue, but one that can also be described as
JM>death-by-plugins - if everyone who wants to control a
JM>namespace for identifiers makes a new protocol requiring
JM>a plug-in...

More likely in my opinion is that to start with we may get a small number of conventions, standards and accompanying software. However, this will shake out until we are left with the very few that cover enough cases between them, and are well enough supported by software, that every community is happy to reuse them. Don't underestimate the difficulty any group has in establishing any kind of identifier;
in my own experience it is hard work and takes a great deal of time and resources to get adoption - this alone will likely restrict the numbers.

JM>Persistence policy as part of the name/protocol:
JM>Is persistence such a unique and overriding piece of
JM>metadata that it should be part of the name and/or
JM>require a separate protocol? Does the name of data
JM>change when a researcher decides it is valid and should
JM>be kept forever? There seem to be problems analogous to
JM>the 'don't encode location in the name because it might
JM>move' issue.

It is worth noting again here that the issuing of an LSID does not necessarily denote persistence of the object. It does offer a guarantee that no other object will ever share the same LSID name, which helps to promote persistence if a data provider thinks this useful, or if a third party takes a copy of the object and undertakes to persist it for a period.

JM>the issues above could limit growth and lead to
JM>fragmentation of the community as it raises awareness
JM>of what globally unique IDs can do and encourages other
JM>'my community's ID' protocols, and/or modifications
JM>that attempt to get around the issues noted above. Will
JM>chemists all adopt LSID simply because some of the
JM>molecules they work on are related to biology rather
JM>than materials science? Will a pharmaceutical company
JM>adopt LSID for data with retention schedules?

There is no doubt that the lack of an agreed globally unique identifier was standing in the way of progress in the Life Sciences field. I guess the choice was to stand by and wait for some neutral cross-industry group to eventually come up with something that fit the bill for all groups well enough to use, or to start an effort to make something that would work immediately. Perhaps in time we will get something global that will work for everyone, but in the meantime some amount of progress can be made on the things we are really all here to do.
JM>3) The non-http URI approach requires an extra level of
JM>infrastructure for resolving objects. For use in
JM>browsers this requires an additional plug-in. There seem
JM>to be very few available; and then only on certain
JM>browsers. Further I don't think many realize that
JM>browsers are perhaps 1/10th of the applications that
JM>follow links (e.g. robots, etc. and this is a different
JM>issue completely. One the DOI / publishers are
JM>unfortunately finding out at this very moment).
JM>A Handle-style proxy mechanism helps a bit here, but it
JM>is certainly not as clean/clear as specifying HTTP
JM>redirect as *the* resolution mechanism.

HTTP to LSID resolution protocol proxies are starting to exist already, for example the one at http://lsid.biopathways.org/resolver/, and I believe the code for it is available as open source.

JM>5) The LSID community has socially agreed that the use
JM>of LSID will point to an immutable resource - the
JM>thing one points at will be the same 5, 10, n years
JM>later. How can this be enforced socially or
JM>technically? What's the penalty for reusing an LSID? If
JM>the LSID, bits to persist, and the hash are all owned
JM>by one organization, the bits and hash could be changed
JM>together.

Apart from the social convention being clearly defined (there was always an ambiguity regarding URL as URI and persistence that made it easy to 'abuse'), one thing that might help is the adoption of metadata standards that, amongst many other things, can describe hashes of named objects and prove integrity. Software used to serve and process LSID-named data will use this information to validate data on transfer and in third party caches. A byproduct of this will be an enforcement of the social convention that named objects never change, since any changes would risk seeing the changed data objects marked as invalid by the network layers.
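As a rough illustration of that validation step, the Python sketch below checks received bytes against a size and MD5 hash taken from metadata. The metadata keys (`size_bytes`, `md5_hex`) and the sample bytes are hypothetical, since no standard metadata vocabulary for this exists yet; MD5 is used only because it is the hash mentioned in this thread, and a deployment could equally choose a stronger digest.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def matches_metadata(data: bytes, metadata: dict) -> bool:
    """True if the bytes agree with the size and hash recorded in the metadata.

    The keys "size_bytes" and "md5_hex" are assumptions for this sketch,
    not part of any LSID standard.
    """
    return (len(data) == metadata["size_bytes"]
            and md5_hex(data) == metadata["md5_hex"])


# Metadata recorded when the LSID was first issued (placeholder content).
original = b">seq1\nACGTACGTACGT\n"
recorded = {"size_bytes": len(original), "md5_hex": md5_hex(original)}

if __name__ == "__main__":
    # A cache or client receiving identical bytes accepts them.
    print(matches_metadata(original, recorded))
    # Bytes silently changed under the same LSID are rejected.
    print(matches_metadata(b">seq1\nACGTACGTACGA\n", recorded))
```

This only detects a change if the checker holds a copy of the metadata, or of the hash, that the issuing organization cannot also alter, for example one kept by a third party cache; that is the sense in which changed objects "risk" being marked invalid.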
JM>need to name/expose both the individual versions and the
JM>'latest' version, whatever number that currently is,
JM>which means bit-level persistence will probably not
JM>meet all life-science needs, which may lead to 'abuse'
JM>of LSIDs with 0-byte data to refer to things with
JM>dynamics.

It is my opinion that this does not actually qualify as abuse, since it provides a solution that is both useful and meets the standard in every respect. What is required to make it workable are the associated metadata ontologies. I would be most interested to read contrary opinion and reasoning.

Kindest regards,
Sean
--
Sean Martin
IBM Corp.
Received on Wednesday, 23 March 2005 15:03:39 UTC