- From: Myers, James D <jim.myers@pnl.gov>
- Date: Tue, 15 Mar 2005 20:36:12 -0800
- To: Eric.Neumann@sanofi-aventis.com, public-semweb-lifesci@w3.org
- Message-id: <67AF35FA07A89948AFA88E64793A1DB3D7ED10@pnlmse35.pnl.gov>
Eric,

I had some discussions with Eric Miller, Bertram Ludaescher and others after the meeting and, in the text below, I've tried to summarize some of that discussion. While I want to acknowledge their contributions and they may recognize some of their words below, I've made enough changes to summarize multiple emails, recall conversations, and add new material and references, that it shouldn't be taken as properly representing anyone's position or a consensus of any sort (not sure our discussion(s) ever reached conclusion...). Nevertheless, I thought it would be useful to add it into this discussion as background information.

Jim

James D. Myers
Chief Scientist, Computational Sciences and Mathematics Division
Computational and Information Sciences Directorate
Pacific Northwest National Laboratory
Phone: 610-355-0994
Fax: 208-474-4616
Jim.Myers@pnl.gov

LSID discussions:

There are three main sections below - issues, general comments, and some potential alternatives/directions.

Issues:

The "LSID" name: Are life science identifiers different enough that they need to be treated separately? Do we then need a physical science identifier, a computer science identifier, etc.?

LSID as a protocol as well as a name: Similar issue, but one that can also be described as death-by-plugins - if everyone who wants to control a namespace for identifiers makes a new protocol requiring a plug-in...

Persistence policy as part of the name/protocol: Is persistence such a unique and overriding piece of metadata that it should be part of the name and/or require a separate protocol? Does the name of data change when a researcher decides it is valid and should be kept forever? There seem to be problems analogous to the 'don't encode location in the name because it might move' issue.

Persistence policy as a binary option: There are many shades of grey in persistence - How long is the guarantee?
What happens to data with a 5, 10, or 50 year retention schedule, after which it is to be deleted? Is access also guaranteed, or just unique naming? Is the guarantee best effort? Does it apply to bits or an 'equivalent' (by whose definition?) item, e.g. the PDF copy of an obsolete MS Word 1.0 document? Is persistence policy handled better as metadata defined by a schema(s)?

Metadata retrieval as part of a persistent identifier protocol: Is metadata unique to persistent resources? Is there a reason to balkanize metadata access by tying the mechanism to a type of resource? Or should the semantic web provide a mechanism allowing metadata association with 'any' resource, persistent or not, via a standard mechanism?

General Commentary:

1) A model for naming resources that a community can agree on is a good / powerful thing; LSID has defined such a model and has a large growing community behind it. Yes, but... the issues above could limit growth and lead to fragmentation of the community as it raises awareness of what globally unique IDs can do and encourages other "my community's ID" protocols, and/or modifications that attempt to get around the issues noted above. Will chemists all adopt LSID simply because some of the molecules they work on are related to biology rather than materials science? Will a pharmaceutical company adopt LSID for data with retention schedules?

2) Persistent identification and the ability to persistently resolve names are not artifacts of any technology - they are an organization / community investment. It is unclear what investment the LS community has at this point for supporting resolution services (DNS, HTTP, or other). Shouldn't expectations of persistence be managed by naming convention rather than protocol - http://persistent.my.org/ addresses or the use of Handle-style/meaning-free URLs (e.g. http://456.10123.name.org/myname - see below)?
The convention of "www.*" for web servers seems to have worked very well for conveying the expectation that these machines support HTTP.

3) The non-http URI approach requires an extra level of infrastructure for resolving objects. For use in browsers this requires an additional plug-in. There seem to be very few available, and then only on certain browsers. Further, I don't think many realize that browsers are perhaps 1/10th of the applications that follow links (e.g. robots, etc.); this is a different issue completely, and one the DOI / publishers are unfortunately finding out at this very moment. A Handle-style proxy mechanism helps a bit here, but it is certainly not as clean/clear as specifying HTTP redirect as *the* resolution mechanism.

4) Non-http URIs put up barriers to adoption by other communities. There are reasons (sometimes) to do this, but has this been explored for LSID and the implications understood? And since science is becoming more interdisciplinary, the protocol really needs to be science-wide or pervasive even if namespaces are controlled by smaller orgs.

5) The LSID community has socially agreed that an LSID will point to an immutable resource - the thing one points at will be the same 5, 10, n years later. How can this be enforced socially or technically? What's the penalty for reusing an LSID? If the LSID, bits to persist, and the hash are all owned by one organization, the bits and hash could be changed together. This requirement is science-wide - it's been the argument against allowing any URLs as references in the literature, and everyone is moving to treat data in the same way. Life science is ahead in the number of individual data items to be tracked and in how large the community is that needs to persistently refer to things, hence they have the biggest problem right now, but everyone in science (and beyond) has it at some level.
Socially, it isn't clear that LSID provides any more leverage than, for example, a naming convention as in #2. Technically, without a means to make name/hash pairs non-repudiable (e.g. by registering them with a neutral third party or using a digital signature), LSID cannot detect reuse of names.

6) It is unclear how best to use LSID; more specifically *when* to use it and when *not* to. There was talk at the meeting of using these for documents, reports, concepts declared on the Semantic Web, etc. There's a slippery slope here and it will be hard to have a clear convention. I may want to name my raw data, the average of my raw data, a calibrated version of my data, my latest/best data, a graph of my data, the paper about the data, etc. From various discussions of versioning, it is clear that there are use cases that need to name/expose both the individual versions and the 'latest' version, whatever number that currently is. Bit-level persistence will therefore probably not meet all life-science needs, which may lead to 'abuse' of LSIDs with 0-byte data to refer to things with dynamics.

7) Is LSID bad? No. The level of adoption of LSID is impressive (though it isn't clear how much of that is simply attaching LSIDs for future use versus actively producing and consuming them). While the discussions at the Semantic Web for Life Sciences workshop were negative at times, one should not criticize LSIDs without acknowledging that they are a step forward and are definitely enabling and educating the community. However, the semantic web and the life sciences will need more general mechanisms for naming and associating metadata with resources, and a means to provide more detailed persistence information; promoting LSIDs as a short-term solution may not be the best option if progress on these issues can be made quickly.

Potential Alternatives:

Naming:

The Handle System - similar to LSID with its own protocol and resolution mechanism. Used in DOIs.
Has a proxy mechanism so no plug-in is required - http://hdl.handle.net/<some-handle> will invoke a resolver service and redirect you to the resource. The Handle System has its own protocol with its own metadata methods and thus shares those issues with LSIDs, but its proxy, and the fact that the protocol and namespaces are separate (i.e. the LSID community could organize part of handle space for themselves), seem like advantages over LSID. Handles are also being proposed as part of the Grid naming mechanism (see http://www.globusworld.org/program/abstract.php?id=33, https://forge.gridforum.org/projects/ogsa-wg/document/draft-charter-naming-wg/en).

Persistent URLs - standard URLs maintained by authorities that use HTTP Redirect to provide access to resources. The PURL website has extensive documentation and FAQ information: http://purl.oclc.org

Naming convention only - Use standard URLs and DNS resolution. Resolvers/authorities could be identified via a convention such as addresses starting with "uid", e.g. http://uid.my.org/. If URIs used as persistent names are "meaning-free" addresses, e.g. http://456.10123.name.org/myresourcename, it would be easy to transfer resolution duties between organizations, i.e. to reassign 10123.name.org from my organization to yours if my org doesn't want to maintain things anymore. Use HTTP redirects as a resolution mechanism.

Metadata:

Protocols such as LSID and The Handle System have their own extensible metadata mechanisms. For URL-based options, there are proposals for ways to add metadata capabilities to URLs:

The Nokia MPUT/MGET/MDELETE methods proposed as part of their URI Query Agent Model (URIQA) (http://sw.nokia.com/uriqa/URIQA.html). GET/POST mechanisms for requesting/setting metadata about third-party resources are also defined.
URIQA defines the concept of a Concise Bounded Description of a resource (http://swdev.nokia.com/uriqa/CBD.html) as the set of RDF statements accessible via these methods.

Clark et al. propose an alternate mechanism using XPointer and HTTP in "A Semantic Web Resource Protocol: XPointer and HTTP" (http://www.mindswap.org/papers/swrp-iswc04.pdf).

Persistence Policy:

With any of these naming and metadata combinations, persistence could be treated in the same way as other metadata - statements about persistence policy could be standardized and accessed via the same mechanism used to discover authors, type, creation date, provenance, etc. Persistence policy could be as simple (binary) or as complex (retention schedules, definition of identity/equivalence used, ...) as desired by various sub-communities.

Additional URLs:

Handles: www.handle.net
Tim B-L musings on names from '96: http://www.w3.org/DesignIssues/NameMyth.html
Meaning-free DNS names: http://www.frankston.com/public/essays/DNSSafeHaven.asp
Comparison of Handles and PURLs (by a Handle advocate?): http://web.mit.edu/handle/www/purl-eval.html
LSID spec: http://www.omg.org/docs/dtc/04-05-01.pdf
"Persistent Indentification (sic): A Key Component of an E-Government Infrastructure, Updated July 26, 2004" - discusses PURLs and Handles and other alternatives: http://cendi.dtic.mil/publications/04-2persist_id.html

-----Original Message-----
From: public-semweb-lifesci-request@w3.org [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of Eric.Neumann@sanofi-aventis.com
Sent: Monday, March 14, 2005 6:29 PM
To: public-semweb-lifesci@w3.org
Subject: LSID: What's still needed to make it work within the semantic web?

We had some very productive discussions on the value of the LSID specification at the workshop in October, and many of us would like to see it reach a functional conclusion.
Much of the discussion was around what still needs to be done with the specification, so that LSIDs become a beneficial and practical element of the life science community. I would like to suggest that those interested in seeing the LSID specification come to completion participate in this thread, and try to define some critical next steps for its success in being adopted by most data sources.

I would also recommend that people re-read the 3 position papers on LSID from last October's workshop: http://www.w3.org/2004/07/swls-agenda.html . Steve Chervitz's paper from Affymetrix has some very useful insights in it that I think many would appreciate.

To quickly review, LSID offers both a unique identifier model for authoritative life science data, and a mechanism by which identifiers can be resolved to actual (immutable) data bytes and (mutable) metadata.

Some lingering questions include:

* What metadata accessible through LSID should be standardized; this may be more about general info-descriptive semantics like Dublin Core and RSS than biological or chemical semantics.
* A precise way to handle versioning, derivation, and other relationship types for provenance.
* Are URN-aware resolvers an acceptable means for data retrieval for all members of the life science community? Are there any alternatives that are simpler?
* Guidelines for encoding data for common bioinformatics data types in LSID; are we all clear what is data and what is metadata? Would this include all kinds of RDF graphs that relate to the original data item? Do we need best practices on utilizing common ontologies such as GO within a data entry?
* How to specify dynamic data (latest version) effectively (minimal HTTP calls of LSIDs).

I hope other members of the LSID specification are able to participate on this thread, to help clarify the issues, and identify where most value can be gained.

Eric
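
On the point raised in this thread that reuse of an identifier can only be detected if someone other than the issuing authority remembers the name/hash pair, here is a minimal Python sketch. It is not anything defined by the LSID specification. It assumes a hypothetical in-memory registry run by a neutral third party; HashRegistry, NameReuseError and the example identifier are all invented for illustration. The registry records a SHA-256 digest the first time a name is registered and refuses a later registration of the same name with different bytes.

```python
import hashlib


class NameReuseError(Exception):
    """Raised when an already registered name is presented with different bytes."""


class HashRegistry:
    """Hypothetical neutral third party that remembers name/hash pairs."""

    def __init__(self):
        # identifier -> SHA-256 hex digest of the bytes first registered under it
        self._digests = {}

    def register(self, identifier: str, data: bytes) -> str:
        """Record the digest for a new name, or confirm it for a known one."""
        digest = hashlib.sha256(data).hexdigest()
        known = self._digests.setdefault(identifier, digest)
        if known != digest:
            raise NameReuseError(
                f"{identifier} was already registered with different bytes"
            )
        return digest

    def verify(self, identifier: str, data: bytes) -> bool:
        """True only if the name is registered and the bytes still match."""
        return self._digests.get(identifier) == hashlib.sha256(data).hexdigest()


registry = HashRegistry()
name = "urn:lsid:example.org:demo:1234"  # made-up identifier, illustration only
registry.register(name, b"ACGT")
print(registry.verify(name, b"ACGT"))  # bytes unchanged since registration
print(registry.verify(name, b"ACGA"))  # bytes replaced behind the same name
```

The design choice here is that the authority issuing the name never holds the only copy of the hash, so it cannot change bits and hash together unnoticed. A real service would also need durable storage and some form of signed receipt, which this sketch leaves out.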
Received on Wednesday, 16 March 2005 04:36:48 UTC