Re: [BioRDF] All about the LSID URI/URN

Adding to the identifier discussion - here are my 
notes summarizing discussions at/after the 
initial Semantic Web for Life Science workshop 
(primarily between myself, Sean Martin, and Eric 
Miller). Some of the points have come up in other 
posts and I haven't tried to go through and 
update this, but I thought it might have some utility still.

My bias, which comes through some in the notes, 
has come from working with the webDAV protocol 
(which allows setting text/XML metadata on any 
managed URL) which has shown me how powerful a 
'universal' mechanism for associating metadata 
with 'any' URL can be. webDAV has many 
limitations but a URL identifier plus URIQA-style 
extension to HTTP (get/set of RDF metadata ala 
webDAV, referenced below) seems like it would 
overcome them. One could then associate metadata 
about persistence policy with the ID and treat 
persistent and transient objects uniformly in 
terms of their other properties (e.g. I use the 
same protocol to get the provenance of an 
intermediate data set in an ongoing computation 
as I would for data that will be kept for 5 
years and for that which will be maintained forever).

Side note: The discussion on the list is great - please keep it up!

   Jim

Issues:

The "LSID" name:

Are life science identifiers different enough 
that they need to be treated separately? Do we 
then need a physical science identifier, a computer science identifier, etc.?


LSID as a protocol as well as a name:

Similar issue, but one that can also be described 
as death-by-plugins - if everyone who wants to 
control a namespace for identifiers makes a new 
protocol requiring a plug-in...


Persistence policy as part of the name/protocol:

Is persistence such a unique and overriding piece 
of metadata that it should be part of the name 
and/or require a separate protocol? Does the name 
of data change when a researcher decides it is 
valid and should be kept forever? There seem to 
be problems analogous to the 'don't encode 
location in the name because it might move' issue.


Persistence policy as a binary option:

There are many shades of grey in persistence - 
How long is the guarantee? What happens to data 
with a 5, 10, or 50 year retention schedule after 
which it is to be deleted? Is access also guaranteed 
or just unique naming? Is the guarantee best 
effort? Does it apply to bits or an ‘equivalent’ 
(by whose definition) item, e.g. the PDF copy of 
an obsolete MS Word 1.0 document? Is persistence 
policy handled better as metadata defined by a schema(s)?

Metadata retrieval as part of a persistent identifier protocol:

Is metadata unique to persistent resources? Is 
there a reason to balkanize metadata access by 
tying the mechanism to a type of resource? Or 
should the semantic web provide a mechanism 
allowing metadata association with ‘any’ 
resource, persistent or not, via a standard mechanism?


General Commentary:


1) A model for naming resources that a community 
can agree on is a good / powerful thing; LSID has 
defined such a model and has a large growing community behind it.

Yes, but…
the issues above could limit growth and lead to 
fragmentation of the community as it raises 
awareness of what globally unique IDs can do and 
encourages other “my community’s ID” protocols, 
and/or modifications that attempt to get around 
the issues noted above. Will chemists all adopt 
LSID simply because some of the molecules they 
work on are related to biology rather than 
materials science? Will a pharmaceutical company 
adopt LSID for data with retention schedules?


2) Persistence identification and the ability to 
persistently resolve names are not artifacts of 
any technology – they are an organization / 
community investment. It is unclear what 
investment the LS community has at this point for 
supporting resolution services (DNS, HTTP, or other).

Shouldn't expectations of persistence be 
managed by naming convention rather than protocol 
– http://persistent.my.org/ addresses or the use 
of Handle-style/meaning-free URLs (e.g. 
http://456.10123.name.org/myname 
- see below)? The convention of 
"www.*" for web servers seems to 
have worked very well for conveying the 
expectation that these machines support HTTP.


3) The non-http URI approach requires an extra 
level of infrastructure for resolving objects. 
For use in browsers this requires an additional 
plug-in. There seem to be very few available; and 
then only on certain browsers. Further, I don't 
think many realize that browsers are perhaps 
1/10th of the applications that follow links 
(e.g. robots, etc.). This is a different issue 
completely, and one that DOI / publishers are 
unfortunately finding out about at this very moment.

A Handle-style proxy mechanism helps a bit here, 
but it is certainly not as clean/clear as 
specifying HTTP redirect as *the* resolution mechanism.


4) Non-http URIs put up barriers to adoption by 
other communities. There are reasons (sometimes) 
to do this, but has this been explored for LSID 
and the implications understood?

And since science is becoming more 
interdisciplinary, the protocol really needs to 
be science-wide or pervasive even if namespaces are controlled by smaller orgs.

5) The LSID community has socially agreed that 
the use of LSID will point to an immutable 
resource - the thing one points at will be the 
same 5, 10, n years later.  How can this be 
enforced socially or technically? What’s the 
penalty for reusing an LSID? If the LSID, bits to 
persist, and the hash are all owned by one 
organization, the bits and hash could be changed together.

This requirement is science-wide - it's been the 
argument against allowing any URLs as references 
in the literature, and everyone is moving to 
treat data in the same way. Life science is ahead 
in the number of individual data items to be 
tracked and in how large the community is that 
needs to persistently refer to things, hence they 
have the biggest problem right now, but everyone 
in science (and beyond) has it at some level. 
Socially, it isn’t clear that LSID provides any 
more leverage than, for example, a naming 
convention as in #2. Technically, without a means 
to make name/hash pairs non-repudiable (e.g. by 
registering them with a neutral third party or 
using a digital signature), LSID cannot detect reuse of names.
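
To make the third-party option concrete, here is a minimal sketch of a registry that remembers the first hash registered for each name and flags any later attempt to register different bits under the same name. The class name, the method, and the example LSID are hypothetical illustrations; nothing like this is defined in the LSID spec.

```python
import hashlib


class NameHashRegistry:
    """Hypothetical neutral third party: keeps the first hash seen per name."""

    def __init__(self):
        self._records = {}  # name -> hex digest registered first

    def register(self, name: str, data: bytes) -> bool:
        """Record the name/hash pair.

        Returns True if the name is new or the bits match the first
        registration, False if the name is being reused for other bits.
        """
        digest = hashlib.sha256(data).hexdigest()
        first = self._records.setdefault(name, digest)
        return first == digest


registry = NameHashRegistry()
name = "urn:lsid:example.org:data:1"  # made-up identifier for illustration
ok_first = registry.register(name, b"original bits")
ok_same = registry.register(name, b"original bits")
ok_changed = registry.register(name, b"changed bits")
print(ok_first, ok_same, ok_changed)  # True True False
```

Because the registry is held by someone other than the data owner, the owner can no longer change the bits and the hash together without it being noticed.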

6) It is unclear how best to use LSID; more 
specifically *when* to use it and when *not* to. 
There was talk at the meeting of using these for 
documents, reports, concepts declared on the Semantic Web, etc.

There's a slippery slope here and it will be hard 
to have a clear convention. I may want to name my 
raw data, the average of my raw data, a 
calibrated version of my data, my latest/best 
data, a graph of my data, the paper about the 
data, etc. From various discussions of 
versioning, it is clear that there are use cases 
that need to name/expose both the individual 
versions and the 'latest' version, whatever 
number that currently is, which means bit-level 
persistence will probably not meet all 
life-science needs, which may lead to 'abuse' of 
LSIDs with 0-byte data to refer to things with dynamics.

7) Is LSID bad?

No. The level of adoption of LSID is impressive 
(though it isn't clear how much of that is simply 
attaching LSIDs for future use versus actively 
producing and consuming them). While the 
discussions at the Semantic Web for Life Sciences 
workshop were negative at times, one should not 
criticize LSIDs without acknowledging that they 
are a step forward and are definitely enabling 
and educating the community. However, the 
semantic web and the life sciences will need more 
general mechanisms for naming and associating 
metadata with resources, and a means to provide 
more detailed persistence information; promoting 
LSIDs as a short-term solution may not be the 
best option if progress on these issues can be made quickly.


Potential Alternatives:

Naming:

The Handle System – similar to LSID with its own 
protocol and resolution mechanism. Used in DOIs. 
Has a proxy mechanism so no plug-in is required - 
http://hdl.handle.net/<some-handle> 
will invoke a resolver service and redirect you 
to the resource. The Handle System has its own 
protocol with its own metadata methods and thus 
shares those issues with LSIDs, but its proxy and 
the fact that the protocol and namespaces are 
separate (i.e. the LSID community could organize 
part of handle space for themselves) seem like 
advantages over LSID. Handles are also being 
proposed as part of the Grid naming mechanism 
(see 
http://www.globusworld.org/program/abstract.php?id=33, 
https://forge.gridforum.org/projects/ogsa-wg/document/draft-charter-naming-wg/en).


Persistent URLs – standard URLs maintained by 
authorities that use HTTP Redirect to provide 
access to resources. The PURL website has 
extensive documentation and FAQ information: 
http://purl.oclc.org

Naming convention only - Use standard URLs and 
DNS resolution. Resolvers/authorities could be 
identified via a convention such as addresses 
starting with “uid”, e.g. 
http://uid.my.org/. If URIs 
used as persistent names are “meaning-free” 
addresses, e.g. 
http://456.10123.name.org/myresourcename, 
it would be easy to transfer resolution duties 
between organizations, i.e. to reassign 
10123.name.org from my organization to yours if 
my org doesn’t want to maintain things anymore. 
Use redirects as a resolution mechanism.
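
As a rough sketch of what redirect-based resolution amounts to, the function below maps a meaning-free persistent URL to the response a resolver might send. The lookup table, the target URL, and the function name are assumptions made up for illustration; a real resolver would sit behind an HTTP server and a database.

```python
from urllib.parse import urlsplit

# Hypothetical resolver table: persistent path -> where the resource lives today.
LOCATIONS = {
    "/myresourcename": "http://data.my.org/archive/2006/myresourcename",
}


def resolve(url: str, table: dict) -> tuple:
    """Return the (status, headers) a resolver might send for a persistent URL."""
    path = urlsplit(url).path
    target = table.get(path)
    if target is None:
        return 404, {}
    # The persistent name never changes; only the redirect target does.
    return 302, {"Location": target}


status, headers = resolve("http://456.10123.name.org/myresourcename", LOCATIONS)
print(status, headers["Location"])
```

Transferring resolution duties then means pointing DNS for 10123.name.org at another organization's resolver and handing over the table; the names people have already cited stay the same.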

Metadata:

Protocols such as LSID and The Handle System have 
their own extensible metadata mechanisms. For 
URL-based options, there are proposals for ways 
to add metadata capabilities to URLs:

The Nokia MPUT/MGET/MDELETE methods proposed as 
part of their URI Query Agent Model (URIQA) 
(http://sw.nokia.com/uriqa/URIQA.html). 
URIQA defines the concept of a Concise Bounded 
Description of a resource as the set of RDF 
statements accessible via these methods.

Clark et al. propose an alternate mechanism 
using XPointer and HTTP in “A Semantic Web 
Resource Protocol: XPointer and HTTP” 
(http://www.mindswap.org/papers/swrp-iswc04.pdf).


Persistence Policy:

With any of these naming and metadata 
combinations, persistence could be treated in the 
same way as other metadata – statements about 
persistence policy could be standardized and 
accessed via the same mechanism used to discover 
authors, type, creation date, etc.
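
A small sketch of the idea, assuming a made-up property vocabulary (the "persistence:*" keys below are hypothetical, not from any published schema): persistence statements sit alongside the ordinary metadata and are read the same way.

```python
from datetime import date

# Hypothetical metadata record for one resource; every key name is an assumption.
metadata = {
    "creator": "example author",
    "type": "dataset",
    "created": date(2006, 7, 7),
    "persistence:retainUntil": date(2011, 7, 7),  # None would mean "kept forever"
    "persistence:guarantee": "best-effort",       # as opposed to a contractual one
    "persistence:scope": "bits",                  # or "equivalent", e.g. a PDF copy
}


def retained_on(meta: dict, day: date) -> bool:
    """True if the stated policy still covers the resource on the given day."""
    until = meta.get("persistence:retainUntil")
    return until is None or day <= until


print(retained_on(metadata, date(2010, 1, 1)))  # True
print(retained_on(metadata, date(2012, 1, 1)))  # False
```

A transient intermediate data set and a record kept forever would differ only in these values, not in the protocol used to read them.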


Additional URLs:
Handles: www.handle.net
Tim B-L musings on names from '96: 
http://www.w3.org/DesignIssues/NameMyth.html
Meaning-free DNS names: 
http://www.frankston.com/public/essays/DNSSafeHaven.asp
Comparison of Handles and PURLs (by a Handle 
advocate?): 
http://web.mit.edu/handle/www/purl-eval.html
LSID spec: 
http://www.omg.org/docs/dtc/04-05-01.pdf

“Persistent Indentification (sic): A Key Component of an
E-Government Infrastructure, Updated July 26, 
2004” – discusses PURLs and Handles and other 
alternatives: 
http://cendi.dtic.mil/publications/04-2persist_id.html




At 07:57 AM 7/7/2006, Dan Connolly wrote:

>http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Jun/0210.html
>
> > The root of the problem is that the URL
> > contains in it more than just a name. It also contains the network
> > location where the only copy of the named object can be found (this is the
> > hostname or ip address)
>
>Which URL is that? It's not true of all URLs. Take, for example,
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
>
>That URL does not contain the network location where the only
>copy can be found; there are several copies on mirrors around the
>globe.
>
>$ host www.w3.org
>www.w3.org has address 128.30.52.46
>www.w3.org has address 193.51.208.69
>www.w3.org has address 193.51.208.70
>www.w3.org has address 128.30.52.31
>www.w3.org has address 128.30.52.45
>
>
>FYI, the TAG is working on a finding on URNs, Namespaces, and Registries;
>the current draft has a brief treatment of this 
>issue of location (in)dependence...
>http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent
>
>
> > as well as the only means by which one may
> > retrieve it (the protocol, usually http, https or ftp). The first question
> > to ask yourself here is that when you are uniquely naming (in all of space
> > and time!) a file/digital object which will be usefully copied far and
> > wide, does it make sense to include as an integral part of that name the
> > only protocol by which it can ever be accessed and the only place where
> > one can find that copy?
>
>If a better protocol comes along, odds are good that it will be usable
>with names starting with http: .
>
>See section 2.3 Protocol Independence
>http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#protocol_independent
>
>
> > Unfortunately when it
> > comes to URL's there is no way to know that what is served one day will be
> > served out the next simply by looking at the URL string. There is no
> > social convention or technical contract to support the behavior that would
> > be required.
>
>Again, that's not true for all URLs. There are social and technical
>means to establish that
>
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
>
>can be cached for a long time.
>
>The social mechanism includes published policies such as...
>
>"As of this note, persistent resources include:
>      1. ...
>      2. Those which start "http://www.w3.org/TR/" immediately followed
>         by four decimal digits."
>  --- http://www.w3.org/Consortium/Persistence
>
>and the technical mechanisms include HTTP caching headers:
>   Expires: Sat, 07 Jul 2007 12:51:56 GMT
>
>   (a 1 year expiry time is the maximum time per rfc2616)
>
>--
>Dan Connolly, W3C http://www.w3.org/People/Connolly/
>D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E

James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
jimmyers@ncsa.uiuc.edu

Received on Friday, 7 July 2006 14:08:19 UTC