Re: what would change for me? from Jim Myers on 2007-11-01 (public-semweb-lifesci@w3.org from November 2007)

From: Jim Myers <jimmyers@ncsa.uiuc.edu>
Date: Thu, 01 Nov 2007 17:58:56 -0500
To: "Peter Ansell" <ansell.peter@gmail.com>
Cc: public-semweb-lifesci@w3.org, p.roe@qut.edu.au, j.hogan@qut.edu.au, futrelle@ncsa.uiuc.edu
Message-Id: <6.2.3.4.2.20071101171821.058d1db8@pop.ncsa.uiuc.edu>
Peter,

Comments below...

At 05:38 PM 10/31/2007, Peter Ansell wrote:

>That the ARK project has relatively easy subject areas in mind is possibly
>what I am concerned about. By that I mean that they do not assume
>versioning will occur on the objects. Once one assigns a tag to a
>physical object or observation then it will not change due to the
>physically fixed status of the subject. Scientists may be able to tag
>their observations with this system, but it would have to still allow
>them to make up somewhat human recognisable identifiers for less
>stable scientific concepts which may migrate, but take the name with
>them along the trail.

I agree it is aimed more at data sets than concepts and doesn't have 
any built-in mechanism to deal with versioning for example. But, 
since the protocol allows you to access arbitrary metadata about the 
objects, there's a way to add this. (Also - I don't think the 
distinction between data and concepts is hard - for example one may 
have the concept of 'current-best data set' that changes as 
calibration/QA/QC procedures change.)
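
A minimal sketch of how versioning could ride along as metadata rather
than living in the identifier. The "superseded_by" field name, the
dictionary standing in for a metadata service, and the "ds:N"
identifiers are all made up for illustration; none of them come from
the ARK paper.

```python
def current_best(identifier, metadata):
    """Follow hypothetical 'superseded_by' links until none remains."""
    seen = set()
    while True:
        if identifier in seen:
            raise ValueError("supersession cycle at %r" % identifier)
        seen.add(identifier)
        # An identifier with no record, or no successor, is the current one.
        successor = metadata.get(identifier, {}).get("superseded_by")
        if successor is None:
            return identifier
        identifier = successor


# Stand-in for metadata fetched through the identifier protocol.
example_metadata = {
    "ds:1": {"superseded_by": "ds:2"},  # first calibration
    "ds:2": {"superseded_by": "ds:3"},  # after a QA/QC pass
    "ds:3": {},                         # current best
}
print(current_best("ds:1", example_metadata))
```

The old identifiers keep resolving to the old data, so past results can
still be checked, while a consumer who wants the 'current-best data
set' can follow the chain.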


>The inclusion of service level agreements, signatures, and checksums
>over data, are nice elements which no other scheme seems to have quite
>tackled yet. There is no reason why LSID for instance could not
>implement a SOAP level signature and checksum, but the fact is that it
>would be specific to that protocol. The ARK/ART metadata does have the
>advantage that the "administrative" metadata elements, perhaps due to
>their lack of complexity, can be distributed easily.

I agree - I think ARK does a nice job of realizing that there's 
metadata about the object and about its curation and in providing a 
way to get that metadata - what that metadata is or how complex it 
gets seems like the right thing to argue about (i.e. don't bake one 
answer into the protocol).


>Sorry about my confusion with the date being included in the tag for
>ART. I do like the element, its meaning was much simpler than I made
>it out to be. And it is a good thing. Jonathan Rees made an
>interesting point earlier in the thread when he asked which
>organisation he was a part of at any one time in his career, a
>question I couldn't really debate because it is a big issue.

Good - glad it makes sense now that the intent is clear.


>If a solution can be found which allows one to easily tag observations
>with arbitrary unique tags, while still allowing for other stable
>recognisable tags, and unstable conceptual identifiers it would be
>great for everyone involved. The scattering that has happened so far
>in the area is unfortunate, because it is just holding people back
>from developing really powerful applications utilising global
>identifiers as first-class data citizens.

Agreed - I see ARK/ART as consistent with / usable in this space, 
just not sufficient without some versioning, ability to relate 
identifiers, etc. - but if we knew what made sense here, it could drop 
in as metadata and not require a protocol/naming scheme-level change.

>One of the elements that has particularly not been looked at is the
>reasoning behind one choosing one cache provider over another. So far
>it has been up to application developers to decide on these things on
>a case by case basis. Sure, you can design algorithms to automate the
>selection process, but there is no real input from the organisation
>regarding its intentions, ie, is the organisation under a 2 year
>contract, but after that expires it will no longer be around? There is
>no way to tell from any web scraping whether they will be there in the
>future. However, even given all that, I would rather have a single
>identifier which is known before negotiating with data providers.
>
>The concept of having different caches "mint" their own identifiers
>for what is the same scientific concept or data stream, is worrying
>because there is no easy way to unite these in a common conceptual
>web. It makes sense if the person actually has a physical object and
>they are referring to that object, not to the concept which is not
>absolutely controlled by an organisation. That is not specifically an
>ARK problem, because people all over will mint their own identifiers
>if they have physical control over the object, even if it has a
>community accepted identifier.

What ARK allows two holders of an artifact or concept to do is 
to mint two ARKs that share the common ID minted by the originator 
and to make it clear that these are two references to the same thing 
curated by two different curators. Again, this doesn't solve the 
whole problem of aliases, and it may make more sense in the data 
rather than concept space, but it does give you a way to have an 
actionable (e.g. http GETable) identifier and have a way to find 
other curated copies of the object if that curator goes off-line.
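
As a rough sketch of that idea, assuming the general ARK layout of
http://<mapping authority host>/ark:/<NAAN>/<name>: two actionable ARKs
refer to the same thing when the part from "ark:/" onward matches. The
hosts, the NAAN 12345, and the name below are invented for the example,
and qualifiers or query strings on the ARK are ignored.

```python
def ark_core(actionable_ark):
    """Return the 'ark:/NAAN/Name' part, dropping the mapping authority."""
    start = actionable_ark.find("ark:/")
    if start < 0:
        raise ValueError("no ARK found in %r" % actionable_ark)
    return actionable_ark[start:]


def other_copies(ark, known_arks):
    """List other actionable ARKs that wrap the same inner identifier."""
    core = ark_core(ark)
    return [k for k in known_arks if k != ark and ark_core(k) == core]


# Hypothetical curators holding copies of the same object.
known = [
    "http://curator-a.example.org/ark:/12345/x7data",
    "http://curator-b.example.net/ark:/12345/x7data",
    "http://curator-b.example.net/ark:/12345/other",
]
print(other_copies(known[0], known))
```

If curator-a goes off-line, the shared inner ID is what lets a client
recognize curator-b's copy as a reference to the same object.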

>I would hesitate to give mapping authorities any extra duties beyond
>routing you to the original author, mainly because the opaque
>identifiers make it very clear that the point is not the concept, but
>the final destination. If you want people to recognise something and
>distribute it, it would be better to have a human recognisable
>identifier which included the recognised organisation in some way, but
>which indicates that it also has community approval and can be
>sourced from multiple distributors based on other factors such as
>their locality to the processing power to utilise the data. Don't get
>me wrong though, in cases where there is no community process
>involved, and it is easy to assert what it means to assign an
>identifier to a specific object, the ARK system does it well, because
>in these cases the identifier does not need to hold any weight at all.

I think ARK recognizes that even community organizations may not have 
infinite lifetimes, so even there you may need a way for a new 
curator to become more than a pass-through. The case is clearer when 
you focus on data sets and individual contributions where the 
lifetime over which the original namer may keep the object accessible 
is much shorter. (Which is why we added the tag-style name authority 
extension with ART - if you don't expect the originator to keep the 
data accessible, it is not clear that you want to rely on them/force 
them into using a central authority mechanism...)
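
A minimal sketch of the tag-style idea, using the "(IP,date):n" form
from the earlier message quoted below. The string layout here is only
illustrative and is not claimed to be the actual ART syntax; the class
name and the example address and dates are invented.

```python
import itertools


class NameAuthority:
    """A name assigner identified by 'holder of X as of date D'."""

    def __init__(self, holder, as_of):
        # holder: a DNS name, email address, or IP; as_of: a date string.
        self.tag = "(%s,%s)" % (holder, as_of)
        self._counter = itertools.count(1)

    def mint(self):
        """Mint the next ID; no central registry is consulted."""
        return "%s:%d" % (self.tag, next(self._counter))


# Two different holders of the same address at different dates.
first = NameAuthority("192.0.2.10", "2007-10-30")
later = NameAuthority("192.0.2.10", "2009-01-15")
print(first.mint(), first.mint(), later.mint())
```

The two authorities can both use the local string "1" without
colliding, and nothing about the pair implies a versioning relation;
that would still have to be stated in metadata.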

>Given that however, if it took the community years to distinguish a Gene
>Ontology item representing a cellular mechanism from a metabolic
>pathway (completely made up for the sake of argument), then it is
>important to keep the distinction. It is also important to give it an
>actual identifier based on that which will not reflect a physical
>ownership by the organisation which assigned the concept that
>identifier, but will still be visible in some way because it is
>accepted as being that way. Ie, it would be foolish to decide to turn
>Gene Ontology identifiers into arbitrary strings based on which
>organisation decides to include them in their local databases.
>Essentially I think HCLS is coming at the discussion from that point
>of view and debating more what to do given the situation that they
>know to be occurring in scientific circles.
>
>Authority is a big word for science :) ARK/ART, if accepted would fit
>well to people doing original research, and archivists, possibly even
>home citizens trying to do their homework better using the library's
>archive features, but when it comes to negotiated identifiers, it kind
>of falls over, if only because it demands the action occur quickly,
>and be unchanged forever after that. Its single greatest assumption
>seems to make it fall over in dynamic conceptual circumstances.

I don't see how it enforces things being unchanged - it does not have 
a mechanism to describe change but could still be used in such a framework.

>Although I accept the conceptual namespaces could be linked to
>organisations, it makes no difference whether NCBI or EBI actually
>give you the gene sequences, basically what a scientist wants to know
>is that they are getting something which was referenced on the protein
>database as being related to the given protein, or something which has
>a common gene name across different species in the HUGO
>classifications. I am not a big fan of enforcing single unique
>URI/URN's on concepts though. It is noble, but it puts too many extra
>pieces of information onto the concept. Although, I do like
>http://bio2rdf.org/go:123455, I also see the wisdom in using the Banff
>Manifesto style URN's, like urn:bm:go:123455, mainly because they are
>not huge strings which take time to mentally process if you ever have
>to look at them, and it still gives you information... as long as you
>know what BM is you are fine to ignore it ever after in mental terms.
>It also does not put a specific organisation in, other than that
>derived from the namespace, which is more powerful than just referring
>to an organisation which has been given a completely random identifier
>at one time by an authority. If there could be a database of these
>namespaces I would not be opposed to it, but it shouldn't be something
>you force onto the concept as is.

I'll just note that the two identifiers you use as examples have the 
separation of the thing (go:123455) from the actionable URL/URN you 
can use to get to it/information about it. As I said in my first 
reply before this went to the list - I like the ARK paper by Kunze 
that explains why this separation helps more than ARK itself, i.e. 
looking at ARK and the reasoning behind it will help any identifier 
scheme even if ARK itself has problems for a given use.
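
To make the separation concrete, here is a small sketch that pulls the
inner "namespace:id" out of the two example forms from this thread. The
patterns cover only those two forms and are not a general resolver; the
function names are invented.

```python
import re

# Only the two actionable forms quoted in this thread are recognised.
_PATTERNS = [
    re.compile(r"^http://bio2rdf\.org/(?P<ns>[^:/]+):(?P<local>.+)$"),
    re.compile(r"^urn:bm:(?P<ns>[^:]+):(?P<local>.+)$"),
]


def split_identifier(identifier):
    """Return (namespace, local_id) for a known form, else None."""
    for pattern in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return match.group("ns"), match.group("local")
    return None


def same_thing(a, b):
    """True when two actionable identifiers wrap the same inner thing."""
    core_a, core_b = split_identifier(a), split_identifier(b)
    return core_a is not None and core_a == core_b


print(split_identifier("http://bio2rdf.org/go:123455"))
print(same_thing("http://bio2rdf.org/go:123455", "urn:bm:go:123455"))
```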


>I hope you can get some insight into the overall problems from this,
>because while they are unique, they could still use some direction
>from more stable circles such as archival and observation tagging.

Yep - all valuable discussion and I agree that the cross 
fertilization is key (e-Science has been a great banner/driver to 
foster such exchanges!)

Cheers,

  Jim

>Peter
>
>On 01/11/2007, Jim Myers <jimmyers@ncsa.uiuc.edu> wrote:
> > Peter,
> >
> > Feel free to post the discussion.
> >
> >   I can understand and ~agree with your criticisms of ARK in terms of
> > their emphasis on truly meaningless identifiers, etc. - that does
> > sound very archival and has such a penalty in terms of ease-of-use
> > that I'm not sure the long-term preservation arguments for making
> > them absolutely opaque should win.
> >
> > The part I liked in the paper was mostly about the recognition of the
> > need to separate curators from producers and to identify both. If you
> > want to identify data at its source, you can't expect that the
> > producer will have the resources/interest etc. for long term
> > curation, there may be multiple curators harvesting data but with
> > different policies, etc.
> >
> > In terms of the identifier structure - I'm not sure why you don't
> > like ART - the idea of moving to tags was really to get rid of any
> > (new) central authority for naming - the combination of a DNS name or
> > email address with a date is unique for all time (with the meaning
> > 'the holder of the DNS name/email address as of date X') and, if DNS
> > or email breaks/goes away in the future, it does not invalidate
> > existing name authority tags. Relying on the fact that any assigner
> > of DNS or email addresses needs to avoid two people having the same
> > identity at the same time, one can mint tags using any DNS/email
> > authority one can find. We saw this as a way to drive name assignment
> > to the edges - the first point at which sensor data hits a device
> > with an IP address, one can define a name assigner authority tag (IP,
> > date) to that device and assign an ID to the data.
> >
> > Perhaps the disconnect is that the date in ART is not intended to
> > have anything to do with the identifier per se - the date helps
> > uniquely identify the assigner, but the assigner would then mint IDs
> > of the form, for example, (IP,date):n where n = 1,2,3,... Any
> > semantics of how that ID relates to another ID in terms of
> > versioning, etc would still have to be in metadata and there would be
> > no implied relation between, for example, (IP, date1):12345 and (IP,
> > date2):12345 - this just means that two authorities, who held that IP
> > address at different times, have both used the string 12345 as part
> > of their identifiers of completely different objects.
> >
> > I'm also not clear on why you say "it is the physical institution
> > which can only ever have authority over an
> > object due to the one to one link at the central worldwide authority"
> > - ARKs and ARTs are in two parts so that once an ID is minted, copies
> > of the object curated by different name mapping authorities could
> > exist and be recognized as cached copies of each other, etc. It is
> > not intended (certainly in ART and I don't think in ARKs) that one
> > should be able to take a full ART/ARK and consider the mapping
> > authority as  THE authority on the object - the initial naming
> > authority was THE minter of the inner ID and the ID-to-object
> > mapping, but after that, ANY mapper can mint a full actionable
> > ART/ARK that represents 'their' copy of the inner ID'd object as
> > preserved under their policies, etc. (You'd decide which mapping
> > authorities you trust by looking at the policy metadata they provide
> > (do they have a long-term mission, do they secure their servers, do
> > they make digital checksums/signatures so they can be assured that
> > their copy of the object hasn't been corrupted, etc. - whatever info
> > makes sense).)
> >
> > Does this info change how you see ARK/ART (at least these features of
> > them)? Or am I still missing your issue with them?
> >
> > In any case, as I said in the first email, I'm less interested in
> > pushing for a particular solution than in getting a clear discussion
> > of the issues and potential designs for a solution and I think both
> > the two part mapping authority/naming authority split in ARK, and the
> > tag-style naming authority mechanism we added with ART help in the
> > directions you were arguing and, as you recognized implicitly in
> > emailing the list, I don't think those requirements have been well
> > articulated and the schemes under discussion don't seem to me to
> > address them.
> >
> > Thanks for the discussion,
> >
> >   Jim
> >
> > At 02:37 AM 10/31/2007, Peter Ansell wrote:
> > >Hi Jim,
> > >
> > >I think it depends what your overall reason for needing to identify an
> > >object. ART and ARK both have the implication that at least an overall
> > >authority will be permanently established and available to register
> > >organisations and keep other authorities up to date with what the
> > >current routing tables are essentially. It feels way too much like a
> > >copy of DNS, without any real improvements over DNS in the resolution
> > >space.
> > >
> > >They both to me seem more suited to the archival space, as opposed to
> > >an active online data repository space which needs to be able to
> > >negotiate its way through changes to both the data related to objects
> > >and their relationships with other objects. In the archival space it
> > >seems to be important to get a label and have it stick as a single
> > >universal label, whereas in science, concepts actually change in
> > >response to work done by independent scientists. The archivist's
> > >strategy would be okay if the only change management strategy was to
> > >migrate everyone over to the new object essentially, and leave the old
> > >object as a relic which simply pointed to the new object, but this is
> > >not how it works. Scientists want to both be able to use the old
> > >information to verify their past results, while keeping up with new
> > >understandings, and possibly running experiments again in response to
> > >newly acquired information.
> > >
> > >I guess this could be where the date semantics in ART come in, but in
> > >my opinion it is asking too much of a schema to include specific date
> > >semantics, when it could be better represented in metadata. Using
> > >metadata one can for instance arrange for an old identifier to reveal
> > >that it has been superseded without needing to change the identifier,
> > >or to specifically notify consumers.
> > >
> > >The HCLS group in some ways come at the issue with the understanding
> > >that things should be both consistent over time, while not being
> > >restricted to the current understanding about an object in the future.
> > >I think that is what causes the most confusion, with even the
> > >resolution semantics being determined by this aspect of the debate.
> > >
> > >ARK and ART both assume that the databases which link identifiers to
> > >objects will survive past an organisation's lifetime, and that even if
> > >the organisation ceases to exist, that the database can be slotted
> > >into another organisation for hosting and maintenance purposes. While
> > >it is important that archival happens, the way things are set up, it is
> > >the physical institution which can only ever have authority over an
> > >object due to the one to one link at the central worldwide authority.
> > >I would rather that data which needs to be used by the community have
> > >a systematic way of being released regularly so that it can be cached
> > >and archived. I personally do not think the URI note that the HCLS
> > >group are developing will be overly useful for either closed in
> > >businesses, or archival type projects because of these assumptions.
> > >
> > >The idea that DNS systems will fail, but a system which relies on a
> > >much less redundant system will survive changes is going to do more
> > >harm than good to the consumer. Admittedly though, I can recognise
> > >that the DOI system seems to have a mechanism for ensuring this
> > >redundancy, so it could be assumed that ARK, if it is needed, could
> > >reach that level eventually.
> > >
> > >Even after all of my discussion here, I think that the bio2rdf.org
> > >system will survive the change and be applicable to me in the future.
> > >It survives all of the challenges that I have put up so far even
> > >without conforming specifically to a final specification which may
> > >unite the centralised creative/science commons effort and the more
> > >individualised bio2rdf effort. Its main feature from my perspective is
> > >that it does not require everyone to individually publish everything
> > >in exactly the same format. As long as the organisation provides the
> > >data elements, the aggregation program can be configured to
> > >standardise the information, as it is best positioned to do this.
> > >True, it still relies on someone creating a namespace and converter
> > >for a specific organisational provider, but it does it in such a way
> > >that everything is transparent and clear to an observer. ARK in
> > >particular goes out of its way to make its identifiers and authority
> > >labels unclear and cryptic.
> > >
> > >Digital signatures and hash functions embedded within metadata are
> > >less invasive for consumers than the check digit in-identifier
> > >mechanism they promote, and a human recognisable string for an
> > >"authority/namespace" is going to benefit users more than a completely
> > >arbitrary number. Their reasoning based on the idea that acronyms and
> > >strings as authority identifiers go out of date doesn't really have
> > >any basis in the reality of public web databases, and it is only
> > >really public databases that need to be worried about the effects of
> > >their identifiers on outside consumers.
> > >
> > >Hope that wasn't too long a rant.
> > >
> > >Do you mind if I forward this to the public-semweb mailing list?
> > >
> > >I am CC'ing my PhD supervisors at my university.
> > >
> > >Cheers,
> > >
> > >Peter Ansell
> > >
> > >On 30/10/2007, Jim Myers <jimmyers@ncsa.uiuc.edu> wrote:
> > > >
> > > >  Peter,
> > > >
> > > >  I lurk on the list but generally work in areas outside the life
> > > > sciences - I saw your email and like the points you make. While not
> > > > wanting to add 'yet another identifier scheme' to the discussion,
> > > > I've pointed a couple of people to the paper by John Kunze on the
> > > > Archival Resource Key identifier scheme
> > > > (http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf) because I think
> > > > it has a nice discussion of the issues you raise. John distinguishes
> > > > between data producers and the series of curators that may hold it
> > > > at various times and translates that into a need for a two-part
> > > > identifier whose parts, in my terminology, refer to 'the thing' and
> > > > 'the thing as curated by organization X', and for a means to get both
> > > > the metadata about the thing and about the policies of the curating
> > > > organization.
> > > >
> > > >  In technical terms, we've had a few quibbles with ARKs and we've
> > > > proposed a few changes (Joe Futrelle, Actionable Resource Tags for
> > > > Virtual Organizations,
> > > > http://www.ncsa.uiuc.edu/People/futrelle/docs/art2006.pdf,
> > > > An NCSA Technical Report) that would make it simpler for us to mint
> > > > IDs on many platforms (every scientist's desktop, sensors in the
> > > > field), but I really found the discussion of IDs in the paper helpful
> > > > in thinking the problem through.
> > > >
> > > >   Cheers,
> > > >
> > > >    Jim
> > > >
> > > >
> > > >
> > > >  James D. Myers
> > > >  Associate Director, Cyberenvironments and Technologies, NCSA
> > > >  1205 W. Clark St, MC-257
> > > >  Urbana, IL 61801
> > > >  217-244-1934
> > > >  jimmyers@ncsa.uiuc.edu
> > > >
> >
> > James D. Myers
> > Associate Director, Cyberenvironments and Technologies, NCSA
> > 1205 W. Clark St, MC-257
> > Urbana, IL 61801
> > 217-244-1934
> > jimmyers@ncsa.uiuc.edu
> >
> >

James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
1205 W. Clark St, MC-257
Urbana, IL 61801
217-244-1934
jimmyers@ncsa.uiuc.edu
Received on Thursday, 1 November 2007 22:59:14 UTC