Re: what would change for me? from Peter Ansell on 2007-10-31 (public-semweb-lifesci@w3.org from October 2007)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Thu, 1 Nov 2007 08:38:18 +1000
To: "Jim Myers" <jimmyers@ncsa.uiuc.edu>
Cc: public-semweb-lifesci@w3.org, p.roe@qut.edu.au, j.hogan@qut.edu.au, futrelle@ncsa.uiuc.edu
Message-ID: <a1be7e0e0710311538i2afbb5c3xed1a61e41d95fa03@mail.gmail.com>
What concerns me, possibly, is that the ARK project has relatively easy
subject areas in mind. By that I mean that they do not assume
versioning will occur on the objects. Once one assigns a tag to a
physical object or observation then it will not change due to the
physically fixed status of the subject. Scientists may be able to tag
their observations with this system, but it would have to still allow
them to make up somewhat human recognisable identifiers for less
stable scientific concepts which may migrate, but take the name with
them along the trail.

The inclusion of service level agreements, signatures, and checksums
over data, are nice elements which no other scheme seems to have quite
tackled yet. There is no reason why LSID for instance could not
implement a SOAP level signature and checksum, but the fact is that it
would be specific to that protocol. The ARK/ART metadata does have the
advantage that the "administrative" metadata elements, perhaps due to
their lack of complexity, can be distributed easily.
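
As a rough illustration of why a checksum carried in plain metadata is
easy to distribute, here is a minimal Python sketch. It is not taken from
the ARK or ART documents: the field names, the "example:sequence/1"
identifier and the helper functions are all hypothetical, and SHA-256 is
simply one reasonable choice of hash.

```python
import hashlib
import json


def make_admin_metadata(identifier: str, payload: bytes) -> dict:
    """Build a small, protocol-independent metadata record (hypothetical fields)."""
    return {
        "identifier": identifier,
        "checksum_algorithm": "sha256",
        "checksum": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }


def verify(metadata: dict, payload: bytes) -> bool:
    """Check a copy of the data against the record, whichever way it was fetched."""
    if metadata["checksum_algorithm"] != "sha256":
        raise ValueError("unsupported checksum algorithm")
    same_size = len(payload) == metadata["size_bytes"]
    same_hash = hashlib.sha256(payload).hexdigest() == metadata["checksum"]
    return same_size and same_hash


# Made-up data standing in for an observation or sequence.
data = b"ATGGCCATTGTAATGGGCCGC"
record = make_admin_metadata("example:sequence/1", data)

# The record is plain JSON, so it can travel over HTTP, SOAP, email, etc.
serialised = json.dumps(record)
restored = json.loads(serialised)
```

Because the record is just data, any cache or mirror can republish it and
any consumer can call verify(restored, data) on its own copy, without the
check being tied to one transport protocol.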

Sorry about my confusion with the date being included in the tag for
ART. I do like the element; its meaning was much simpler than I made
it out to be. And it is a good thing. Jonathan Rees made an
interesting point earlier in the thread when he asked which
organisation he was a part of at any one time in his career, a
question I couldn't really debate because it is a big issue.

If a solution can be found which allows one to easily tag observations
with arbitrary unique tags, while still allowing for other stable
recognisable tags, and unstable conceptual identifiers it would be
great for everyone involved. The scattering that has happened so far
in the area is unfortunate, because it is just holding people back
from developing really powerful applications utilising global
identifiers as first-class data citizens.

One of the elements that has particularly not been looked at is the
reasoning behind choosing one cache provider over another. So far
it has been up to application developers to decide on these things on
a case-by-case basis. Sure, you can design algorithms to automate the
selection process, but there is no real input from the organisation
regarding its intentions, i.e., is the organisation under a 2-year
contract, after which it will no longer be around? There is
no way to tell from any web scraping whether they will be there in the
future. However, even given all that, I would rather have a single
identifier which is known before negotiating with data providers.

The concept of having different caches "mint" their own identifiers
for what is the same scientific concept or data stream is worrying
because there is no easy way to unite these in a common conceptual
web. It makes sense if the person actually has a physical object and
they are referring to that object, not to the concept which is not
absolutely controlled by an organisation. That is not specifically an
ARK problem, because people all over will mint their own identifiers
if they have physical control over the object, even if it has a
community accepted identifier.

I would hesitate to give mapping authorities any extra duties beyond
routing you to the original author, mainly because the opaque
identifiers make it very clear that the point is not the concept, but
the final destination. If you want people to recognise something and
distribute it, it would be better to have a human recognisable
identifier which included the recognised organisation in some way, but
which indicates that it also has community approval and can be
sourced from multiple distributors based on other factors such as
their locality to the processing power to utilise the data. Don't get
me wrong though, in cases where there is no community process
involved, and it is easy to assert what it means to assign an
identifier to a specific object, the ARK system does it well, because
in these cases the identifier does not need to hold any weight at all.

Given that, however, if it took the community years to distinguish a Gene
Ontology item representing a cellular mechanism from a metabolic
pathway (completely made up for the sake of argument), then it is
important to keep the distinction. It is also important to give it an
actual identifier based on that which will not reflect a physical
ownership by the organisation which assigned the concept that
identifier, but will still be visible in some way because it is
accepted as being that way. I.e., it would be foolish to decide to turn
Gene Ontology identifiers into arbitrary strings based on which
organisation decides to include them in their local databases.
Essentially, I think HCLS is coming at the discussion from that point
of view and debating more what to do given the situation that they
know to be occurring in scientific circles.

Authority is a big word for science :) ARK/ART, if accepted, would
suit people doing original research, and archivists, possibly even
home citizens trying to do their homework better using the library's
archive features, but when it comes to negotiated identifiers, it kind
of falls over, if only because it demands the action occur quickly,
and be unchanged forever after that. Its single greatest assumption
seems to make it fall over in dynamic conceptual circumstances.

Although I accept the conceptual namespaces could be linked to
organisations, it makes no difference whether NCBI or EBI actually
gives you the gene sequences. Basically, what a scientist wants to know
is that they are getting something which was referenced on the protein
database as being related to the given protein, or something which has
a common gene name across different species in the HUGO
classifications. I am not a big fan of enforcing single unique
URIs/URNs on concepts though. It is noble, but it puts too many extra
pieces of information onto the concept. Although I do like
http://bio2rdf.org/go:123455, I also see the wisdom in using the Banff
Manifesto style URNs, like urn:bm:go:123455, mainly because they are
not huge strings which take time to mentally process if you ever have
to look at them, and it still gives you information... as long as you
know what BM is you are fine to ignore it ever after in mental terms.
It also does not put a specific organisation in, other than that
derived from the namespace, which is more powerful than just referring
to an organisation which has been given a completely random identifier
at one time by an authority. If there could be a database of these
namespaces I would not be opposed to it, but it shouldn't be something
you force onto the concept as is.
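
To make the comparison concrete, here is a small Python sketch of how the
two forms mentioned above carry the same namespace and local identifier.
It only assumes the shapes shown in this message
(http://bio2rdf.org/<namespace>:<id> and urn:bm:<namespace>:<id>); the
function and type names are hypothetical and this is not an official
parser for either scheme.

```python
from typing import NamedTuple

# Prefixes as they appear in the examples in this message (assumed shapes).
BIO2RDF_PREFIX = "http://bio2rdf.org/"
BM_URN_PREFIX = "urn:bm:"


class ConceptId(NamedTuple):
    namespace: str  # e.g. "go"
    local_id: str   # e.g. "123455"


def parse_concept_id(identifier: str) -> ConceptId:
    """Split either form into (namespace, local_id)."""
    for prefix in (BIO2RDF_PREFIX, BM_URN_PREFIX):
        if identifier.startswith(prefix):
            rest = identifier[len(prefix):]
            break
    else:
        raise ValueError(f"unrecognised identifier form: {identifier!r}")
    namespace, sep, local_id = rest.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected <namespace>:<id> in {identifier!r}")
    return ConceptId(namespace, local_id)


def to_bio2rdf_uri(concept: ConceptId) -> str:
    return f"{BIO2RDF_PREFIX}{concept.namespace}:{concept.local_id}"


def to_bm_urn(concept: ConceptId) -> str:
    return f"{BM_URN_PREFIX}{concept.namespace}:{concept.local_id}"


from_uri = parse_concept_id("http://bio2rdf.org/go:123455")
from_urn = parse_concept_id("urn:bm:go:123455")
```

Both inputs reduce to the same (namespace, local_id) pair, which is the
point: the namespace is the only "authority" information either form
carries, and which distributor serves the data is left out of the name.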

I hope you can get some insight into the overall problems from this,
because while they are unique, they could still use some direction
from more stable circles such as archival and observation tagging.

Peter

On 01/11/2007, Jim Myers <jimmyers@ncsa.uiuc.edu> wrote:
> Peter,
>
> Feel free to post the discussion.
>
>   I can understand and ~agree with your criticisms of ARK in terms of
> their emphasis on truly meaningless identifiers, etc. - that does
> sound very archival and has such a penalty in terms of ease-of-use
> that I'm not sure the long-term preservation arguments for making
> them absolutely opaque should win.
>
> The part I liked in the paper was mostly about the recognition of the
> need to separate curators from producers and to identify both. If you
> want to identify data at its source, you can't expect that the
> producer will have the resources/interest etc. for long term
> curation, there may be multiple curators harvesting data but with
> different policies, etc.
>
> In terms of the identifier structure - I'm not sure why you don't
> like ART - the idea of moving to tags was really to get rid of any
> (new) central authority for naming - the combination of a DNS name or
> email address with a date is unique for all time (with the meaning
> 'the holder of the DNS name/email address as of date X') and, if DNS
> or email breaks/goes away in the future, it does not invalidate
> existing name authority tags. Relying on the fact that any assigner
> of DNS or email addresses needs to avoid two people having the same
> identity at the same time, one can mint tags using any DNS/email
> authority one can find. We saw this as a way to drive name assignment
> to the edges - the first point at which sensor data hits a device
> with an IP address, one can define a name assigner authority tag (IP,
> date) to that device and assign an ID to the data.
>
> Perhaps the disconnect is that the date in ART is not intended to
> have anything to do with the identifier per se - the date helps
> uniquely identify the assigner, but the assigner would then mint IDs
> of the form, for example, (IP,date):n where n = 1,2,3,... Any
> semantics of how that ID relates to another ID in terms of
> versioning, etc would still have to be in metadata and there would be
> no implied relation between, for example, (IP, date1):12345 and (IP,
> date2):12345 - this just means that two authorities, who held that IP
> address at different times, have both used the string 12345 as part
> of their identifiers of completely different objects.
>
> I'm also not clear on why you say "it is the physical institution
> which can only ever have authority over an
> object due to the one to one link at the central worldwide authority"
> - ARKs and ARTs are in two parts so that once an ID is minted, copies
> of the object curated by different name mapping authorities could
> exist and be recognized as cached copies of each other, etc. It is
> not intended (certainly in ART and I don't think in ARKs) that one
> should be able to take a full ART/ARK and consider the mapping
> authority as THE authority on the object - the initial naming
> authority was THE minter of the inner ID and the ID-to-object
> mapping, but after that, ANY mapper can mint a full actionable
> ART/ARK that represents 'their' copy of the inner ID'd object as
> preserved under their policies, etc. (You'd decide which mapping
> authorities you trust by looking at the policy metadata they provide
> (do they have a long-term mission, do they secure their servers, do
> they make digital checksums/signatures so they can be assured that
> their copy of the object hasn't been corrupted, etc. - whatever info
> makes sense).)
>
> Does this info change how you see ARK/ART (at least these features of
> them)? Or am I still missing your issue with them?
>
> In any case, as I said in the first email, I'm less interested in
> pushing for a particular solution than in getting a clear discussion
> of the issues and potential designs for a solution and I think both
> the two part mapping authority/naming authority split in ARK, and the
> tag-style naming authority mechanism we added with ART help in the
> directions you were arguing and, as you recognized implicitly in
> emailing the list, I don't think those requirements have been well
> articulated and the schemes under discussion don't seem to me to
> address them.
>
> Thanks for the discussion,
>
>   Jim
>
> At 02:37 AM 10/31/2007, Peter Ansell wrote:
> >Hi Jim,
> >
> >I think it depends on your overall reason for needing to identify an
> >object. ART and ARK both have the implication that at least an overall
> >authority will be permanently established and available to register
> >organisations and keep other authorities up to date with what the
> >current routing tables are essentially. It feels way too much like a
> >copy of DNS, without any real improvements over DNS in the resolution
> >space.
> >
> >They both to me seem more suited to the archival space, as opposed to
> >an active online data repository space which needs to be able to
> >negotiate its way through changes to both the data related to objects
> >and their relationships with other objects. In the archival space it
> >seems to be important to get a label and have it stick as a single
> >universal label, whereas in science, concepts actually change in
> >response to work done by independent scientists. The archivist's
> >strategy would be okay if the only change management strategy was to
> >migrate everyone over to the new object essentially, and leave the old
> >object as a relic which simply pointed to the new object, but this is
> >not how it works. Scientists want to both be able to use the old
> >information to verify their past results, while keeping up with new
> >understandings, and possibly running experiments again in response to
> >newly acquired information.
> >
> >I guess this could be where the date semantics in ART come in, but in
> >my opinion it is asking too much of a schema to include specific date
> >semantics, when it could be better represented in metadata. Using
> >metadata one can for instance arrange for an old identifier to reveal
> >that it has been superseded without needing to change the identifier,
> >or to specifically notify consumers.
> >
> >The HCLS group in some ways come at the issue with the understanding
> >that things should be both consistent over time, while not being
> >restricted to the current understanding about an object in the future.
> >I think that is what causes the most confusion, with even the
> >resolution semantics being determined by this aspect of the debate.
> >
> >ARK and ART both assume that the databases which link identifiers to
> >objects will survive past an organisation's lifetime, and that even if
> >the organisation ceases to exist, that the database can be slotted
> >into another organisation for hosting and maintenance purposes. While
> >it is important that archival happens, the way things are set up, it is
> >the physical institution which can only ever have authority over an
> >object due to the one to one link at the central worldwide authority.
> >I would rather that data which needs to be used by the community have
> >a systematic way of being released regularly so that it can be cached
> >and archived. I personally do not think the URI note that the HCLS
> >group are developing will be overly useful for either closed in
> >businesses, or archival type projects because of these assumptions.
> >
> >The idea that DNS systems will fail, but a system which relies on a
> >much less redundant system will survive changes is going to do more
> >harm than good to the consumer. Admittedly though, I can recognise
> >that the DOI system seems to have a mechanism for ensuring this
> >redundancy, so it could be assumed that ARK, if it is needed, could
> >reach that level eventually.
> >
> >Even after all of my discussion here, I think that the bio2rdf.org
> >system will survive the change and be applicable to me in the future.
> >It survives all of the challenges that I have put up so far even
> >without conforming specifically to a final specification which may
> >unite the centralised creative/science commons effort and the more
> >individualised bio2rdf effort. Its main feature from my perspective is
> >that it does not require everyone to individually publish everything
> >in exactly the same format. As long as the organisation provides the
> >data elements, the aggregation program can be configured to
> >standardise the information, as it is best positioned to do this.
> >True, it still relies on someone creating a namespace and converter
> >for a specific organisational provider, but it does it in such a way
> >that everything is transparent and clear to an observer. ARK in
> >particular goes out of its way to make its identifiers and authority
> >labels unclear and cryptic.
> >
> >Digital signatures and hash functions embedded within metadata are
> >less invasive for consumers than the check digit in-identifier
> >mechanism they promote, and a human recognisable string for an
> >"authority/namespace" is going to benefit users more than a completely
> >arbitrary number. Their reasoning based on the idea that acronyms and
> >strings as authority identifiers go out of date doesn't really have
> >any basis in the reality of public web databases, and it is only
> >really public databases that need to be worried about the effects of
> >their identifiers on outside consumers.
> >
> >Hope that wasn't too long a rant.
> >
> >Do you mind if I forward this to the public-semweb mailing list?
> >
> >I am CC'ing my PhD supervisors at my university.
> >
> >Cheers,
> >
> >Peter Ansell
> >
> >On 30/10/2007, Jim Myers <jimmyers@ncsa.uiuc.edu> wrote:
> > >
> > >  Peter,
> > >
> > >  I lurk on the list but generally work in areas outside the life
> > > sciences - I saw your email and like the points you make. While not
> > > wanting to add 'yet another identifier scheme' to the discussion,
> > > I've pointed a couple of people to the paper by John Kunze on the
> > > Archival Resource Key identifier scheme
> > > (http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf) because I think
> > > it has a nice discussion of the issues you raise. John distinguishes
> > > between data producers and the series of curators that may hold it
> > > at various times and translates that into a need for a two-part
> > > identifier that, in my terminology, refers to 'the thing' and 'the
> > > thing as curated by organization X', and for a means to get both the
> > > metadata about the thing and about the policies of the curating
> > > organization.
> > >
> > >  In technical terms, we've had a few quibbles with ARKs and we've
> > > proposed a few changes (Joe Futrelle, Actionable Resource Tags for
> > > Virtual Organizations,
> > > http://www.ncsa.uiuc.edu/People/futrelle/docs/art2006.pdf,
> > > An NCSA Technical Report) that would make it simpler for us to mint
> > > IDs on many platforms (every scientist's desktop, sensors in the
> > > field), but I really found the discussion of IDs in the paper helpful
> > > in thinking the problem through.
> > >
> > >   Cheers,
> > >
> > >    Jim
> > >
> > >
> > >
> > >  James D. Myers
> > >  Associate Director, Cyberenvironments and Technologies, NCSA
> > >  1205 W. Clark St, MC-257
> > >  Urbana, IL 61801
> > >  217-244-1934
> > >  jimmyers@ncsa.uiuc.edu
> > >
>
> James D. Myers
> Associate Director, Cyberenvironments and Technologies, NCSA
> 1205 W. Clark St, MC-257
> Urbana, IL 61801
> 217-244-1934
> jimmyers@ncsa.uiuc.edu
>
>
Received on Wednesday, 31 October 2007 22:38:41 UTC