Re: Link rot in Supreme Court decisions from Herbert van de Sompel on 2013-09-30 (www-tag@w3.org from September 2013)

From: Herbert van de Sompel <hvdsomp@gmail.com>
Date: Mon, 30 Sep 2013 12:04:57 -0600
To: Larry Masinter <masinter@adobe.com>
Cc: "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>, Karl Dubost <karl@la-grange.net>, Mark Nottingham <mnot@mnot.net>, "www-tag@w3.org WG" <www-tag@w3.org>, Herbert van de Sompel <hvdsomp@gmail.com>
Message-ID: <CAOywMHfW2O6EqQSDkoaTorf2Ebi4r-S61BvtDZiqLWzcLV8tVw@mail.gmail.com>

On Sat, Sep 28, 2013 at 10:35 PM, Larry Masinter <masinter@adobe.com> wrote:
> to solve "link rot" you have to solve "storage rot":
>
> having a reliable archive for ensuring permanent access to referenced material is both
> * necessary: without access to some representation of the material, the persistent pointer is useless
> * sufficient: any permanent way of accessing material must of necessity have an index system for identifying the material preserved.
>

For many web resources, prior versions are available from a variety of
sources, not just one source:
* crawl-based archives such as the Internet Archive, and national web
archives such as the British Library, the UK National Archives, and
the Icelandic web archives,
* subscription-based web archives such as Archive-It,
* on-demand web archives such as archive.is and perma.cc,
* transactional web archives such as SiteStory, cf.
http://mementoweb.github.io/SiteStory/,
* content management systems with time-based versioning such as
Wikipedia and all MediaWiki installations.

All these sources of resource versions have their own index, which in
essence contains the following information per version resource:
- URI-R of the original resource
- URI-M of the versioned resource
- version datetime
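
As a rough illustration only (not taken from any specification), such
an index entry could be modeled as a small record. The class name, the
field names, and the example URIs below are my own assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionEntry:
    """One index entry. Names are hypothetical, for illustration only."""
    uri_r: str            # URI-R of the original resource
    uri_m: str            # URI-M of the versioned resource
    version_dt: datetime  # version datetime (kept in UTC in this sketch)


# Made-up example entry; neither URI refers to a real archive.
entry = VersionEntry(
    uri_r="http://example.com/page",
    uri_m="http://archive.example.org/20130101000000/http://example.com/page",
    version_dt=datetime(2013, 1, 1, tzinfo=timezone.utc),
)
```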

The Memento protocol specifies an interoperable approach to interact
with those indexes. It consists of two components:
- Datetime negotiation with a TimeGate for an original resource: Given
a URI-R of the original resource and a preferred datetime, return a
URI-M for a versioned resource that is temporally close to the
preferred datetime. Note that for a CMS the exact version that was
active at the preferred datetime will be returned. For web archives,
how close the returned version is to the preferred datetime depends on
the coverage of the archive for the original resource.
- Listing all versions of the original resource via a TimeMap for that
resource: Such a list details the URI-R of the original resource and,
for each version, the URI-M of the versioned resource as well as its
version datetime.
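
To make the two components a bit more concrete, here is a minimal
in-memory sketch. It is not the protocol itself (no HTTP is involved),
and the function names, the toy index, and the example.org URIs are
all assumptions made for illustration. It also contrasts "temporally
closest" selection with the CMS case, where the version that was
active at the preferred datetime is returned.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical index for one original resource: (URI-M, version datetime).
URI_R = "http://example.com/page"
INDEX = [
    ("http://archive.example.org/2011/page", datetime(2011, 5, 1, tzinfo=UTC)),
    ("http://archive.example.org/2012/page", datetime(2012, 6, 15, tzinfo=UTC)),
    ("http://archive.example.org/2013/page", datetime(2013, 9, 1, tzinfo=UTC)),
]


def timegate(index, preferred):
    """Archive-style: URI-M whose version datetime is closest to `preferred`."""
    uri_m, _ = min(index, key=lambda e: abs(e[1] - preferred))
    return uri_m


def active_at(index, preferred):
    """CMS-style: URI-M of the latest version not after `preferred`, or None."""
    earlier = [e for e in index if e[1] <= preferred]
    return max(earlier, key=lambda e: e[1])[0] if earlier else None


def timemap(uri_r, index):
    """The URI-R plus every (URI-M, version datetime) pair, oldest first."""
    return {"original": uri_r, "mementos": sorted(index, key=lambda e: e[1])}


preferred = datetime(2013, 8, 31, tzinfo=UTC)
closest = timegate(INDEX, preferred)
active = active_at(INDEX, preferred)
listing = timemap(URI_R, INDEX)
```

With this toy index and a preferred datetime of 2013-08-31, `closest`
is the 2013 snapshot (one day away), whereas `active` is the 2012
version, because the 2013 version did not exist yet at that time.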

Some resources/servers express a preference for a certain archive. For
example, DBpedia provides an HTTP Link pointing at the DBpedia
Archive. Most resources don't, in which case a Memento client will
decide itself which archive to interact with. The Memento extension
for Chrome, which will be pre-released today, allows a user to set a
preference for a default web archive.
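
A client-side sketch of that decision could look as follows. It
assumes the server's preference is expressed as an HTTP Link header
with a "timegate" relation type, as Memento does; the small parser is
deliberately simplified, and the function names and all URIs
(including the default TimeGate) are hypothetical.

```python
import re

# One link-value: <uri> followed by ;name=value parameters (simplified).
LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)'
)
REL = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s";,]+))', re.IGNORECASE)


def find_timegate(link_header):
    """Return the first link target whose rel includes 'timegate', else None."""
    for uri, params in LINK_VALUE.findall(link_header or ""):
        rel = REL.search(params)
        if rel is None:
            continue
        value = rel.group(1) if rel.group(1) is not None else rel.group(2)
        if "timegate" in value.lower().split():
            return uri
    return None


def choose_timegate(link_header, user_default):
    """Prefer the server's TimeGate; otherwise use the user's default."""
    return find_timegate(link_header) or user_default


# Hypothetical values for illustration.
DEFAULT_TIMEGATE = "http://default-archive.example.org/timegate/"
header = (
    '<http://example.com/page>; rel="original", '
    '<http://archive.example.org/timegate/http://example.com/page>; rel="timegate"'
)
chosen = choose_timegate(header, DEFAULT_TIMEGATE)
fallback = choose_timegate(None, DEFAULT_TIMEGATE)
```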

Admittedly, these sources of prior resource versions do not cover all
prior versions of all resources. But there's a significant body of
prior resource versions out there. For example, the Internet Archive
is said to currently contain 335 billion archived web resources. To
put it differently, there's a significant body of URIs out there for
which machine-actionable temporal information added to a link, as
proposed in the document I shared, would be useful rather than
useless. Hence, it would be nice to see a discussion that is more
about that aspect of the reference rot problem that is addressed in
the document I shared, and less about those aspects that the document
has no proposal for and for which it relies on ongoing international
efforts pertaining to web archiving.

Cheers

Herbert

-- 
Herbert Van de Sompel
Digital Library Research & Prototyping
Los Alamos National Laboratory, Research Library
http://public.lanl.gov/herbertv/

==

Received on Monday, 30 September 2013 18:05:28 UTC