Re: Persistence from Biplav Srivastava on 2011-11-23 (public-gld-wg@w3.org from November 2011)

From: Biplav Srivastava <sbiplav@in.ibm.com>
Date: Wed, 23 Nov 2011 18:55:00 +0530
To: Sandro Hawke <sandro@w3.org>
Cc: Bernadette Hyland <bhyland@3roundstones.com>, Phil Archer <phila@w3.org>, Public GLD WG <public-gld-wg@w3.org>, Richard Cyganiak <richard@cyganiak.de>, "Ronald P. Reck" <rreck@rrecktek.com>
Message-ID: <OF74440ED4.0E15A158-ON65257951.004649DF-65257951.0049BF8B@in.ibm.com>
Hi,

There is a meta-level issue regarding persistence that I think is either 
assumed or answered somewhere; if so, the answer needs to be referenced. 
I would ask us to consider:
What is the time-frame of persistence? Are we referring to a few years, 
10s of years, 100s of years, or beyond? In the real world, physical 
artifacts have persisted ranging from millions of years to those that 
struggle to last even 5 years. Why would we want to persist a government 
department's homepage that has merged into others? 
An associated issue is the cost of forgetting this information. All 
information is eventually forgotten - what differs is the time scale. 
Persisting a piece of information for a given duration (say 100 years) 
has to be paid for by someone. The analogy to real life is how long a 
burial ground is reserved for a person. Only a few people in history have 
been fortunate enough to have a place that has lasted ~500 years (e.g., 
the Taj Mahal).
As a W3C sub-group, shouldn't we be giving guidance on the persistence 
duration? 
URLs will be the mechanism to record persistence for the web community, 
but we need to know the economics of persistence. 
Regards,
--Biplav
 
Biplav Srivastava, Ph.D., Senior Researcher & IBM Master Inventor, IBM 
Research - India,  
Email: sbiplav@in.ibm.com | Mobile: +91-9810404142 | FAX: +91-11-26138889

From: Sandro Hawke <sandro@w3.org>
To: Bernadette Hyland <bhyland@3roundstones.com>
Cc: Phil Archer <phila@w3.org>, Richard Cyganiak <richard@cyganiak.de>, "Ronald P. Reck" <rreck@rrecktek.com>, Public GLD WG <public-gld-wg@w3.org>
Date: 11/23/2011 07:42 AM
Subject: Re: Persistence



On Tue, 2011-11-22 at 14:12 -0500, Bernadette Hyland wrote:
> Hi,
> Let's please remember our mantra, "re-use, re-use, re-use" and "Open
> Source for Open Government Data." 

Please remember that the W3C community includes vendors of proprietary
software.  We are happy to see both open source and closed source
implementations of our Recommendations. 

> On Nov 22, 2011, at 10:12 AM, Phil Archer wrote:
> 
> > This makes a lot of sense.
> > 
> > Who would be that will executor? In the case of public sector
> > websites, presumably the relevant national archive? Is there a
> > business model here I wonder ;-)
> > 
> 
> 
> Yes, it's called a non-profit like OCLC, which, for better or for worse,
> supports the worldwide library community through the purls.org domain.
> A look at OCLC's website is instructive.  The library community has
> wrestled with & solved many of the issues the LD/LOD community raises.
> See below.

Purls are a reasonable way to avoid having to pay domain-name
registration fees, front-end hosting fees, and having to manage a
front-end, but I don't think they solve the big problems here.

I don't see anything on the purl.org site suggesting they are willing or
able to take on the executor role, or even to work with someone who
would take on this role.    I expect they could be persuaded to do the
latter, and I'd hope they might do the former, at least for the library
community.    (I'm happy they've adopted 303 redirects; it took quite a
few years.)

I'm not quite sure who could best fill the executor role.  It
should be an institution that is more stable and trusted than the
day-to-day organization, and which is motivated to maintain public
trust.

I have a mental image from old fiction of the lawyer who spends 20 years
hunting down the heirs to some estate before delivering the inheritance.
I don't know how that actually works, though, or how to get a lawyer to
do that. 

Some ideas:

  - very large business service companies, eg IBM.  The money to pay for
the service should perhaps be put in a separate annuity account, so
providing the service remains eternally funded, and that business unit
can be sold off, but couldn't easily go bankrupt.
  - a private foundation with an endowment and some alignment of goals
(suggestions?   I don't know this space very well)
  - a stable, non-controversial government agency (eg National Archives,
as you say; in the US, perhaps the Library of Congress)
  - an international treaty organization (UN, WTO, ITU)
  - Internet organizations, like W3C, ICANN, ISoc
  - ...?

I don't think we should select these -- just describe what they should
do, and maybe in the directory we can have a few organizations willing
to do it.

> 
> > As for top level domains, some are more politically acceptable than
> > others of course. Perversely perhaps, it seems that a vocabulary
> > hosted on example.eu, example.us or example.gov.uk might face more
> > resistance to uptake than example.ie or example.ly, especially if it
> > spelled out a nice word like semantical.ly (which appears to be
> > available btw).
> 
> Are you kidding?? ly = Libya.  Anyone who approves that for government
> use should have their badge taken away.  Seriously?!

He did say "perversely".   I take the point to be that folks will not
always be rational about making decisions like that.   "id.gov.uk" looks
like a national thing, while domain hacks like "semantical.ly" or
"repli.ca", for better or worse, to the lay person, do not.

> > 
> > What we're talking about is maintaining a set of URIs for the long
> > term for vocabularies. For documents and Web content in general, an
> > archivist might take a different view. Britain's National Archives
> > can, legitimately, say that, for example, the Bercow Report of July
> > 2008 is still publicly available online. It's at:
> > 
> > 
> > http://webarchive.nationalarchives.gov.uk/20080528125538/http://www.dcsf.gov.uk/bercowreview/docs/7771-DCSF-BERCOW.PDF
> > 
> > The issue though is that it used to be at
> > http://www.dcsf.gov.uk/bercowreview/docs/7771-DCSF-BERCOW.PDF
> > 
> 
> 
> This example is *precisely* the case for implementing a persistent
> identifier solution (also called a permanent URL architecture). 

I don't think a PURL architecture helps.   Let's look at why the
dcsf.gov.uk URL no longer works: 

1.  The department changed its name.  It wanted to be known as
"education" not "dcsf".   So if its purls had "dcsf" in them, the
department would still want/need them changed.

2.  The department didn't plan for the future and/or doesn't care about
the past.  They still have the name "www.dcsf.gov.uk", and it's set up
to redirect.  Why aren't they redirecting the document Phil mentions?
They didn't think enough, before, to make it easy to maintain, and they
don't care enough now, to bother to maintain it.

As an organization, the W3C administers w3.org with some care about this.
Mostly, we have an institutional commitment to keep old URLs functioning.
Knowing we'll be doing that, it's unlikely we would ever make a
top-level directory like "bercowreview".

The actual policy is that any staff member can claim a name like
"http://www.w3.org/2011/11/foo", but to get "http://www.w3.org/2011/bar"
or especially "http://www.w3.org/baz" requires top-level management
approval.  To get outside of "date space" (as with "baz"), you have to
convince management that for the rest of time, that will be a sensible
name for resources in that space.    After 11 years at W3C, I finally
got my first of those, with /egov.    (And I might be buying headaches
for us down the road, but I argued convincingly that having 2007 in the
URL made people think the content was old.)   News organizations and
some blog and CMS software seem to have mostly figured this out, now.
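
As a rough illustration of that rule, here is a minimal Python sketch.
The regular expression and the function name are assumptions made for
illustration only; this is not a description of any tooling W3C
actually runs.  It treats "date space" as /YYYY/MM/<name> and flags
everything else as needing management approval:

```python
import re

# Assumed shape of "date space": /YYYY/MM/<name>, as in /2011/11/foo.
DATE_SPACE = re.compile(r"^/\d{4}/\d{2}/[^/]+")


def needs_management_approval(path):
    """Return True when a proposed path falls outside /YYYY/MM/ date space."""
    return DATE_SPACE.match(path) is None


if __name__ == "__main__":
    for candidate in ("/2011/11/foo", "/2011/bar", "/baz", "/egov"):
        print(candidate, needs_management_approval(candidate))
```

Under these assumptions only "/2011/11/foo" passes without approval;
"/2011/bar", "/baz" and "/egov" are all flagged.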

> PURLs is one such Open Source project that is used extensively by the
> worldwide library community, the US Government Printing Office through
> the US Federal Depository Library Program (for which 3 Round
> Stones provides commercial support), National Center for Biomedical
> Ontology, Shared Names, among others.
> 
> 
> > and if anyone had linked to the original URI then someone following
> > > that link would see a short HTML page explaining that the dcsf.gov.uk
> > site is no longer in operation, where the current live version is,
> > and where the archive is. That's a very basic message for humans and
> > no message at all for machines.
> > 
> > Hmmm... Given that the original URI of the doc is preserved within
> > the new one, it shouldn't be too hard to come up with a script that
> > automatically gave a 301 redirect *if* the target gave a sensible
> > > 200 response and a helpful message in case the target led to a 404?
> > 
> > Phil.
> > 
> 
> 
> Hang on, there is no need to write a script or recreate the wheel
> here. 

Surely the purl.org software is massive overkill.  Phil is talking about
a few lines in an apache config. 

*IF* the DCSF had used date-space exclusively, they could do all the
redirects like:

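# Send every date-space path for the retired years to the archive snapshot.
RewriteEngine On
RewriteRule ^/?((2007|2008|2009|2010).*)$ http://webarchive.nationalarchives.gov.uk/20080528125538/http://www.dcsf.gov.uk/$1 [R=301,L]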


(for whatever years they no longer have active.)

If they didn't use date-space, they'd need to list all the resources
moved over to the web archive, like this:

# One rule per archived resource; the dot in the filename is escaped.
RewriteRule ^/?bercowreview/docs/7771-DCSF-BERCOW\.PDF$ http://webarchive.nationalarchives.gov.uk/20080528125538/http://www.dcsf.gov.uk/bercowreview/docs/7771-DCSF-BERCOW.PDF [R=301,L]


That's more of a hassle, but it still works.   It could also be done
with a tiny CGI script that consults a database of all the archived
URLs.
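
A minimal sketch of such a CGI script, in Python, might look like the
following.  It is illustrative only: the in-memory set stands in for
the "database", the names ARCHIVED_PATHS, resolve and cgi_response are
made up for this example, and it assumes the server is configured to
hand the originally requested path to the script in PATH_INFO.  The
snapshot prefix is copied from the archive URL quoted above.

```python
#!/usr/bin/env python3
"""Sketch of a tiny CGI redirector backed by a list of archived URLs."""
import os
import sys

# Snapshot prefix taken from the archive URL quoted in this thread.
ARCHIVE_PREFIX = "http://webarchive.nationalarchives.gov.uk/20080528125538/"
OLD_SITE = "http://www.dcsf.gov.uk"

# Hypothetical stand-in for the database of archived resources.
ARCHIVED_PATHS = {
    "/bercowreview/docs/7771-DCSF-BERCOW.PDF",
}


def resolve(path, archived=ARCHIVED_PATHS):
    """Return (status, location); location is None when nothing is archived."""
    if path in archived:
        return 301, ARCHIVE_PREFIX + OLD_SITE + path
    return 404, None


def cgi_response(path, archived=ARCHIVED_PATHS):
    """Build the CGI response text (headers, blank line, body) for a path."""
    status, location = resolve(path, archived)
    if status == 301:
        return "Status: 301 Moved Permanently\r\nLocation: %s\r\n\r\n" % location
    body = (
        "This site is no longer in operation, and %s is not in the list "
        "of archived pages.\n" % path
    )
    return "Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    # A CGI server passes the requested path in the PATH_INFO variable.
    sys.stdout.write(cgi_response(os.environ.get("PATH_INFO", "/")))
```

Keeping the lookup in resolve() separate from the response formatting
means the set could later be swapped for a real database query without
touching the redirect logic.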

> For permanent URLs that transcend changing infrastructure, I urge
> using the modern PURLs server, which has been running in production
> for 3+ years as OCLC's PURLs service and 2+ years for the US Government
> Printing Office.  The predecessor to the modern PURLs server was in
> production for 12 years.  This is a tested & proven solution to
> permanent URL architecture.
> 
> 
> The Open Source PURLs server is a web-scale, production application
> with an easy to use interface (a bookmarklet for creating PURLs), and
> nice reporting capabilities for maintenance.
> 
> 
> PURLs is based on HTTP and URI specs from the IETF.  Recently we've
> thrown in some TAG decisions and W3C Best Practices for use with RDF
> and Linked Data (303 support).
> 
> 
> Check out the PURLs site & if you have further questions about
> production deployments, I'm happy to respond to them.  Re-use, re-use,
> re-use.

I have nothing against PURLs for folks who, for one reason or another,
can't realistically maintain their own web space on their own domain
names, but I don't really see it helping when:

   - organizations need their name in their URLs and sometimes
     need to change their name
   - organizations don't plan their URL space to allow for decades
     of accumulation
   - organizations disappear, or change their mission so much they
     will ignore the users they once pledged to support

These are the things I'm trying to address.  I think extra domain names
and living wills are a pretty good (if unproven) answer.

   -- Sandro
> 
> Cheers,
> 
> 
> Bernadette Hyland
> co-chair W3C Government Linked Data Working Group
> charter http://www.w3.org/2011/gld/charter
> 
> 
> > 
> > On 22/11/2011 14:30, Richard Cyganiak wrote:
> > > On 17 Nov 2011, at 19:26, Sandro Hawke wrote:
> > > > My strawman proposal would be:
> > > > 
> > > > > - vocabularies should be given their own domain name, probably
> > > > > in .net (they are infrastructure).   this way full ownership as
> > > > > well as maintenance duties can be transferred, legally, as
> > > > > necessary.
> > > 
> > > > +1. Getting its own domain for the vocabulary also helps keep
> > > > the URIs short.
> > > 
> > > On the other hand, using something like purl.org also seems
> > > reasonable.
> > > 
> > > I'm agnostic regarding the top-level domain. I note that the .net
> > > TLD isn't terribly popular and I can't think of many current
> > > examples of vocabularies in the .net namespace.
> > > 
> > > > > - there should be a two-level ownership structure, where one
> > > > > disinterested, trusted, 3rd party (like the executor of a will)
> > > > > retains final control, but delegates to the creator/maintainer.
> > > > > With written policies about what happens in various
> > > > > eventualities.   But, basically, if either of these parties
> > > > > loses interest, they can be smoothly replaced, and if the
> > > > > creator/maintainer ceases operation or stops acting in good
> > > > > faith, it can be replaced.
> > > 
> > > Again, +1.
> > > 
> > > Best,
> > > Richard
> > > 
> > 
> > -- 
> > 
> > 
> > Phil Archer
> > W3C eGovernment
> > http://www.w3.org/egov/
> > 
> > http://philarcher.org
> > +44 (0)7887 767755
> > @philarcher1
> > 
> > 
> 
>
Received on Wednesday, 23 November 2011 13:56:38 UTC