Re: RDF Update Feeds + URI time travel on HTTP-level

Hi -

At Mon, 23 Nov 2009 21:02:37 -0700,
Herbert Van de Sompel wrote:
> 
> On Nov 23, 2009, at 4:59 PM, Erik Hetzner wrote:
> > I think this is a very compelling argument.
> 
> Actually, I don't think it is. The issue was also brought up (in a
> significantly more tentative manner) in Pete Johnston's blog entry on
> eFoundations
> (http://efoundations.typepad.com/efoundations/2009/11/memento-and-negotiating-on-time.html
> ). Tomorrow, we will post a response that will try and show that the
> "current state" issue is - as far as we can see - not quite as
> "written in stone" as suggested above in the specs that matter in
> this case, i.e. Architecture of the World Wide Web and RFC 2616.
> Both are interestingly vague about this.

Thanks for your response, both in email and in your posted document
[1]. As with your paper, I have read it & found a lot to consider
and a lot that I agree with.

> > On the other hand, there is nothing I can see that prevents one
> > URI from representing another URI as it changes through time. This
> > is already the case with, e.g.,
> > <http://web.archive.org/web/*/http://example.org>, which
> > represents the URI <http://example.org> at all times. So this URI
> > could, perhaps, be a target for X-Accept-Datetime headers.
> 
> That is actually what we do in Memento (see our paper
> http://arxiv.org/abs/0911.1112), and we recognize two cases here:
> 
> (1) If the web server does not keep track of its own archival
> versions, then we must rely on archival versions that are stored
> elsewhere, i.e. in Web Archives. In this case, the original server
> that receives the request can redirect the client to a resource like
> the one you mention above, i.e. a resource that stands for archived
> versions of another resource. Note that this redirect is a simple
> redirect like the ones that happen all the time on the Web. This is
> not a redirect that is part of a datetime content negotiation flow,
> but rather a redirect that occurs because the server has detected an X-
> Accept-Datetime header. Now, we don't want to overload the existing
> <http://web.archive.org/web/*/http://example.org> as you suggest,
> but rather choose to introduce a special-purpose resource that we
> call a TimeGate
> <http://web.archive.org/web/timegate/http://example.org>. And we
> indeed introduce this resource as a target for datetime content
> negotiation.
> 
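
To make sure I follow case (1), here is a rough sketch of the
client-side flow in Python. The X-Accept-Datetime header and the
TimeGate URI pattern are taken from your paper; the datetime, status
codes, and Location values in the comments are invented for
illustration.

  import http.client

  ACCEPT_DT = "Mon, 12 Oct 2009 00:00:00 GMT"

  # Step 1: request the original resource, stating the desired datetime.
  conn = http.client.HTTPConnection("example.org")
  conn.request("GET", "/", headers={"X-Accept-Datetime": ACCEPT_DT})
  resp = conn.getresponse()
  # Hypothetical response:
  #   302 Found
  #   Location: http://web.archive.org/web/timegate/http://example.org

  # Step 2: follow the redirect to the TimeGate, which performs the
  # actual datetime negotiation and redirects once more, to the
  # archived version closest to the requested datetime.
  conn2 = http.client.HTTPConnection("web.archive.org")
  conn2.request("GET", "/web/timegate/http://example.org",
                headers={"X-Accept-Datetime": ACCEPT_DT})
  resp2 = conn2.getresponse()
  # Hypothetical response:
  #   302 Found
  #   Location: http://web.archive.org/web/20091012000000/http://example.org
  print(resp.status, resp.getheader("Location"))
  print(resp2.status, resp2.getheader("Location"))
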
> (2) If the web server does keep track of its own archival versions
> (think CMS), then it can handle requests for old versions "locally"
> as it has all the information that is required to do so. In this
> case, we could also introduce a special-purpose, distinct, TimeGate
> on this server, and have the original resource redirect to it. That
> would make this case in essence the same as (1) above. This,
> however, seemed like a bit of overkill, and we felt that the original
> resource and the TimeGate could coincide, meaning that datetime content
> negotiation occurs directly against the original resource: the URI
> that represents the resource as it evolves over time is the URI of
> the resource itself. It stands for past and present versions.
> The present version is delivered (200 OK) from that URI itself
> (business as usual); archived versions are delivered from other
> resources via content negotiation (302 with Location different than
> the original URI).
> 
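
And if I understand case (2), the negotiation against the original
resource itself would look roughly like this. Again, only the header
name comes from your paper; the host, paths, and responses are
invented for illustration.

  import http.client

  conn = http.client.HTTPConnection("cms.example.org")

  # Without X-Accept-Datetime: business as usual, the current version
  # is delivered (200 OK) from the original URI itself.
  conn.request("GET", "/page")
  resp = conn.getresponse()
  resp.read()  # drain the body so the connection can be reused
  print(resp.status)  # hypothetically: 200

  # With X-Accept-Datetime: the original resource acts as its own
  # TimeGate and redirects to the resource that delivers the archived
  # version.
  conn.request("GET", "/page",
               headers={"X-Accept-Datetime": "Mon, 12 Oct 2009 00:00:00 GMT"})
  resp = conn.getresponse()
  # Hypothetical response:
  #   302 Found
  #   Location: http://cms.example.org/20091012/page
  print(resp.status, resp.getheader("Location"))
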
> In both (1) and (2) the original resource plays a role in the
> framework, either because it redirects to an external TimeGate that  
> performs the datetime content negotiation, or because it performs the  
> datetime content negotiation itself. And we actually think it is
> quite essential that this original resource is involved. It is the URI
> of the original resource by which the resource has been known as it  
> evolved over time. It makes sense to be able to use that URI to try  
> and get to its past versions. And by "get", I don't mean search for  
> it, but rather use the network to get there. After all, we all go by  
> the same name irrespective of the day you talk to us. Or we have the  
> same Linked Data URI irrespective of the day it is dereferenced. Why  
> would we suddenly need a new URI when we want to see what the LoD  
> description for any of us was, say, a year ago? Why should we
> prevent this same URI from helping us get to prior versions?

I agree completely that a user should be able to discover - if
possible - archived web content, either on the original
server or in a web archive, for a given URI, starting from a point of
knowing only the original URI. I know that the UK National Archives
have built a system for redirecting 404s on UK government web sites to
their web archive [2], and I was very happy to see your work
attempting to standardize something similar, although more general.

As an aside, which may or may not be related to Memento, do you think
there is a useful distinction to be made between web archives which
preserve the actual bytestream of an HTTP response made at a certain
time (e.g., the Internet Archive) and CMSs that preserve the general
content, but allow headers, advertisements, and so on to change (e.g.,
Wikipedia)?

To see what I mean, visit:

http://en.wikipedia.org/w/index.php?title=World_Wide_Web&oldid=9419736

and then:

http://web.archive.org/web/20050213030130/en.wikipedia.org/wiki/World_Wide_Web

I am not sure what the relationship is between these two resources.

> > There is something else that I find problematic about the Memento
> > proposal. Archival versions of a web page are too important to hide
> > inside HTTP headers.
> >
> > To take the canonical example, if I am viewing
> > <http://oakland.example.org/weather>, I don’t want the fact that I am
> > viewing historical weather information to be hidden in the request
> > headers.
> 
> It is not. The _request_ for prior versions is in a request header.
> The response will come from a URI different than
> <http://oakland.example.org/weather>, e.g.
> <http://oakland.example.org/20091012/weather> or
> <http://web.archive.org/web/20091012/http://oakland.example.org/weather>
> and there will be a response header provided by the server that
> delivers this response (X-Archive-Interval) that informs the client
> unambiguously that the response _is_ an archived version. This info
> can be leveraged by the client to give the archived version the
> position of first-class citizen it deserves.
> 
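
If that is right, then in client code, detecting that one has landed
on an archived version reduces to checking for that response header.
A sketch, assuming the X-Archive-Interval header you describe;
urllib follows the 302 for us, and the datetime is again invented:

  import urllib.request

  req = urllib.request.Request(
      "http://oakland.example.org/weather",
      headers={"X-Accept-Datetime": "Mon, 12 Oct 2009 00:00:00 GMT"})
  with urllib.request.urlopen(req) as resp:  # transparently follows the 302
      if resp.headers.get("X-Archive-Interval") is not None:
          # The delivering server marked this as an archived version;
          # a client could now surface that fact prominently in its UI.
          print("archived version, delivered from", resp.geturl())
      else:
          print("current version, delivered from", resp.geturl())
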
> > Furthermore, if I am viewing resource X as it appeared at time T1, I
> > should *not* be able to copy that URI and send it to a friend, or use
> > it as a reference in a document, only to have them see the URI as it
> > appears at time T2.
> 
> You will not. You would copy the URI
> <http://oakland.example.org/20091012/weather> or
> <http://web.archive.org/web/20091012/http://oakland.example.org/weather>.
> I think the misconception in this discussion is that the archived
> version is _delivered_ by the original URI. It is not. The archived
> version is _requested_ via the original URI, and it is _delivered_
> by a resource at another URI. As is the case with all content
> negotiation.

Thanks for clarifying the intent here, which I clearly misunderstood.

My confusion on this issue stems, I believe, from a longstanding
confusion that I have had with the 302 Found response.

My understanding of 302 Found has always been that, if I visit R and
receive a 302 Found with Location R', my browser should continue to
consider R the canonical version and use it for all further requests.
If I bookmark after having been redirected to R', it is in fact R
which should be bookmarked, and not R'. If I use my browser to send
that link to a friend, my browser should send R, not R'. I believe
that this is the meaning given to 302 Found in [3].

I am aware that browsers do not implement what I consider to be the
correct behavior here, but it is the way that I understand the
definition of 302 Found.
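
For what it is worth, off-the-shelf HTTP clients behave the way
browsers do. For example, Python's urllib silently follows the 302
and thereafter reports R' as the address of what it fetched (R here
stands for any URI that 302-redirects):

  import urllib.request

  R = "http://example.org/r"  # hypothetical: responds 302, Location: R'
  with urllib.request.urlopen(R) as resp:  # the 302 is followed silently
      print("requested:      ", R)
      print("delivered from: ", resp.geturl())  # prints R', not R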

Perhaps somebody could help me out by clarifying this for me?

> > I think that those of us in the web archiving community [1] would
> > very much appreciate a serious look by the web architecture
> > community into the problem of web archiving. The problem of
> > representing and resolving the tuple <URI, time> is a question
> > which has not yet been adequately dealt with.
> 
> I hope that with Memento we have provided a significant contribution
> towards addressing that question. I think our paper at
> http://arxiv.org/abs/0911.1112 describes the proposed solution in
> quite some detail, and addresses many of the concerns raised
> in the discussion on this list so far. And, as indicated before,
> there's also the slides in case there is not enough time to read the
> paper
> (http://www.slideshare.net/hvdsomp/memento-time-travel-for-the-web
> ).

I agree that you have provided a significant contribution, and I quite
enjoyed reading your paper. My apologies if anything else was implied.

best,
Erik Hetzner

1. http://www.cs.odu.edu/~mln/memento/response-2009-11-24.html
2. http://www.nationalarchives.gov.uk/webcontinuity/
3. http://www.w3.org/DesignIssues/UserAgent.html
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
