- From: Erik Hetzner <erik.hetzner@ucop.edu>
- Date: Thu, 26 Nov 2009 11:11:02 -0800
- To: public-lod@w3.org
- Cc: Michael Nelson <mln@cs.odu.edu>, Herbert Van de Sompel <hvdsomp@gmail.com>, Robert Sanderson <azaroth42@gmail.com>
- Message-ID: <P-IRC-EXBE01sBk31TZ0000359a@EX.UCOP.EDU>
At Wed, 25 Nov 2009 00:21:04 -0500, Michael Nelson wrote:
> Hi Erik,
>
> Thanks for your response. I'm just going to cherry pick a few bits from
> it:
>
> > As an aside, which may or may not be related to Memento, do you think
> > there is a useful distinction to be made between web archives which
> > preserve the actual bytestream of an HTTP response made at a certain
> > time (e.g., the Internet Archive) and CMSs that preserve the general
> > content, but allow headers, advertisements, and so on to change (e.g.,
> > Wikipedia).
> >
> > To see what I mean, visit:
> >
> > http://en.wikipedia.org/w/index.php?title=World_Wide_Web&oldid=9419736
> >
> > and then:
> >
> > http://web.archive.org/web/20050213030130/en.wikipedia.org/wiki/World_Wide_Web
> >
> > I am not sure what the relationship is between these two resources.
>
> I'm not 100% sure either. I think this is a difficult problem in web
> archiving in general. The wikipedia link with current content substituted
> is not exactly the 2005 version, but the IA version isn't really what a
> user would have seen in 2005 either (at least in terms of presentation).
>
> And:
>
> http://web.archive.org/web/20080103014411/http://www.cnn.com/
>
> for example gives me at least a pop-up ad that is relative to today, not
> Jan 2008 (there may be better examples where "today's" content is
> in-lined, but the point remains the same).

I can’t find the popup, but the point is well taken. The problem of what I
call ‘breaking out’ of archived web content is a very real one when
archived web sites are displayed without browser support, using URI
‘rewriting’ and other tricks. The possibility of coming up with a solution
for this problem is one reason why I am very excited about this
discussion.

Still, I think the intention of IA is different from that of Wikipedia’s
previous versions.
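(To make the ‘breaking out’ problem concrete: archived pages still contain links to the live web, so replay tools rewrite them to stay inside the archive. Below is a toy sketch of that rewriting, using a hypothetical snapshot prefix and a regex over href/src attributes only; a real replay tool must also handle relative URIs, CSS url(...) references, and JavaScript-generated links — which is exactly where breakout tends to happen.)

```python
import re

# Hypothetical archive snapshot prefix, for illustration only.
ARCHIVE_PREFIX = "http://web.archive.org/web/20050213030130/"

def rewrite_links(html: str) -> str:
    # Rewrite absolute http:// links in href/src attributes so they
    # resolve inside the archive instead of "breaking out" to the
    # live web.
    return re.sub(
        r'(href|src)="(http://[^"]+)"',
        lambda m: f'{m.group(1)}="{ARCHIVE_PREFIX}{m.group(2)}"',
        html,
    )

page = '<a href="http://en.wikipedia.org/wiki/World_Wide_Web">WWW</a>'
print(rewrite_links(page))
```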
IA attempts to capture and replay the web exactly as it was, while
Wikipedia presents its essential content in the same way while surrounding
it with the latest tools. While either approach would be helpful to
somebody researching the history of a Wikipedia article or to somebody
looking for a previous version, only IA’s approach preserves the
advertisements, etc., which can be very helpful for researchers.

There is the further issue that IA’s copy is held by a third party and is
in some ways more trustworthy. Whether sites can generally be trusted to
maintain accurate archives of their own content is a question that has
already been answered, in my opinion. (The answer is: they can’t.) See,
e.g., [1].

> As an aside, the Zoetrope (http://doi.acm.org/10.1145/1498759.1498837)
> took an entirely different approach to this problem in their archives
> (see pp. 246-247). They basically took DOM dumps from the client and
> saved them, rather than a crawler-based URI approach.

Thanks for the pointer.

> > My confusion on this issue stems, I believe, from a longstanding
> > confusion that I have had with the 302 Found response.
> >
> > My understanding of 302 Found has always been that, if I visit R and
> > receive a 302 Found with Location R', my browser should continue to
> > consider R the canonical version and use it for all further requests.
> > If I bookmark R' after having been redirected to R, it is in fact R
> > which should be bookmarked, and not R'. If I use my browser to send
> > that link to a friend, my browser should send R, not R'. I believe
> > that this is the meaning given to 302 Found in [3].
> >
> > I am aware that browsers do not implement what I consider to be the
> > correct behavior here, but it is the way that I understand the
> > definition of 302 Found.
> >
> > Perhaps somebody could help me out by clarifying this for me?
>
> Firefox will attempt to do the right thing, but it depends on the client
> maintaining state about the original URI.
> If you dereference R, then get 302'd to R', a reload in Firefox will be
> on R and not R'.

I hadn’t noticed this before, thank you for pointing it out.

> Obviously, if you email or share or probably even bookmark R', then this
> client-side state will be lost and 3rd party reloads will be relative to
> R' (in fact, that might be what you *want* to occur). But at least within
> a session, Firefox (and possibly other browsers) will reload wrt to the
> original URI.
>
> Although it is not explicit in the current paper or presentation, we're
> planning on some method for having R' "point" back to R to facilitate
> Memento-aware clients to know the original URI. We're not sure
> syntactically how it should be done (a value in the "Alternates" response
> header maybe?), but semantically we want R' to point to R. This

I think your email got cut off there.

In any case, in the context of actual existing implementations of 302, I
think Memento is doing the correct thing. That is, redirection from R to
the appropriate content (R') based on conneg makes sense to me, for
Memento, if what the user can bookmark and see is the conneg’ed URI (R').

My belief (see [2] and especially [3]) is that properly behaving clients
should bookmark R, not R'. I think that this could be problematic for
Memento, because the X-Accept-DateTime header could be lost with the
bookmarking, as I mentioned in my previous message. But I think I may be
beating a dead horse, because obviously clients in the real world behave
by bookmarking and displaying R', not R.

best,
Erik Hetzner

1. <http://www.clinecenter.uiuc.edu/airbrushing_history/>
2. <http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html>
3. <http://www.w3.org/DesignIssues/UserAgent.html>
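(A self-contained demonstration of the R vs R' point above: a toy local server 302s the generic URI "/R" to the specific URI "/R-prime" — both names hypothetical, and this is not Memento's actual implementation — and the client ends up holding R', not R, unless it tracks the original URI itself.)

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/R":
            # 302 Found: redirect the generic URI R to a specific R'.
            self.send_response(302)
            self.send_header("Location", "/R-prime")
            self.end_headers()
        else:
            body = b"archived representation"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/R")
final_uri = resp.geturl()  # the URI the client ends up holding
body = resp.read()
server.shutdown()

# urllib, like most clients, follows the 302 transparently: the URI it
# reports is R', not R. Anything bookmarked or shared from here on is R',
# and any request context (such as an X-Accept-DateTime header) is gone.
print(final_uri)
```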
Received on Thursday, 26 November 2009 19:12:09 UTC