Re: Some data related to the frequency of cache-busting from Shel Kaphan on 1996-12-04 (ietf-http-wg@w3.org from October to December 1996)

From: Shel Kaphan <sjk@amazon.com>
Date: Tue, 3 Dec 1996 22:34:43 -0800 (PST)
To: Jeffrey Mogul <mogul@pa.dec.com>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <199612040634.WAA31269@anaconda.amazon.com>

Jeffrey Mogul writes:
 >     There's another category of cache-busting that you did not mention in
 >     the statistics you reported.  This is the use of unique URL
 >     components, which may be "once-only" URLs, or are at least unique for
 >     a single user.
 > 
 > Right you are.  I should have been more explicit in the title of
 > my message, and I didn't explain it clearly enough in the body
 > of the message, but this analysis was only aimed at finding instances
 > of cache-busting that might easily be avoided through use of our
 > hit-metering proposal.  I thought it would be more realistic to
 > look for cache-busting that is done without using the unique-URL
 > technique. 
 > 

Yes, sure.  You'd have to resort to unreliable heuristic techniques to
pick out such URLs.  In fact, you're likely to have already considered
them in one of your other categories, since they are more likely to
show up as invocations of CGI programs and the like, rather than
static ".html" URLs -- *something* on the server end has to interpret
or strip off the unique part of the URL.  Unless the HTTP server
itself has been hacked, it will be a CGI program or the moral
equivalent.

 > It's not clear to me whether the users of once-only URLs would
 > switch to a more cache-friendly approach if our hit-metering
 > proposal were available.  (Clearly, anyone that requires
 > cache-busting to provide usable results in the face of broken
 > history mechanisms is not going to switch, at least not until
 > virtually all browsers have fixed their history support.)  I
 > therefore assumed that none of the once-only URLs would be
 > amenable to hit-metering, and so I did not try to include these
 > URLs in my category of "possibly cache-busted responses."
 > 

They're mainly not amenable to hit metering because it's impossible to
algorithmically determine the "equivalence class" of once-only URLs --
all the superficially distinct URLs that fetch "the same" resource
look like different URLs.  Anyway, I'd guess that the
overwhelming majority of servers that use unique URLs do so
for semantic reasons rather than explicitly for cache-busting.
One question that must be asked about this:  is this technique
prevalent enough to be worth worrying much about?  I see it a lot, but
then, I pay attention to sites that do stuff like this.
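
To make the "equivalence class" problem concrete, here is a minimal
sketch.  The URL layout and the "tok=" component are invented for
illustration and are not taken from any real site.  Counting by raw URL
sees three unrelated resources; grouping them requires a site-specific
rule that a cache or proxy has no general way to discover.

```python
import re
from collections import Counter

# Hypothetical once-only URLs that all fetch "the same" resource.
requests = [
    "/book/0679720200/tok=a81f",
    "/book/0679720200/tok=77c2",
    "/book/0679720200/tok=e310",
]

# What a cache (or a hit-meter) keyed on the URL sees: three distinct
# resources, each requested exactly once.
per_url = Counter(requests)

# A site-specific heuristic (an assumption, only valid for this made-up
# URL layout) that strips the unique component.
TOKEN = re.compile(r"/tok=[0-9a-f]+$")

def canonical(url):
    """Map a once-only URL to its underlying resource (hypothetical rule)."""
    return TOKEN.sub("", url)

per_resource = Counter(canonical(u) for u in requests)
```

The rule in canonical() only works because the token format is known in
advance; on another server the unique part could sit anywhere in the
path, which is why such heuristics are unreliable.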

 > On the other hand, it's not clear that I could have identified them
 > from their names.  If they were pre-expired or had no last-modified
 > date, and they did not match my CGI filter, I would have included
 > them in my category of "possibly cache-busted responses" by mistake.
 > 
But that "mistake" is actually OK, right?
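
The classification rule as I read it from the quoted paragraph can be
sketched roughly as follows.  The Response record, the epoch-second
timestamps, and the looks_like_cgi() test are all assumptions made for
illustration; the actual CGI filter used in the analysis is not
described in this message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    """One logged response (hypothetical record layout)."""
    url: str
    date: float                            # Date header, epoch seconds
    expires: Optional[float] = None        # Expires header, if present
    last_modified: Optional[float] = None  # Last-Modified header, if present

def looks_like_cgi(url):
    # Stand-in for the CGI filter; the real filter is not specified.
    return "?" in url or "/cgi-bin/" in url

def possibly_cache_busted(r):
    """Pre-expired or lacking Last-Modified, and not caught by the CGI filter."""
    pre_expired = r.expires is not None and r.expires <= r.date
    no_last_modified = r.last_modified is None
    return (pre_expired or no_last_modified) and not looks_like_cgi(r.url)
```

A once-only URL served without a Last-Modified date and without an
obvious CGI marker, such as "/u/8f3a1c/page.html", would be counted as
possibly cache-busted under this rule, which is the "mistake" in
question.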

 > When I am ready to re-do the analysis, I'll try a version that is
 > limited to URLs for which the trace contains at least two status-200
 > responses.  Presumably this will avoid any once-only URLs, right?

It will avoid true "once-only" URLs, but you still might see
some matches on "per-session" URLs -- ones that track a user through a
session.  These per-session URLs are also fairly pointless to cache in a
shared cache, since they're only relevant to one user, but that user
might ask for the same thing more than once.  Based purely
on anecdotal evidence I think per-session URLs are a lot more common than
true once-only URLs.
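
A rough sketch of the proposed restriction, assuming the trace can be
read as (url, status) pairs; the trace format and the sample URLs are
invented for illustration.  It shows a true once-only URL being
dropped while a per-session URL survives the filter.

```python
from collections import Counter

def urls_with_repeat_200s(trace, min_hits=2):
    """Return URLs with at least min_hits status-200 responses in the trace."""
    counts = Counter(url for url, status in trace if status == 200)
    return {url for url, n in counts.items() if n >= min_hits}

# Hypothetical trace entries.
trace = [
    ("/catalog/index.html", 200),
    ("/catalog/index.html", 200),
    ("/order/once/8f3a1c", 200),       # true once-only URL
    ("/shop/session-42/cart", 200),    # per-session URL, fetched twice
    ("/shop/session-42/cart", 200),    # by the same user
    ("/catalog/index.html", 304),      # not a 200, so not counted
]

repeated = urls_with_repeat_200s(trace)
```

Here the once-only URL is filtered out, but the per-session cart URL
remains in the sample even though only one user ever asked for it.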

 > However, it will decrease the sample size by a large factor, which
 > means that the significance of the results may be weakened.
 > 
 > -Jeff
 > 
 > 

--Shel

Received on Tuesday, 3 December 1996 22:40:44 UTC