Replacement policies

[Luigi Rizzo:]
           "http://www.sine.com/sine?angle=0.01",
           "http://www.sine.com/sine?angle=0.02",
           "http://www.sine.com/sine?angle=0.03",
           ...
    although the replies are small, I certainly wouldn't want to store
    many of them in my cache. And storing just a few of them may be
    completely pointless.

This issue is really about cache replacement policies.  This is not
about how the HTTP protocol ensures that caching works right, but
rather about how a cache with finite storage decides to make the most
effective use of its own resources.

Put simply, the question for a cache is: "when I want to add a new
cache entry and I don't have enough space, which old entry (or entries)
do I remove from the cache?"

The choice of policy here can be made without much specification in
the HTTP protocol, although ideally the cache would make the
replacement choice that maximizes performance according to some
metric.  There are several possible metrics, such as average response
time and worst-case response time, that may not be possible to
optimize simultaneously, and I think the choice of which metric to
optimize has to be left up to the owner of the cache.
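
To make the decision point concrete, here is a rough sketch of the
classic "least recently used" (LRU) answer to that question.  This is
Python, everything about it is simplified, and it is only one possible
policy among many:

    import collections

    class LRUCache:
        """A toy LRU cache: when space runs out, evict the entry that
        was referenced longest ago.  (A real HTTP cache would track
        sizes in bytes, validators, expiration times, and so on.)"""

        def __init__(self, capacity):
            self.capacity = capacity                  # max entries (simplified)
            self.entries = collections.OrderedDict()  # ordered oldest-first

        def get(self, url):
            if url in self.entries:
                self.entries.move_to_end(url)         # mark most recently used
                return self.entries[url]
            return None

        def put(self, url, body):
            if url in self.entries:
                self.entries.move_to_end(url)
                self.entries[url] = body
                return
            # The replacement decision happens here: LRU answers "which
            # old entry do I remove?" with "the one referenced longest ago."
            while len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)      # drop least recently used
            self.entries[url] = body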

Ignoring ALL the other query-related caching issues for a moment,
in the case of the sine server, one could adopt a heuristic that
says "for URLs of the form 'http://host/path?*', I will only
save the 3 most recently used entries."  In other words, the cache
could apply heuristics to the URL even though the URL is supposed to
be opaque.  There is no problem with doing this, because the replacement
policy cannot affect cache correctness; it can only affect performance.
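
A sketch of what such a heuristic might look like, layered on the LRU
sketch above (the '?' test and the limit of 3 are just illustrations):

    def is_query_url(url):
        # A heuristic applied to the (supposedly opaque) URL: anything
        # containing '?' is treated as a query result, like the
        # sine-server replies.
        return '?' in url

    def apply_query_heuristic(cache, new_url, max_query_entries=3):
        """Before inserting new_url, enforce the per-class rule: keep
        at most the 3 most recently used query-style entries.  'cache'
        is the LRUCache above, whose entries are ordered oldest-first."""
        if not is_query_url(new_url):
            return
        query_urls = [u for u in cache.entries if is_query_url(u)]
        while len(query_urls) >= max_query_entries:
            victim = query_urls.pop(0)    # least recently used query URL
            del cache.entries[victim]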

There is a question remaining that affects the HTTP spec: is there
anything that the protocol can do to make the replacement choice
more effective?

Koen Holtman writes:
    I have been thinking some time about the idea of nice servers saying
    
     `this content is cachable, but the chances of it being re-requested
     are very small, so I would not cache it if I were you'
    
    to user agent and proxy caches, to help them make better decisions on
    how to use their scarce disk space.
    
    One way to implement this idea would be to introduce a
    cache-control header
    
      Cache-control: estimated-visits-per-day=N
    
    with N being an estimate by the server of how often the resource is
    visited each day.  Even very rough estimates (say 0.01, 0.1, 1, 10,
    100, or 1000) could help caches a lot.
    
    Estimates could be made by content authors, or semi-automatically by
    access log statistics tools (counting GETs, and, especially,
    conditional GETs).
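
For what it's worth, a cache could pick this directive out of the
Cache-control header with something like the sketch below.  Keep in
mind that estimated-visits-per-day is only Koen's proposal, not part
of the protocol:

    def parse_estimated_visits(cache_control_value):
        """Extract the proposed directive from a Cache-control value,
        e.g. 'max-age=3600, estimated-visits-per-day=0.1'.  Returns a
        float, or None if the directive is absent or malformed."""
        for directive in cache_control_value.split(','):
            name, _, value = directive.strip().partition('=')
            if name == 'estimated-visits-per-day':
                try:
                    return float(value)
                except ValueError:
                    return None   # malformed: ignore it, as caches usually do
        return None

    # e.g. parse_estimated_visits("estimated-visits-per-day=0.01") -> 0.01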

This is a very good suggestion, because it expresses a fairly useful
way to determine what to keep in a cache: the "reference interarrival
time".

Suppose that we had a perfect predictor of the future (an "oracle").
When we (i.e., our cache) are about to decide what to kick out of
the cache, what would we ask the oracle?  We would probably ask it
"of all the entries in my cache, which one is going to be used last?"
That is, which one is going to be re-referenced the longest time from
now?

We might also ask "and how much will it cost me to get it back?",
depending on what metric we want to minimize.  However, if we make
the simplifying assumption that it will cost about the same as it
did the last time we retrieved it, then we don't really need an
oracle to answer this question.

Since we don't have an oracle to predict the future, we usually have
to rely on assuming that the past is a good predictor of the future.
In other words, if we observe a certain interarrival time for past
references to an object, we could predict the next reference time.
Or if we know something about an object in addition to what we could
observe about it, we could still express our prediction in terms
of reference interarrival time.

Koen's "hits per day" metric is just the reciprocal of the arrival
time.  We can discuss whether "hits per day" or "mean interarrival
time in seconds" is the best way to talk about it, but it's the same
thing.
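
In code the conversion is a one-liner, and the oracle question above
becomes a simple scoring rule.  This is a sketch; the visits_per_day
attribute is assumed to come from something like Koen's proposed
directive:

    def mean_interarrival_seconds(visits_per_day):
        # "hits per day" and "mean interarrival time" carry the same
        # information; one is the reciprocal of the other, modulo units.
        if visits_per_day <= 0:
            return float('inf')
        return 86400.0 / visits_per_day

    def choose_victim(entries):
        """Approximate the oracle: evict the entry whose next reference
        is predicted to be furthest in the future, i.e. the one with
        the longest predicted interarrival time."""
        return max(entries,
                   key=lambda e: mean_interarrival_seconds(e.visits_per_day))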

One suggestion I might make is to supply both a mean and a variance.
That is, if you tell me:

	Bus A comes every 20 minutes, plus or minus 1 minute

	Bus B comes every 15 minutes on the average, but is sometimes
		20 minutes late

which bus stop would I wait at?  Similarly for replacement policies:
if the server returns both the mean and the variance, the cache
could compare two entries with roughly similar means to decide which
one is a safer bet.  A high variance might come simply because the
mean has been calculated from a very small number of observations.
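
A sketch of how a cache might fold the variance in.  The penalty
weight k and the bus variances below are numbers I made up for
illustration:

    import math

    def pessimistic_interarrival(mean, variance, k=2.0):
        """Score = mean interarrival time plus k standard deviations.
        Between two entries with similar means, the higher-variance one
        scores worse, making it the safer candidate for eviction."""
        return mean + k * math.sqrt(variance)

    # The bus example, in minutes:
    bus_a = pessimistic_interarrival(20.0, 1.0)    # 22.0: steady, safe bet
    bus_b = pessimistic_interarrival(15.0, 100.0)  # 35.0: better mean, riskier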

What about estimates made by the caches themselves?  After all, if you
have this configuration:

	origin-server ------- cache-A ------- cache-B ------ one client
				+
				+------------ cache-C ------ many clients

then the origin server might only see one hit per day, but
cache-A might see hundreds of hits/day.  It would be nice if cache-A
could tell cache-B not just about the origin-server's estimate,
but about its own estimate.

For example (this is just a quick thought), suppose that the
interarrival time info included the mean, the variance, *and*
the number of observations (or, equivalently, the sum of the
observations, the sum of the squares, and N).  Then as this info
passes through
a cache, the cache can add its own locally-observed samples,
resulting in a single measurement with a larger sample size.
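
A sketch of that bookkeeping (the class name and representation are
just illustrative):

    class InterarrivalStats:
        """Carry (sum, sum of squares, N) for observed interarrival
        times, so that estimates from different observers can be pooled
        into one measurement with a larger sample size."""

        def __init__(self, total=0.0, total_sq=0.0, n=0):
            self.total, self.total_sq, self.n = total, total_sq, n

        def add_sample(self, interarrival):
            self.total += interarrival
            self.total_sq += interarrival * interarrival
            self.n += 1

        def merge(self, other):
            # Pooling is component-wise addition of the three numbers.
            return InterarrivalStats(self.total + other.total,
                                     self.total_sq + other.total_sq,
                                     self.n + other.n)

        def mean(self):
            return self.total / self.n

        def variance(self):
            return self.total_sq / self.n - self.mean() ** 2  # E[X^2] - (E[X])^2

    # cache-A pools its own observations with the origin server's
    # estimate before passing the result along to cache-B:
    #   downstream = origin_stats.merge(cache_a_stats)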

This still leaves a few potential problems.  For example, we might need
to think of a way to "age" the statistical information to deal with
resources whose reference pattern changes dramatically.  For example,
on one day in late January, http://www.superbowl.com might have a very
low interarrival time, and the number of samples (N) could reach many
millions.  A few days later, the daily reference rate drops to a few
hundred, but if you were to compute the mean over all of the samples,
the millions of game-day observations would dominate, and you would
probably cache that page for several months longer than you should.
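
One simple way to do the aging, continuing the sketch above, is to
exponentially decay the three sufficient statistics; the decay factor
here is made up:

    def age_stats(stats, decay=0.5):
        """Periodically (say, once a day) scale down the accumulated
        history.  Scaling all three values by the same factor leaves
        the mean unchanged but makes new samples count for more, so the
        post-Super-Bowl drop shows up within days instead of months."""
        stats.total *= decay
        stats.total_sq *= decay
        stats.n *= decay   # n becomes an "effective" (fractional) count
        return stats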

-Jeff

Received on Monday, 8 January 1996 19:03:55 UTC