- From: Jeffrey Mogul <mogul@pa.dec.com>
- Date: Mon, 08 Jan 96 10:58:16 PST
- To: koen@win.tue.nl (Koen Holtman)
- Cc: http-caching@pa.dec.com
[Luigi Rizzo:]

    "http://www.sine.com/sine?angle=0.01",
    "http://www.sine.com/sine?angle=0.02",
    "http://www.sine.com/sine?angle=0.03", ...

    although the replies are small I certainly wouldn't want to store
    many of them on my cache. And storing just a few of them may be
    completely pointless.

This issue is really about cache replacement policies. It is not about
how the HTTP protocol ensures that caching works right, but rather
about how a finite cache decides to make the most effective use of its
own resources. Put simply, the question for a cache is: "when I want
to add a new cache entry and I don't have enough space, which old
entry (or entries) do I remove from the cache?"

The choice of policy here can be made without much specification in
the HTTP protocol, although ideally the cache would make the
replacement choice that maximizes performance according to some
metric. There are several possible metrics, such as average response
time and worst-case response time, which it may not be possible to
optimize all at once, and I think the choice of which metric to
optimize has to be left up to the owner of the cache.

Ignoring ALL the other query-related caching issues for a moment, in
the case of the sine server one could adopt a heuristic that says
"for URLs of the form 'http://host/path?*', I will only save the 3
most recently used entries." In other words, the cache could use
heuristics on the URL even though the URL is supposed to be opaque.
There is no problem with doing this, because the replacement policy
cannot affect cache correctness; it can only affect performance. (A
sketch of this heuristic appears below.)

There is a question remaining that affects the HTTP spec: is there
anything that the protocol can do to make the replacement choice more
effective?

Koen Holtman writes:

    I have been thinking some time about the idea of nice servers
    saying `this content is cachable, but the chances of it being
    re-requested are very small, so I would not cache it if I were
    you' to user agent and proxy caches, to help them make better
    decisions on how to use their scarce disk space.

    One way to implement this idea would be to introduce a
    cache-control header

        Cache-control: estimated-visits-per-day=N

    with N being an estimate by the server of how often the resource
    is visited each day. Even very rough estimates (say 0.01, 0.1, 1,
    10, 100, or 1000) could help caches a lot. Estimates could be
    made by content authors, or semi-automatically by access log
    statistics tools (counting GETs, and, especially, conditional
    GETs).

This is a very good suggestion, because it expresses a fairly useful
way to determine what to keep in a cache: the "reference interarrival
time".

Suppose that we had a perfect predictor of the future (an "oracle").
When we (i.e., our cache) are about to decide what to kick out of the
cache, what would we ask the oracle? We would probably ask it "of all
the entries in my cache, which one is going to be used last?" That is,
which one is going to be re-referenced the longest time from now? We
might also ask "and how much will it cost me to get it back?",
depending on what metric we want to minimize. However, if we make the
simplifying assumption that it will cost about the same as it did the
last time we retrieved it, then we don't really need an oracle to
answer this second question.

Since we don't have an oracle to predict the future, we usually have
to rely on assuming that the past is a good predictor of the future.
In other words, if we observe a certain interarrival time for past
references to an object, we can predict the next reference time. Or,
if we know something about an object in addition to what we could
observe about it, we can still express our prediction in terms of
reference interarrival time.

Koen's "hits per day" metric is just the reciprocal of the mean
interarrival time. We can discuss whether "hits per day" or "mean
interarrival time in seconds" is the best way to talk about it, but
it's the same thing.
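Here is the query-URL heuristic from above as a rough sketch (Python;
the class and the cap constant are just illustrative names, not part
of any proposal):

    from collections import OrderedDict, defaultdict
    from urllib.parse import urlsplit

    QUERY_CAP = 3   # "only save the 3 most recently used entries" per script

    class QueryAwareStore:
        """Toy cache store that caps entries for URLs of the form
        'http://host/path?*' while leaving other URLs alone."""

        def __init__(self):
            self.entries = {}                        # url -> cached response
            self.queries = defaultdict(OrderedDict)  # (host, path) -> LRU of urls

        def put(self, url, response):
            self.entries[url] = response
            parts = urlsplit(url)
            if parts.query:                          # heuristic applies only here
                bucket = self.queries[(parts.netloc, parts.path)]
                bucket[url] = None
                bucket.move_to_end(url)              # mark most recently used
                while len(bucket) > QUERY_CAP:
                    victim, _ = bucket.popitem(last=False)  # least recently used
                    self.entries.pop(victim, None)

Nothing in this affects correctness: a query result evicted early is
simply refetched on the next miss.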
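And here is a sketch of the "past predicts the future" rule as an
eviction policy (again Python, with illustrative names; the
estimated-visits-per-day field stands for Koen's proposed hint, used
only when the cache has no local observations):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Entry:
        url: str
        last_reference: float                      # time of most recent hit
        mean_interarrival: Optional[float] = None  # locally observed, seconds
        estimated_visits_per_day: Optional[float] = None  # server hint, if any

    def predicted_next_reference(entry):
        """Predict when the entry will next be referenced, preferring local
        observations and falling back to the server's estimate."""
        if entry.mean_interarrival is not None:
            gap = entry.mean_interarrival
        elif entry.estimated_visits_per_day:
            gap = 86400.0 / entry.estimated_visits_per_day  # reciprocal of hits/day
        else:
            gap = float('inf')                     # no information at all
        return entry.last_reference + gap

    def choose_victim(entries):
        """Stand-in for the oracle: evict the entry predicted to be
        re-referenced the longest time from now."""
        return max(entries, key=predicted_next_reference)

With Koen's rough buckets (0.01 through 1000 visits per day), the
fallback gap ranges from about 86 seconds to 100 days, which is plenty
of spread for ranking eviction candidates.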
One suggestion I might make is to supply both a mean and a variance.
That is, if you tell me:

    Bus A comes every 20 minutes, plus or minus 1 minute
    Bus B comes every 15 minutes on the average, but is sometimes
          20 minutes late

which bus stop would I wait at? Similarly for replacement policies: if
the server returns both the mean and the variance, the cache could
compare two entries with roughly similar means to decide which one is
a safer bet. A high variance might arise simply because the mean has
been calculated from a very small number of observations.

What about caches? After all, if you have this configuration:

    origin-server ------- cache-A ------- cache-B ------ one client
                             +
                             +------------ cache-C ------ many clients

then the origin server might see only one hit per day, but cache-A
might see hundreds of hits per day. It would be nice if cache-A could
tell cache-B not just about the origin server's estimate, but about
its own estimate.

For example (this is just a quick thought), suppose that the
interarrival time info included the mean, the variance, *and* the
number of observations (or perhaps the sum of the observations, the
sum of the squares, and N). Then, as this info passes through a cache,
the cache can add its own locally-observed samples, resulting in a
single measurement with a larger sample size.

This still leaves a few potential problems. For example, we might need
to think of a way to "age" the statistical information to deal with
resources whose reference pattern changes dramatically. On one day in
late January, http://www.superbowl.com might have a very short mean
interarrival time, and the number of samples (N) could reach many
millions. A few days later, the daily reference rate drops to a few
hundred, but if you were to compute the mean over all of the samples,
you would probably cache that page for several months longer than its
popularity warrants.
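To make the pass-through idea concrete, here is a minimal sketch
(Python; the names, and the half-life style decay in aged(), are my
illustrative assumptions, not part of any proposal). It keeps the
statistics in the mergeable (sum, sum of squares, N) form, and the
pessimistic_mean() comparison at the end is the "which bus" test from
above:

    import math
    from dataclasses import dataclass

    @dataclass
    class InterarrivalStats:
        """Statistics in a mergeable form: sum of observed interarrival
        times, sum of their squares, and the number of observations N."""
        total: float = 0.0
        total_sq: float = 0.0
        n: float = 0.0          # float, so that aging can scale it down

        def observe(self, gap_seconds):
            self.total += gap_seconds
            self.total_sq += gap_seconds * gap_seconds
            self.n += 1

        def merged_with(self, upstream):
            """What cache-A would forward to cache-B: upstream's numbers
            plus its own locally-observed samples."""
            return InterarrivalStats(self.total + upstream.total,
                                     self.total_sq + upstream.total_sq,
                                     self.n + upstream.n)

        def aged(self, days_elapsed, decay_per_day=0.5):
            """Discount old observations so a one-day burst (the Super
            Bowl case) stops dominating the mean once interest fades."""
            w = decay_per_day ** days_elapsed
            return InterarrivalStats(self.total * w, self.total_sq * w,
                                     self.n * w)

        @property
        def mean(self):
            return self.total / self.n if self.n else float('inf')

        @property
        def variance(self):
            if self.n < 2:
                return float('inf')          # too few samples to trust
            return max(0.0, self.total_sq / self.n - self.mean * self.mean)

    def pessimistic_mean(stats, k=1.0):
        """Mean plus k standard deviations: with two similar means,
        prefer the entry whose schedule you can count on."""
        return stats.mean + k * math.sqrt(stats.variance)

Because aging scales all three numbers uniformly, it commutes with
merging, so each cache along the path can age and merge in either
order and still forward a consistent triple.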
-Jeff