Continuing discussion on Cache Digest

[ with my "cache digest co-author" hat on ]

In discussions about Cache Digest, one of the questions that came up was whether it's necessary to use a digest mechanism (e.g., a Bloom filter or Golomb-compressed set), or whether we could just send a list of the cached representations.
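
For a rough sense of scale, here's a back-of-the-envelope comparison (my own
numbers, not anything from the draft; the 60-byte average URL length is an
assumption) of the bytes needed to enumerate N cached URLs verbatim versus
digesting them probabilistically:

  import math

  def enumerated_bytes(n_urls, avg_url_len=60):
      # Listing URLs costs roughly their full length each (ignoring compression).
      return n_urls * avg_url_len

  def digest_bytes(n_urls, false_positive=1.0 / 128):
      # A Golomb-compressed set needs roughly log2(1/P) + 1.5 bits per entry;
      # a classic Bloom filter needs roughly 1.44 * log2(1/P) bits per entry.
      gcs_bits = n_urls * (math.log2(1 / false_positive) + 1.5)
      bloom_bits = n_urls * 1.44 * math.log2(1 / false_positive)
      return int(math.ceil(gcs_bits / 8)), int(math.ceil(bloom_bits / 8))

  for n in (10, 100, 1000):
      gcs, bloom = digest_bytes(n)
      print("{0:>5} URLs: list ~{1} B, GCS ~{2} B, Bloom ~{3} B".format(
          n, enumerated_bytes(n), gcs, bloom))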

Curious about this, I whipped up a script to parse the contents of Chrome's cache, to get some idea as to how many cached responses per origin a browser keeps.

See:
  https://gist.github.com/mnot/793fcfb0d003e87ea7e8035c43eafdb9
and responses to:
  https://twitter.com/mnot/status/766542805980155905
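
(The per-origin counting itself is trivial once the URLs are out of the
browser; it boils down to something like the sketch below. This is not the
gist -- it assumes the cached URLs have already been extracted into a plain
text file, one per line, and "cached_urls.txt" is a made-up filename.)

  from collections import Counter
  from urllib.parse import urlsplit

  def responses_per_origin(urls):
      counts = Counter()
      for url in urls:
          parts = urlsplit(url.strip())
          if parts.scheme and parts.netloc:
              counts[parts.scheme + "://" + parts.netloc] += 1
      return counts

  # "cached_urls.txt" is a placeholder for wherever the extracted URLs end up.
  with open("cached_urls.txt") as f:
      per_origin = responses_per_origin(f)

  # Largest origins first, to eyeball the distribution.
  for origin, count in per_origin.most_common(20):
      print("{0:>6}  {1}".format(count, origin))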

The caveats around this are too numerous to cover, but to mention a few:
  - this is just anecdata, and a very small sample at that
  - it's skewed towards: 
	a) people who follow me on Twitter; 
	b) people who use Chrome; 
	c) people who can easily run a Python program (leaving most Windows users out)
  - it includes both fresh and stale cached responses
  - it assumes that the Chrome URL gives the complete and correct state of the cache

Looking at the responses (five so far) and keeping that in mind, a few observations:

1. Unsurprisingly, the number of cached responses per origin appears to follow (roughly) a Zipf curve, like so many other Web stats do
2. Origins with tens of cached responses appear to be very common
3. Origins with hundreds of cached responses appear to be not at all uncommon
4. Origins with thousands of cached responses do turn up

More data is, of course, welcome.

My early take-away is that if we design a mechanism where the cached responses are enumerated, rather than digesting the entire cache's contents for the origin, there will need to be some way to select the most relevant cached responses.

The most likely time to do that is when the responses themselves are first cached; e.g., with a cache-control extension (a rough sketch of what that could look like follows the list below). I think the challenges that such a scheme would face are:

a) Keeping the advertisement concise (because it should fit into a navigation request without the congestion window costing another round trip)
b) Being able to express the presence of many URLs (since one effect of HTTP/2 is atomisation into a larger number of smaller resources), with bits of state like "fresh/stale" attached
c) Being manageable for the origin (since they'll effectively have to predict what URLs are important to know about ahead of time, and in the face of site changes)
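
To make (a) and (c) concrete, here's a rough sketch of the client-side
selection such a scheme would imply. The "advertise" cache-control extension
and the ~1 KB budget are entirely hypothetical -- stand-ins for "the origin
opted this response in" and "fits alongside a navigation request without
spilling past the congestion window":

  # Hypothetical: "advertise" is an invented cache-control extension, and the
  # byte budget is a stand-in for "fits alongside a navigation request".
  def select_advertised(cached_responses, budget_bytes=1024):
      """cached_responses: iterable of (url, cache_control, is_fresh) tuples."""
      # Only consider responses the origin explicitly asked us to advertise.
      candidates = [(url, fresh) for url, cc, fresh in cached_responses
                    if "advertise" in cc.lower()]
      # Prefer fresh responses, then shorter URLs, until the budget runs out.
      candidates.sort(key=lambda c: (not c[1], len(c[0])))
      selected, used = [], 0
      for url, fresh in candidates:
          cost = len(url) + (0 if fresh else len(";stale")) + 1  # rough cost
          if used + cost > budget_bytes:
              break
          selected.append((url, fresh))
          used += cost
      return selected

Even in this toy form, (c) shows up: the origin has to have marked the right
responses ahead of time for the selection to be worth anything.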

To me, this makes CD more attractive, because we have more confidence that (a) and (b) are in hand, and (c) isn't a worry because the entire origin's cache state will be sent. Provided that the security/privacy issues are in hand, and that it's reasonably implementable by clients, I think CD also has a better chance of success because it decouples the sending of the cache state from its use, making it easier to reuse the data on the server side without close client coordination.

So, I think the things that we do need to work on in CD are:

1) Choosing a more efficient hash algorithm and ensuring that it's reasonable to implement in browsers (see the sketch after this list for where the hash sits)
2) Refining the flags / operation models so that they're as simple and sensible as possible (but we need feedback on how clients want to send it)
3) Defining a way for origins to opt into getting CD, rather than always sending it.
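
On (1), for illustration only, here's a simplified Golomb-coded set builder
with the hash made pluggable. It is not the draft's wire format and wouldn't
interoperate with it, but it shows where the hash cost sits and what swapping
the hash for something cheaper would actually touch:

  # Simplified sketch only; not the draft's encoding. "p" is the inverse of
  # the false positive probability and is assumed to be a power of two.
  import hashlib

  def url_key(url, n, p, hasher=hashlib.sha256):
      # Hash the URL and reduce it into the [0, n*p) range used by the set.
      digest = hasher(url.encode("utf-8")).digest()
      return int.from_bytes(digest, "big") % (n * p)

  def golomb_coded_set(urls, p=128, hasher=hashlib.sha256):
      """Return the set as a string of bits (for readability, not efficiency)."""
      n = len(urls)
      keys = sorted(set(url_key(u, n, p, hasher) for u in urls))
      bits, prev = [], 0
      for key in keys:
          q, r = divmod(key - prev, p)
          prev = key
          bits.append("0" * q + "1")              # unary-coded quotient
          bits.append(format(r, "0{0}b".format(p.bit_length() - 1)))  # remainder
      return "".join(bits)

The question behind (1) is whether something along these lines, with a cheap
enough hash, is reasonable to run over an entire origin's cache contents at
navigation time.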

Does this sound reasonable?

--
Mark Nottingham   https://www.mnot.net/
