Re: FYI Cache-control deployment

Roy T. Fielding wrote:
> On Nov 25, 2009, at 1:27 PM, Adrien de Croy wrote:
>   
>> Hi All
>>
>> this is just for interest's sake.  As part of our load testing we hammer our proxy with a whole bunch of crawlers out onto the 'net.  In the last run we were testing our new cache.  After about a million hits crawling sites, I was wondering why we only had about 200,000 files in cache.  We cache anything with a cache validator (ETag, Last-Modified), freshness info (Expires), or appropriate Cache-Control response directives (max-age, s-maxage, public, must-revalidate etc.).  It seemed to me the cacheability of the net was not great, which limits cache effectiveness.
>>
>> So I turned on counting of each different Cache-control header combination we received.  The results were quite interesting.
>>
>> * About 70% of responses didn't include a Cache-control header at all
>>     
>
> Which means they use the default caching, as intended.
>   
Or not.  Since I'm only getting a 20% strike rate, I'd guess a large 
proportion of these aren't specifying any validators either.

One thing we don't do is heuristic caching.  Does this mean heuristic 
caching is the most-used form of caching?

>   
>> * Of the remaining 30%, about 80% used the Cache-control header to prevent caching (no-store, private).
>>     
>
> Again, that's often intended.
>   
Sure.  Another thing I noted: a very small proportion (< 1%) use the 
Pragma header in responses, which is good.

>   
>> So only about 7% of sites seem to be using Cache-control to actually specify how to cache something (e.g. specify freshness and revalidation information).  This is quite disappointing.
>>
>> There were quite a few sites that sent conflicting directives. The private directive is odd, since there was no authentication going on.
>>     
>
> Private is to indicate the cacheable response is not to be
> shared even though authentication is not going on.  If auth
> were present, there would be no need to indicate private
> because that is the default with auth.
>   
Right, as per s2.1 and s3.2 of draft-ietf-httpbis-p6-cache-08.txt - thanks.

>   
>> The numbers above are only approximate; if anyone is interested, I can post better / more rigorous results after our next test. 
>> It does seem to show on the face of it that
>>
>> a) Cache-control isn't well supported in the wild
>>     
>
> No, that is not what it means at all.
>
>   
OK, what I should have said is that it's not well used.  I wasn't trying 
to make a claim about whether the software supports it.  But it is used 
in a minority of responses, and then mostly to prohibit storage.

Given the goals of the header, its potential, and the potential for 
caching to improve user experience, this is quite disappointing.  From 
the spec one gets the impression that a key goal of the design was to 
enable more effective and efficient caching by providing more explicit 
boundaries for revalidation and freshness.  However, this part of it 
hasn't really been rolled out in practice.  I guess there's really 
nothing that a max-age directive can say that an Expires header can't.
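To illustrate that equivalence, here's a minimal Python sketch of how a cache might derive the same freshness lifetime from either header (the function name and the header-dict representation are mine, not from any implementation; max-age is relative to the response, while Expires is absolute and has to be compared against Date):

```python
from email.utils import parsedate_to_datetime

def freshness_lifetime(headers):
    """Return an explicit freshness lifetime in seconds, or None.

    Prefers Cache-Control: max-age (a relative lifetime); falls back
    to Expires minus Date (an absolute expiry compared to the origin's
    clock).  Both paths yield the same kind of answer.
    """
    cc = headers.get("Cache-Control", "")
    for directive in cc.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            return int(value)  # relative lifetime in seconds
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return max(0, int((expires - date).total_seconds()))
    return None  # no explicit freshness information in the response
```

Either way the cache ends up with one number: how many seconds past the response's Date it may serve the entry without revalidation.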

I didn't count occurrences of Expires, or other validators, or whether 
these were combined with Cache-Control.  I can do that though - could be 
interesting.

>> b) There's a lot of confusion about Cache-control directives (based on the combinations people choose).
>>     
>
> I have no cure for that.  No additional specification will
> help those people.  Splitting caching into a separate part might.
>   
It certainly has made life a lot easier for me to implement our new cache.

Regarding the freshness model, section 2.3 says:

"The HTTP/1.1 specification does not provide specific
   algorithms, but does impose worst-case constraints on their results."

I see a SHOULD-level requirement for heuristic expiry calculations in 
2.3.1.1, but no real constraints (apart from which status codes it can 
be used for).  Are these still being worked out?  Maybe it would be 
easier to remove the sentence from 2.3.

If there's no Last-Modified header, is it prohibited to use heuristic 
freshness?  Or is the common approach to assume that something is 
cacheable if no headers indicate the contrary?  If there are no 
validators, though, there is no basis on which to calculate freshness, 
so the response can't reliably be served.  One would have to resort to 
an administratively specified freshness lifetime (perhaps varying by 
content-type or other parameters), applied from when the response was 
first received.

We used to do this - it causes no end of trouble with dynamic sites.
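To make concrete the two approaches above, here's a minimal Python sketch (the function name and the per-type default lifetimes are mine, purely administrative policy; the 10% fraction is only the value commonly cited as a typical heuristic, not a requirement of the spec):

```python
from email.utils import parsedate_to_datetime

# Hypothetical administrative fallback lifetimes (seconds) by media
# type, applied from when the response was first received.  These
# numbers are invented for illustration, not taken from any spec.
DEFAULT_LIFETIME = {"image/png": 86400, "text/html": 300}

def heuristic_freshness(headers, default=60):
    """Sketch of a heuristic freshness lifetime in seconds.

    If Last-Modified is present, use 10% of the interval between
    modification and the response's Date (the fraction often cited
    as a typical choice).  Otherwise fall back to an administratively
    configured per-type default - the approach that caused us so much
    trouble with dynamic sites.
    """
    if "Last-Modified" in headers and "Date" in headers:
        last_modified = parsedate_to_datetime(headers["Last-Modified"])
        date = parsedate_to_datetime(headers["Date"])
        return max(0, int(0.1 * (date - last_modified).total_seconds()))
    ctype = headers.get("Content-Type", "").split(";")[0].strip()
    return DEFAULT_LIFETIME.get(ctype, default)
```

The Last-Modified branch at least scales with how stable the resource has been; the configured-default branch is exactly the guesswork that goes wrong on dynamic content.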

Regards
Adrien

> ....Roy
>
>
>   

Received on Wednesday, 25 November 2009 22:46:19 UTC