Re: dont-revalidate Cache-Control header

On 14/07/2015 10:31 a.m., Ilya Grigorik wrote:
> On Fri, Jul 10, 2015 at 3:29 AM, Amos Jeffries <squid3@treenet.co.nz> wrote:
> 
>>> At Facebook, we use this method to serve our static resources. However
>>> we've noticed that despite our nearly infinite expiration dates we see
>>> 10-20% of requests (depending on browser) for static resource being
>>> conditional revalidation. We believe this happens because UAs perform
>>> revalidation of requests if a user refreshes the page. Our statistics
>> show
>>> that about 2% of navigations to FB are reloads -- however these requests
>>> cause a disproportionate amount of traffic to our static resources
>> because
>>> they are never served by the user's cache.
>>
>> That tells me that 10-20% of your traffic is probably coming from a
>> HTTP/1.1 proxy cache. Whether it reveals itself as a proxy or not.
>>
>> Speaking for Squid, we limit caching time at 1 year**. After which
>> objects get revalidated before use. Expires header in HTTP/1.1 only
>> means that objects are stale and must be revalidated before next use.
>> Proxy with existing content does that with a synthesized revalidation
>> request even if the client that triggered it did a plain GET. Thereafter
>> the proxy has a new Expires value to use*** until that itself expires.
> 
> 
> Amos, not sure I follow the proxy conclusion.. If I'm reading this
> correctly, it sounds like if I specify a 1 year+ max-age, then Squid
> will revalidate the object for each request?

No, for the first year you get normal caching behaviour. Then from the
1yr mark you get one-ish revalidation, and the fresh copy that comes
back resets the 1yr counter. So instead of getting things cached
forever / 68yrs (possibly by error), you get at least one revalidation
check per year per object.
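
As a rough sketch of that effective behaviour (a TypeScript-flavoured
illustration, not Squid's real logic; CAP_SECONDS stands in for the
configured 1yr limit):

  // Freshness lifetime is whatever the origin asked for, but never
  // more than the cap -- so a "forever" max-age still expires.
  const CAP_SECONDS = 365 * 24 * 3600;

  function isStillFresh(ageSeconds: number, maxAgeSeconds: number): boolean {
    const lifetime = Math.min(maxAgeSeconds, CAP_SECONDS);
    return ageSeconds < lifetime;
  }

Once that check fails the proxy revalidates (If-Modified-Since /
If-None-Match); a 304 with fresh headers resets the stored object's
age to zero, which is the "1yr counter" starting over.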


> If so, ouch. However, unless that gotcha
> accounts for all of the extra revalidations, why would the proxy cause more
> revalidations? Intuitively, shouldn't it reduce the number of revalidations
> by collapsing number of requests to FB origin?

It would reduce the total requests, yes. But in doing so I would expect
some ratio of the total requests (relative to the amount of proxy
usage) to become revalidations instead of full fetches, since a proxy
does not have a reload button and it's quite popular to configure proxy
caches to convert a client reload into an IMS request.

What I mean is: you may find that those 10% of revalidations, instead
of disappearing, turn into full fetches. That would raise the total
latency cost by the bandwidth transfer time, without reducing the
server CPU cost of identifying object versions.
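
To illustrate that conversion (a toy sketch only; the names here are
invented for illustration and are not Squid internals):

  // Turn a client reload (Cache-Control: no-cache) into a conditional
  // request toward the origin, using the validators already in cache.
  interface StoredEntry {
    etag?: string;
    lastModified?: string;
  }

  function buildOriginHeaders(client: Headers, stored?: StoredEntry): Headers {
    const out = new Headers(client);
    const isReload = /no-cache/i.test(client.get("cache-control") ?? "");
    if (isReload && stored) {
      if (stored.etag) {
        out.set("If-None-Match", stored.etag);
      }
      if (stored.lastModified) {
        out.set("If-Modified-Since", stored.lastModified);
      }
      out.delete("cache-control");
    }
    return out;
  }

A 304 reply then lets the proxy answer the reload from its own copy,
so the client pays a validation round-trip instead of a full transfer.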


> 
> (also, as Ben noted, due to HTTPS, I doubt that's the culprit...)
> 

FWIW: Judging by the user queries and Debian/Ubuntu install statistics,
in the past year some 10,000+ Squid devices (a minimum; lots of CentOS,
RHEL and BSD installs are unaccounted for) have been converted to HTTPS
MITM due to administrative and legal requirements (specifically for
filtering Facebook and Google traffic). Use of HTTPS is not avoiding
the proxy interaction.

(This conversion of previously perfectly standards-compliant installs
to MITM is depressing.)

> 
> On Sat, Jul 11, 2015 at 10:58 AM, Ben Maurer wrote:
> 
>> One major issue with this solution is that it doesn't address situations
>> where content is embedded in a third party site. Eg, if a user includes an
>> API like Google Maps or the Facebook like button those APIs may load
>> subresources that should fall under this stricter policy. This issue cuts
>> both ways -- if 3rd party content on your site isn't prepared for these
>> semantics you could break it.
> 
> 
> Hmm, I think a markup solution would still work for the embed case:
> - you provide a stable embed URL with relatively short TTL (for quick
> updates)
> - embedded resource is typically HTML (iframe) or script, that initiates
> subresources fetches
> -- said resource can add appropriate attributes/markup on its subresources
> to trigger the mode we're discussing here
> 
> ^^ I think that would work, no? Also, slight tangent.. Fetch API has notion
> of "only-if-cached" and "force-cache", albeit both of those are skipped on
> "reload", see step 11:
> https://fetch.spec.whatwg.org/#http-network-or-cache-fetch.
> 
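
For reference, those two cache modes look like this from script (a
minimal sketch assuming a browser context; the path is made up, and
browsers require mode "same-origin" alongside "only-if-cached"):

  async function loadSprite(): Promise<void> {
    // "force-cache" reuses any stored response, fresh or stale.
    const forced = await fetch("/static/sprite.png", {
      cache: "force-cache",
    });

    // "only-if-cached" never touches the network at all.
    const cachedOnly = await fetch("/static/sprite.png", {
      cache: "only-if-cached",
      mode: "same-origin",
    });

    console.log(forced.status, cachedOnly.status);
  }

And as Ilya notes, both modes get overridden on a reload, which is
exactly the case being discussed in this thread.
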
> On Mon, Jul 13, 2015 at 2:57 AM, Ben Maurer wrote:
> 
>> We could also study this in the HTTP Archive -- if I took all resources
>> that had a 30 day or greater max age and send their servers revalidation
>> requests 1 week from today, what % of them return a 304 vs other responses.
> 
> 
> Not perfect, but I think it should offer a pretty good estimate:
> http://bigqueri.es/t/how-many-resources-persist-across-a-months-period/607
> 
> - ~48% of resource requests end up requesting the same URL (after 30 days).
> Of those...
> -- ~84% fetch the same content (~40% of all requests and ~33% of total bytes)
> -- ~16% fetch different content (~8% of all requests and ~9% of total bytes)
> 

Pretty much in line with what we see in average proxy cache HIT rates.
Specific ISP situations vary from 20%-60% caching depending on customer
counts vs storage space size. Those who can cache only a week's traffic
get lower rates than those caching a month's, etc.

I expect to see similar numbers from browser caches.
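
FWIW Ben's measurement is also easy to approximate outside the HTTP
Archive: replay a week-old list of URLs with their recorded validators
and count the 304s. A rough sketch (assuming a runtime with a global
fetch; the input list and names are made up for illustration):

  interface Recorded {
    url: string;
    etag?: string;
    lastModified?: string;
  }

  // Re-request each resource conditionally and tally the outcomes.
  async function countRevalidations(recorded: Recorded[]): Promise<void> {
    let notModified = 0, changed = 0, other = 0;
    for (const r of recorded) {
      const headers: Record<string, string> = {};
      if (r.etag) headers["If-None-Match"] = r.etag;
      if (r.lastModified) headers["If-Modified-Since"] = r.lastModified;
      const res = await fetch(r.url, { headers });
      if (res.status === 304) notModified++;
      else if (res.status === 200) changed++;
      else other++;
    }
    console.log({ notModified, changed, other });
  }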

Amos
