Cache-busting - cause and prevention

Martin Hamilton, Loughborough University
Andrew Daviel, Vancouver Webpages

$Revision: 1.4 $

Abstract

Cache-busting is the sometimes deliberate, sometimes inadvertent,
practice of defeating caching. This document explains the nature of
the problem as it relates to proxy caches using the World-Wide Web's
HTTP protocol, and outlines some simple measures which may be taken
to make a WWW service more cache-friendly.

1. The rationale for caching

A large number of Internet sites have elected to run proxy HTTP [1,2]
servers. These act as intermediaries between end users' World-Wide
Web browsers and the (predominantly) HTTP servers they connect to.
Proxies are typically set up in order that :-

   o  users behind firewalls can have access to WWW services, and/or

   o  commonly requested objects can be cached.

Proxy caches offer additional functionality above and beyond the WWW
browser's own built-in cache, since cached objects may be shared with
the entire population of users and with cooperating proxy cache
servers. By contrast, browser caches are typically private to the
individual, or can only be shared with those browsers which have
access to the filesystem on which the cached objects are found.

A cache's effectiveness is usually measured in terms of its "hit
rate" - the proportion of requests which can be satisfied using
cached objects. The goal of the cache administrator is to make this
figure as high as possible, whilst simultaneously maintaining a large
cache of objects. Cache hit rates of 40% to 50% for WWW-related
traffic are common, for example.

Caching also helps to make more effective use of the available
bandwidth by allowing TCP congestion control algorithms to work
properly - conventional HTTP traffic takes the form of a very large
number of short-lived TCP connections, which often defeats TCP
"slow-start" [3] on busy lines.

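To make the "hit rate" figure discussed in this section concrete, the
following minimal Python sketch computes it from a list of request
outcomes. The function name hit_rate and the sample data are
hypothetical and purely illustrative; a real proxy would derive the
outcomes from its access log, whose format varies between products.

```python
def hit_rate(outcomes):
    """Return the fraction of requests served from cache.

    `outcomes` is an iterable of booleans: True for a cache hit,
    False for a cache miss. Returns 0.0 when there are no requests.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    # True counts as 1 and False as 0, so sum() counts the hits.
    return sum(outcomes) / len(outcomes)


# Hypothetical sample: 9 hits out of 20 requests.
sample = [True] * 9 + [False] * 11
print(f"hit rate: {hit_rate(sample):.0%}")  # prints "hit rate: 45%"
```

This counts requests only. It says nothing about how many bytes were
saved, which a cache administrator may also want to track.
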
It follows that proxy caching is highly attractive, on a cost/benefit
basis, to Internet Service Providers and the organisations which buy
connectivity from them. Cache hits are typically delivered an order
of magnitude faster than cache misses, for leaf node caches at least.
This means that a site which encourages caching will provide the end
user with a much higher perceived quality of service.

2. The cache-busting problem

Support for proxies and caching has essentially been retro-fitted to
the HTTP protocol and its implementations. As a result, many common
practices are incompatible with caching, and either defeat it
completely or reduce the benefits which derive from it. This is
primarily an educational issue involving developers of WWW services
and implementors of HTTP.

It is also the case that caching at the HTTP level can cause problems
for services which make heavy use of usage statistics - e.g. to
provide "hit counts" for advertisers. Users of cached copies of an
object are effectively invisible to the provider of the original
service. This may provide a strong motivation to defeat caching.

3. How to avoid cache-busting

There are a number of positive steps which developers of HTTP based
WWW services may take to be cache-friendly :-

3.1 Things to try

   o  Use a server which supports HTTP/1.1 - this has a number of
      additional features to support caching.

   o  Use the Expires header on documents and images where feasible -
      this will help caches to decide when your objects are stale.

   o  Use an HTTP server which supports the GET method with the
      If-Modified-Since header - this will help browsers and proxy
      caches to figure out whether their cached copy of a file is out
      of date.

   o  Make CGI programs cacheable where practical :-

      -  Use GET instead of POST for simple queries, since POST
         results aren't cached.

      -  Use the path component of the URL to pass information
         instead of QUERY_STRING - caches may treat objects with a ?
         in their URL as uncacheable.

      -  Use a directory name other than "cgi-bin", since caches can
         be expected to treat URLs containing this as uncacheable.

      -  Generate valid Last-Modified and Expires headers.

      -  Handle If-Modified-Since requests.

   o  Ensure that the time is set correctly on the server machine,
      e.g. via NTP [4], so that the timestamp information carried in
      the HTTP headers makes sense.

   o  Encourage the sharing of links to common graphics and applets,
      so that only one URL is used for a given object.

   o  Use client-side imagemaps (USEMAP - [5]) where feasible, since
      server-side imagemaps generate HTTP Redirects which are
      typically uncacheable.

   o  Use applet and scripting technologies such as JavaScript or
      Java instead of CGI for form validation, where feasible.

   o  Use trailing slashes (/) for directory names to avoid extra
      redirects.

   o  If you use cookies, try to restrict them to the portions of
      your server where they're essential, since objects returned
      with a Set-Cookie header are uncacheable. Be aware that cookies
      may not interact well with proxy cache servers.

   o  Try to use a single name for a server in the hostname part of
      the URL - both in the anchors of your HTML and when using your
      browser.

3.2 Things to avoid, except where strictly necessary

   o  Don't use CGI programs which generate uncacheable results.

   o  Don't parse USER_AGENT to switch on browser capabilities, since
      the cached HTML will be browser specific. Use features like
      instead.

   o  Don't use server-side includes unless your server can send the
      Last-Modified HTTP header with them.

   o  Don't use redirects, since their results are uncacheable.

   o  Don't use secure servers to serve images and other
      non-sensitive objects, since these will be uncacheable and may
      not be passed through a cache hierarchy.

   o  Don't rename files to age them - give them unique names in the
      first place and update the links which point to them.

   o  Don't set the objects your server returns to expire
      immediately, or at some time in the recent past, unless you
      want to be held up to public ridicule!

   o  Don't use content-negotiation until HTTP/1.1 is more widely
      deployed, since in HTTP/1.0 it interacts badly with proxy
      caches.

   o  Avoid specifying port 80 in the URL, e.g. when generating URLs
      programmatically.

   o  Don't use the numeric representation of the server's address in
      URLs if you have a choice.

   o  Don't use server modules or scripts to convert a document's
      character set on the server side. Leave it to the client.

4. Security considerations

Cache-busting is clearly justified in those cases where the use of
caching has, in itself, security and privacy implications.

Proxy servers tend to subvert firewalls and access controls based on
IP addresses and/or domain names.

5. Acknowledgements

Thanks to Duane Wessels, Vinod Valloppilli, George Michaelson, Donald
Neal, Ernst Heiri and Wojtek Sylwestrzak for their contributions and
comments on previous versions of this document.

6. References

[1] A. Luotonen and K. Altis, "World-Wide Web proxies", In WWW94
    Conference Proceedings (Elsevier), 1994.

[2] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
    "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068 (Proposed
    Standard), 01/03/1997.

[3] W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
    Retransmit, and Fast Recovery Algorithms", RFC 2001 (Proposed
    Standard), 01/24/1997.

[4] D. Mills, "Network Time Protocol (v3)", RFC 1305 (Proposed
    Standard), 04/09/1992.

[5] J. Seidman, "A Proposed Extension to HTML: Client-Side Image
    Maps", RFC 1980 (Informational), 08/14/1996.
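
As an illustration of the advice in section 3.1 to generate valid
Last-Modified and Expires headers and to handle If-Modified-Since
requests, the following Python sketch shows one way a dynamic page
generator might choose its status code and headers. The function
name cacheable_response and its parameters are hypothetical and not
part of any server API. It assumes the caller supplies UTC timestamps
and the raw If-Modified-Since value (under CGI, request headers
normally arrive as environment variables such as
HTTP_IF_MODIFIED_SINCE).

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def cacheable_response(last_modified, now, max_age, if_modified_since=None):
    """Return (status, headers) for a GET of a cacheable object.

    last_modified, now: datetimes whose tzinfo is timezone.utc.
    max_age: timedelta for which the object may be treated as fresh.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {
        "Date": format_datetime(now, usegmt=True),
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Expires": format_datetime(now + max_age, usegmt=True),
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header
        if since is not None:
            if since.tzinfo is None:
                # Assumption: treat a zone-less date as UTC.
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                # The client's copy is still current: send no body.
                return 304, headers
    return 200, headers


# Hypothetical usage: object last changed on 1 January 1997,
# requested a day later, considered fresh for one hour.
modified = datetime(1997, 1, 1, tzinfo=timezone.utc)
requested = datetime(1997, 1, 2, tzinfo=timezone.utc)
status, headers = cacheable_response(
    modified, requested, timedelta(hours=1),
    if_modified_since="Wed, 01 Jan 1997 00:00:00 GMT",
)
print(status)  # prints 304
```

The sketch only decides the status and headers. Writing them out, and
sending the body for a 200 response, is left to the surrounding
program. An Expires value in the future is what allows a cache to
reuse the object without contacting the server at all.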