Re: cache-busting document from Andrew Daviel on 1997-06-10 (ietf-http-wg@w3.org from April to June 1997)

From: Andrew Daviel <andrew@andrew.triumf.ca>
Date: Tue, 10 Jun 1997 00:56:14 -0700 (PDT)
To: http-wg@cuckoo.hpl.hp.com
Cc: ircache@nlanr.net
Message-Id: <Pine.LNX.3.95.970609233720.31571B-100000@andrew.triumf.ca>

As much of the cache-busting document is based on some of my Web pages,
I'll try to address concerns so far in an "omnibus" reply.

While this material has been around for some time, it hasn't attracted
much comment from knowledgeable folk - who presumably go direct to the
specs (or wrote them) ...

Larry Masinter wrote:
> I'd like to see more analysis (or references to it) associated
> with each individual piece of advice.

Probably a good idea; much of it is guesswork, though one can glean a
certain amount from proxy cache statistics.

> I've not seen any studies ... my guess that, after such documentation,
> you'd have more specific advice than 'use HTTP/1.1'.

It is clear from the RFC that the designers have given some thought
to hierarchical cache requirements, introducing Cache-Control elements
such as "private", "max-age", etc. I have seen reports at W3 indicating
a significant performance boost from using HTTP/1.1 over 1.0.

> When is it feasible (to use Expires headers) Do sites with planned
> expiration set expires dates?
> Is it feasible to, for example, declare that '/images' at a site
> never changes 

I do. I happen to use a script to modify a .meta side file (which worked
under Apache 1.1), and also to generate Expires in CGI, but Apache 1.2
has support for generating Expires from the .htaccess file, per-directory
or per-file, allocating a maximum age either since the page was modified
or since it was accessed. It is quite feasible to award images
an expiry date a year in the future while text lasts a week, a day
or an hour.
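
A minimal sketch of that policy in Python, for illustration only: it is
not the Apache mechanism itself. The lifetimes table and the name
expires_header are hypothetical; base_time stands for either the access
time or the last-modified time, the two choices mentioned above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Hypothetical lifetimes, loosely following the figures in the message:
# images live for a year, text for a week, anything else for an hour.
LIFETIMES = {
    "image/": timedelta(days=365),
    "text/": timedelta(days=7),
}
DEFAULT_LIFETIME = timedelta(hours=1)


def expires_header(content_type, base_time):
    """Return an Expires value: base_time plus the lifetime for this type.

    base_time must be a timezone-aware UTC datetime (access time or
    last-modified time, whichever policy is wanted).
    """
    for prefix, lifetime in LIFETIMES.items():
        if content_type.startswith(prefix):
            break
    else:
        lifetime = DEFAULT_LIFETIME
    # usegmt=True produces the "GMT" suffix that HTTP dates use.
    return format_datetime(base_time + lifetime, usegmt=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Expires:", expires_header("image/gif", now))
```
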
There are many questions asked about "how do I prevent pages being
cached". Much of the time, what the author really wants is not to make
it uncacheable but to ensure that a user gets "today's page", not
yesterday's.

> >   Use an HTTP server which supports the GET ... with If-Modified-Since
> Don't they all? At least for files?

I would hope so. However, if someone writes a custom database interface
they might forget to handle IMS, even though the database entries
may have file-like properties (a meaningful last-modified date).
There are servers that don't understand HEAD, that return HEAD as if it
were GET, and I once found one that served illegal Date fields. Don't
count on anything.
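
For a custom database interface, the IMS check might look roughly like
the following Python sketch. The function name conditional_status and
the idea of passing in the record's last-modified time are assumptions
for illustration; a bad or missing date simply falls back to a full
response, in keeping with "don't count on anything".

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def conditional_status(ims_header, last_modified):
    """Return 304 if the record is unchanged since ims_header, else 200.

    last_modified must be a timezone-aware datetime. A missing or
    unparseable If-Modified-Since value yields a full (200) response.
    """
    if not ims_header:
        return 200
    try:
        since = parsedate_to_datetime(ims_header)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return 200
    if since.tzinfo is None:
        # A "-0000" zone parses as naive; treat it as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```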

Ben Laurie wrote:
>  Changing all the references
> would be onerous, though - unless it was done by server-side parsing
> (yech).

I've occasionally done this with something like
"find /usr/htdocs -name '*.html' -exec fix-it.pl {} \;"
using in-place editing in Perl.
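
The same kind of bulk in-place edit can be sketched in Python; this is
not the fix-it.pl script, whose contents are not shown here. The name
rewrite_references and its arguments are hypothetical, and the sketch
does a plain byte-for-byte substitution rather than parsing the HTML.

```python
from pathlib import Path


def rewrite_references(root, old, new):
    """Replace old with new in every .html file under root.

    old and new are bytes, so files are rewritten without any charset
    or line-ending conversion. Returns the list of files changed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.html")):
        data = path.read_bytes()
        updated = data.replace(old, new)
        if updated != data:
            path.write_bytes(updated)
            changed.append(path)
    return changed
```

A call such as rewrite_references("/usr/htdocs", b"/old/", b"/new/")
would touch only the files that actually contain the old reference.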

Martin Hamilton wrote:
> >Don't use redirects, since their results are uncacheable.
> 301 is cacheable, 302 is not. Use what you need.

When I wrote this, no-one used 301. A redirect isn't a big hit, anyway,
but if the net's truly bad it might make a difference where a 
hierarchy cache might otherwise serve a cached page without checking it.

>>   Don't use content-negotiation until HTTP 1.1 is more widely
>>     deployed, since in HTTP/1.0 it interacts badly with proxy caches.
>What am I supposed to use until then?

Not many people do, anyway. Doing it via a redirect is a compromise;
it works, the pages themselves are cacheable, but it requires two
requests not one. Not a big deal, perhaps, as http://some.org/xxx
requires 2 to  http://some.org/xxx/index.html's one.
I've used the Apache action module to assign a certain file suffix
to negotiated pages - xxx.lang launches CGI to redirect to xxx.en.html,
xxx.fr.html, etc. depending on Accept-Language.
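
As a rough illustration of what such a negotiation CGI has to decide,
here is a Python sketch that picks a redirect target from
Accept-Language. It is not the script described above; the function
names, the list of available languages and the default are assumptions.

```python
def pick_language(accept_language, available, default):
    """Pick the best available language tag from an Accept-Language value."""
    prefs = []
    for item in accept_language.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a tag without an explicit q-value counts as 1.0
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((q, tag))
    # Stable sort keeps header order among equal q-values.
    prefs.sort(key=lambda p: -p[0])
    for q, tag in prefs:
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # e.g. "fr-ca" falls back to "fr"
        if primary in available:
            return primary
    return default


def redirect_target(base, accept_language, available=("en", "fr"), default="en"):
    """Return the variant to redirect to, e.g. xxx -> xxx.fr.html."""
    lang = pick_language(accept_language or "", available, default)
    return f"{base}.{lang}.html"
```

The redirect itself stays small and uncacheable, while each variant it
points to has its own URL and can be cached normally.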

> > Don't use server modules .. convert document's character 
> >    set on the server side. 
> What if the client can't do it? 

OK OK, do what you have to. Given the choice, though, it's preferable
not to manipulate the content based on user-agent unless some thought is
given to caching - perhaps redirecting MSIE users down one leg, Netscape
Winxx down another, Netscape X11 down a third. 

David W. Morris wrote:
> Sorry, it is the client's responsibility to declare what it is capable
> of. ...

Roll on feature negotiation !
Meanwhile ... someone negotiates a page based on MSIE with 24-bit colour.
Next guy to hit the cache has Netscape on X11 with 8-bit ....
How to fix this - redirects, I guess. The easy way out is just to set
Cache-Control: private or Pragma: no-cache and bypass hierarchical cache.

Wojtek Sylwestrzak wrote:

> Unfortunately most of the servers practicing this today
> try  to perform a 'naive' content negotiation, which effectively
> uses redirects to other urls. This is of course wrong,
> because it unnecessarily expands the url addressing space,
> thus making caching less effective.

I don't think so ... If I have A.var, which redirects to 
A.en.html, A.jp-jis.html, A.jp-eu.html, A.fr.html I have one
small uncacheable redirect, and 4 cacheable documents. The 4 documents
are all different, and have distinct URLs, so are cached independently.
There is the question of what a spider sees ... an agent without 
Accept-Language may get an HTML list of the separate pages, so they
get indexed separately, which is fine, except that the search engine
result points to the final page, not the original negotiation script.
(my proposal draft-daviel-metadata-link-00.txt addresses that )

> From the caching point of view it would be a very good practice
> for the clients to request/expect a single, standard charset
> for a given language (considered being a 'transport' charset). 

Nice idea; pity everyone's platform uses different coding :-(
(shift-jis, jis, euc-jp; koi-8, 8859-5, Windows-xxx etc etc.)
I think in some cases DOS, Windows, X11 and Mac are all different.
Unicode may help, but I hear it's not perfect either (missing some
charsets, 2 bytes required instead of one in many cases ..)

Shel Kaphan wrote:

> >   Don't use secure servers to serve images and other non-sensitive
> >     objects, since these will be uncacheable and may not be passed
> >     through a cache hierarchy.
> > 

> Not a good recommendation:  some browsers will put up a dialog box
> whenever there's a reference from a secure page to a non-secure page,

I don't have a tame https server to play with and hadn't realized.
I've modified the original document.

In common, I suspect, with many of you, when I access my banking services
on the Web I want to get on and do the job at least as fast as on a
touch-tone phone, not wait for a lot of background images, adverts, icons
etc. to download over my phone line. It seemed daft to serve these
images from the uncacheable https channel. As it is, I turned on
cache for https in Netscape. If someone gets root access to read my 
cache files, they can snarf my passwords and credit card numbers
right out of /dev/kmem. ... of course, they could just check the trash ...

Andrew Daviel
TRIUMF & Vancouver Webpages
Received on Tuesday, 10 June 1997 00:55:56 UTC