HTTP Caching Model?

Consider this scenario:

On a server S, a document D is available as plain text, HTML, or
postscript.

Client C1 is configured to only accept HTML. This client requests
document http://S/D via proxy P as:

	GET http://S/D HTTP/1.0
	Accept: text/html

Proxy P connects to S, requests the document, caches it, and returns
it to C1.

Client C2 is configured to accept postscript and HTML, and to prefer
postscript over HTML. It requests the same document through the same
proxy P:

	GET http://S/D HTTP/1.0
	Accept: text/html; q=0.5
	Accept: application/postscript; q=1.0

Proxy P receives the request, and notices that it has http://S/D in
its cache, so it returns the cached copy.

Note that had C2 requested the document straight from S, it would
have got postscript. But it got HTML from the proxy.
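The scenario above can be sketched in a few lines (all names here are
hypothetical; negotiate() is just a stand-in for whatever server S does):
the proxy's cache is keyed on the URL alone, so C2's Accept headers are
never consulted.

```python
def negotiate(accept):
    """Stand-in for server S: serve postscript if the client accepts it."""
    if "application/postscript" in accept:
        return "postscript body"
    return "html body"

cache = {}

def proxy_get(url, accept):
    if url not in cache:                # first request populates the cache
        cache[url] = negotiate(accept)  # keyed on the URL only -- the bug
    return cache[url]

# C1 (HTML only) primes the cache:
proxy_get("http://S/D", "text/html")               # -> "html body"
# C2 prefers postscript, but gets C1's cached HTML:
proxy_get("http://S/D", "application/postscript")  # -> "html body"
```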

To me, this looks like the caching performed by P is not transparent,
and hence violates the protocol.

OK, ok, so currently nobody uses format negotiation, and certainly
nobody implements the q and c parameters on accept headers (except
probably the CERN linemode browser and server).

But some information providers are using, of all things, the
User-Agent field to customize their documents: they serve up
different stuff for MacMosaic, WinMosaic, Netscape, etc.

Certainly broken proxy caching is observable in these circumstances.
(But in this case, I'd say the fault lies with the information
provider for abusing User-Agent this way, not with the caching proxy.)

One way to correct the behaviour of proxy P above is to key the cache
not just on the URL in question, but on all the request headers as
well.

But clearly this is way too conservative.

It seems to me that the HTTP protocol spec should specify which
request headers can affect the returned data, and which are just
"advisory." A correct cache would key on the URL plus all the request
headers which are allowed to affect the returned data.

For example, authentication headers shouldn't affect the returned
data.  User-Agent shouldn't affect the returned data. (The fact that
it does is a wart that we'll have to deal with somehow.)

It means that introducing new headers that can affect the returned
data (like the recently proposed Accept-Charset: header) can't be done
in a backwards-compatible way. It might be wise to say that all
headers matching Accept-*: are allowed to affect the returned data.
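Under that Accept-*: rule, a correct cache key is easy to build. A
minimal sketch (hypothetical function name; assumes only Accept-*
headers may affect the entity): the key is the URL plus the sorted
Accept-* headers, so User-Agent and authentication headers neither
fragment the cache nor cause the wrong variant to be served.

```python
def cache_key(url, headers):
    # Keep only headers matching Accept-*, normalized and sorted so
    # that header order doesn't produce spuriously distinct keys.
    relevant = sorted((name.lower(), value)
                      for name, value in headers.items()
                      if name.lower().startswith("accept"))
    return (url, tuple(relevant))

k1 = cache_key("http://S/D", {"Accept": "text/html",
                              "User-Agent": "WinMosaic"})
k2 = cache_key("http://S/D", {"Accept": "text/html",
                              "User-Agent": "Netscape"})
# k1 == k2: User-Agent is "advisory" and does not split the cache.

k3 = cache_key("http://S/D", {"Accept": "application/postscript"})
# k1 != k3: a different Accept header selects a different cached variant.
```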

Also... I haven't carefully reviewed the latest HTTP/1.0 spec: does it
include some specification of what is going on when a client requests
ftp://host/path or gopher://host/path via an HTTP proxy? Does it
discuss correct vs. heuristic caching in these cases?

Food for thought...


Received on Monday, 12 December 1994 10:05:07 UTC