Re: Report on HTTP Caching subgroup meeting (Feb 2 1996)

Last one...

> ----------------------------------------------------------------
> Issue: transparency vs. performance
> 
> Since there have been numerous discussions of whether semantic
> transparency or performance is the more important issue for HTTP
> caching, we tried to come to a consensus on what we believed about
> this.
> 
> Here is a rough summary of our consensus:
> 
> 	Applications in which HTTP is used span a wide space
> 	of interaction styles.  For some of those applications,
> 	the origin server needs to impose strict controls on
> 	when and where values are cached, or else the application
> 	simply fails to work properly.  We referred to these
> 	as the "corner cases".  In (perhaps) most other cases,
> 	on the other hand, caching does not interfere with the
> 	application semantics.  We call this the "common case".
> 	
> 	Caching in HTTP should provide the best possible
> 	performance in the common case, but the HTTP protocol MUST
> 	entirely support the semantics of the corner cases, and in
> 	particular an origin server MUST be able to defeat caching
> 	in such a way that any attempt to override this decision
> 	cannot be made without an explicit understanding that in
> 	doing so the proxy or client is going to suffer from
> 	incorrect behavior.  In other words, if the origin server
> 	says "do not cache" and you decide to cache anyway, you
> 	have to do the equivalent of signing a waiver form.
> 
> 	We explicitly reject an approach in which the protocol
> 	is designed to maximize performance for the common case
> 	by making the corner cases fail to work correctly.

Let me again say that I adamantly oppose this decision.  It doesn't
reflect any of the applications that currently use HTTP, it rests on
the subgroup's mythical invention that such a thing is even desirable
in all cases, and it does a poor job of satisfying the user's needs.

The reason that user agents are not always semantically transparent is
that the user does not always want them to be.
No matter what is in the protocol, no decision by the WG will ever
change this fact of life.  It is therefore WRONG to require in
the protocol what cannot be achieved by any application -- all you
are doing is requiring applications to be non-compliant.

What you want is to enable the protocol to say "this is what you have
to do to remain semantically transparent" and then require that
applications default to semantic transparency mode.  The former is
what Cache-control does, and the latter can be added to the text.
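As a rough illustration of defaulting to transparency while still
permitting the "signed waiver" override, a proxy's store decision might
look like the following sketch (the function name, the waiver flag, and
the header parsing are all illustrative assumptions, not protocol
elements):

```python
# Rough sketch: the proxy defaults to semantic transparency and only
# relaxes it when the user has explicitly opted out ("signed the waiver").
# All names here are illustrative assumptions, not protocol elements.

def may_store(response_headers, user_signed_waiver=False):
    """Decide whether a response may be stored and reused from cache."""
    cc = response_headers.get("cache-control", "").lower()
    directives = {d.strip() for d in cc.split(",") if d.strip()}
    if "no-cache" in directives or "no-store" in directives:
        # The origin server demanded transparency; obey it unless the
        # user has knowingly accepted possibly incorrect behavior.
        return user_signed_waiver
    return True
```

The point of the sketch is that the override lives in the user's
configuration, not in the protocol: the default path always obeys the
origin server.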

What we cannot do is control the user's application of HTTP technology;
attempting to do so is foolish and contrary to the design of the Web.
Requiring a visible/noticeable warning be presented when semantic
transparency is disabled is reasonable, provided that it does not
actively interfere with people's work.

> Shel adds:
>     At the meeting we discussed whether we believed it would be
>     reasonable or advisable to think about adding protocol elements to
>     explicitly control history mechanisms, such as (for instance) a
>     directive to prevent the entry of a document into a history
>     buffer.  Surprisingly (as I had thought the consensus was
>     otherwise) people seemed to agree that it was reasonable to
>     consider some options in this area. We didn't discuss it further.

I am surprised as well.  When it was discussed on the WG, there was
near universal agreement that the server has no business messing with
the browser's history function.  The only thing I would ever find
acceptable in that regard would be a "don't write this data to disk"
statement for secure applications, but the fact is that those
applications will already have a better mechanism for saying such a
thing as part of the security negotiation.

> ----------------------------------------------------------------
> Issue: Dates in If-Modified-Since headers
> ...
> I can't remember if we discussed one additional related point,
> or if this came up in another context and I'm confusing the
> recollections: if the server receives an If-Modified-Since:
> containing a date that is later than the actual modification
> time of the resource, should it return 304 Not Modified or
> should it treat this as a validation failure?

It should return 304 unless the date is clearly invalid (e.g., later
than the server's internal clock).  Clients use IMS in this way
on purpose to obtain the equivalent of max-age on requests.
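That rule can be sketched as a simple server-side check (the function
and argument names are assumptions for illustration; times are POSIX
timestamps):

```python
# Sketch of the 304 decision described above: a future-dated
# If-Modified-Since is treated as invalid, otherwise any date at or
# after the resource's modification time yields 304 Not Modified.
# Names are illustrative assumptions.

def ims_status(ims_time, last_modified, server_now):
    if ims_time > server_now:
        return 200      # clearly invalid date: send the full response
    if last_modified <= ims_time:
        return 304      # not modified since the supplied date
    return 200          # modified since then: send the full response
```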

> ----------------------------------------------------------------
> Issue: Cache hierarchies and bypassing
> 
> At one point during the day, K Claffy brought up an issue that
> none of the rest of us had even considered, but we agreed ought
> to be taken seriously.  This is the problem of how a hierarchical
> cache, such as is being used in New Zealand, can optimize retrieval
> latencies for non-cachable resources.
> 
> In the New Zealand case, since they have a very limited-bandwidth
> connection with the rest of the world, they use a national cache
> to avoid overloading this international link.  They also use
> additional caches scattered around the country, which normally
> go first to this national cache but are able to bypass it to
> go directly to an overseas origin server if the national cache
> isn't expected to have the appropriate cache entry.

Yep. [people should know, BTW, that I am half-Kiwi myself and much of
      the protocol has been designed according to the requirements
      identified by these hierarchical caching schemes]

> For example, if a client does a GET with a normal (non-"?") URL,
> the request flows up the cache hierarchy because the responses
> to GETs are normally stored in caches.  However, a POST is sent
> directly to the origin server, because there is no point in
> routing it through the cache hierarchy (there being no chance
> in today's HTTP/1.0 world that the caches would be helpful here).
> 
> In order to do request-bypassing in the most efficient possible
> way, the caches have to be able to determine from the request
> whether the response is likely to be cachable.  (I would assume
> that it is important to err on the side of assuming cachability,
> since the converse could seriously reduce the effectiveness of
> the caches.)

Yes, that is the only choice that can be made aside from building
tables of previously uncachable URLs, which doesn't scale.
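The heuristic amounts to something like this sketch (the rules mirror
common HTTP/1.0-era cache practice; the function name is an
assumption):

```python
# Rough sketch of the bypass decision: route a request up the cache
# hierarchy only when its response is likely cachable, erring on the
# side of assuming cachability.  The "?" rule reflects the HTTP/1.0-era
# convention of not caching query responses; names are assumptions.

def route_via_hierarchy(method, url):
    if method not in ("GET", "HEAD"):
        return False    # e.g. POST goes straight to the origin server
    if "?" in url:
        return False    # query responses were typically not cached
    return True         # when in doubt, assume cachable
```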

> We didn't come up with a good solution to this problem in general
> (i.e., for GETs whose responses are not cachable, or for other
> methods whose responses *are* cachable) but there was some
> brief discussion of the proposed "POST with no side effects"
> method.

POST with no side effects doesn't help either, since there is a near-zero
chance of sustaining an adequate number of repeated cache hits in
order to justify keeping the POST data in the cache key.

> DEFERRED ITEM: what to do about bypassing?

I do not consider it a protocol issue.  The client is capable of making
that decision prior to making the request, so it is not necessary to carry
that information within the protocol.

> ----------------------------------------------------------------
> Issue: Extensibility
> 
> Jeff raised the issue of whether and how we could provide some
> mechanisms in HTTP/1.1 that would allow HTTP/1.1 caches to do
> something sensible with HTTP/1.2 (and later) protocols if new
> methods were introduced after HTTP/1.1.  In other words, even
> though we do not now know what those new methods would be, can
> we figure out how to describe what they mean with respect to
> caches so that we can separate this from the rest of the semantics
> of the new methods?

That is what cache-control is for -- if we think of anything that
can't be done through cache-control, then that thing should be
added to cache-control.

> This quickly led into a lengthy philosophical discussion of
> caching models, led by Larry Masinter.  I'll try to summarize
> how far we got, although we did not reach any closure on this.
> 
> Larry described three possible ways to view an HTTP cache:
> 
> 	a) a cache stores values and performs operations on these
> 	values based on the requests and responses it sees.  For
> 	the purposes of the cache, one can describe each HTTP
> 	method as a transformation on the values of one or more
> 	resources.
> 	
> 	b) a cache stores responses, period.
> 	
> 	c) a cache stores the responses to specific requests.
> 	The cache must be cognizant of the potential interactions
> 	between various requests; for example, a PUT on a resource
> 	should somehow invalidate the cached result of a previous
> GET on the same resource, but a POST on that resource
> 	might not invalidate the result of the GET.
> 
> Nobody wanted to defend view (b); it was clearly insufficient.
> 
> Larry prefers view (c), mostly (if I recall right) because it
> seems to fit best with a simple model of content negotiation.

(c) is what I use as well.

> Jeff favors view (a), because it ultimately seems (to me) to
> allow a more straightforward description of what each method
> means to the cache.  In particular, view (c) seems to require
> describing O(N^2) interactions between the N methods.

That requires the cache to know about all the effects of a particular
action upon the origin server and any other resource controlled by
that origin server.  In other words, it is an impossible position.

Your statement about O(N^2) interactions is just wrong.  GET and HEAD
are the only methods for which caching is a default.  All other methods
(in HTTP/1.1) may have caching characteristics applied to them via
the Cache-Control header field.  In practice, however, there has been
no desire to cache anything other than GET responses, since no other
requests are significantly likely to result in a later cache hit.
Any method other than GET or HEAD also causes a cache flush because
that is the only safe thing to do after a non-idempotent action.
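Those defaults amount to this sketch (the dict-based cache and the
function name are illustrative assumptions):

```python
# Sketch of the defaults above: only GET/HEAD responses are stored, and
# any other method on a URL flushes that URL's cached entry, since that
# is the only safe default after a possibly state-changing action.

cache = {}

def observe(method, url, response=None):
    if method in ("GET", "HEAD"):
        if response is not None:
            cache[url] = response   # cachable by default
    else:
        cache.pop(url, None)        # safe default: flush after a write
```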

> The fact that we could reach agreement on a lot of other issues
> without having any kind of agreement on this particular debate
> suggests that either one's choice between views (a) and (c) does
> not have much effect on the solutions to those issues, or perhaps
> that the "proper" view is some hybrid.
> 
> Getting back to extensibility, if we followed view (a), we could
> perhaps describe the cache-related consequences of new (post-HTTP/1.1)
> methods by some generic request and response headers that the caches
> could obey without understanding the methods themselves.  For example,
> these hypothetical headers could tell the cache to save (or not save)
> the response value, or to make stale ("invalidate") any previous
> cached values associated with the URL mentioned in the request, or
> one or more URIs mentioned in the response headers.  It seems
> somewhat trickier to do a similar thing for extensibility if one
> follows view (c).

Why don't you just define the default on other methods to be a 
cache flush, and then allow

    Cache-control: idempotent

on the request to identify a method that doesn't need a flush.
Note, however, that you'll still need to tell the cache what it
should use as the cache key, and that is not a trivial problem.

Overall, I think it is best to say that all unrecognized methods default
to a cache flush and then allow the cache to behave "as appropriate"
for those methods that are known to that cache.  In fact, I think
that's what I wrote in the HTTP/1.1 draft.
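Under that proposal, a cache's handling of an unrecognized method might
look like the following sketch (the "idempotent" directive is the
suggestion above, not part of any spec; all names are assumptions):

```python
# Sketch of the proposal above: unrecognized methods default to a cache
# flush unless the request carries the hypothetical directive
# "Cache-control: idempotent".  Parsing and names are assumptions.

def needs_flush(method, request_headers):
    if method in ("GET", "HEAD"):
        return False
    cc = request_headers.get("cache-control", "").lower()
    directives = {d.strip() for d in cc.split(",") if d.strip()}
    return "idempotent" not in directives
```

Note that this only answers the flush question; as said above, the
cache-key question for such methods remains open.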

> Shel adds:
>     We did mention that if requests on a URI invalidate the cached
>     responses to other requests (different methods) on the same URI,
>     (with possibly a couple of exceptions), then the N^2 problem goes
>     away -- i.e. you assume a simpler model where the cache doesn't
>     pretend to know the semantics of the methods.  The "short cuts" we
>     discussed were that GET and HEAD don't have to invalidate any other
>     entries, and that the body of a PUT request might be used to
>     service later GET requests.
> but Jeff notes that assumes that a new method does not affect the
> cachability of a resource other than those mentioned in the request
> URL, any request URIs, any response URIs, any response Location:
> headers, etc.  I.e., we would have to be careful about any new
> headers that could identify "cache consequences" of a new method.

It doesn't need to be a perfect solution.  In fact, it isn't necessary
to do any cache flushing -- it is only a convenient optimization.

> ----------------------------------------------------------------
> Issue: PUTs and POSTs
> 
> There was some discussion about caching the results of POSTs,
> and/or the bodies of PUTs, as examples of how the current
> GET-only caching model could be extended.  That is, we discussed
> these as stand-ins for hypothetical future methods while discussing
> the general problem of extensibility.  We did not have time to
> fully discuss caching for PUTs and POSTs.
> 
> DEFERRED ITEM: caching of responses to POSTs

Not useful due to hit probability and storage requirements.

> DEFERRED ITEM: caching and PUTs

Never desirable.

> ----------------------------------------------------------------
> Issue: Content negotiation
> ...
> 
> Back several paragraphs, I mentioned a second key point:
> 
>    (2) How does the origin server know if the cache already
>    holds the response that it should return (and hence we
>    could avoid actually transferring the body of the response)?
> 
> We talked about a number of possible solutions for this.  For example,
> a cache could annotate its request to the origin server with all of the
> information it has about the variants it already has in its cache,
> and let the origin server decide.  But that seems to require huge
> amounts of request headers on all requests, which somewhat defeats
> the purpose of occasionally not sending a response body.
> 
> Someone also proposed tagging variants with unique URIs.  However,
> this might not work for some kinds of content negotiation.

There do not exist any forms of content negotiation for which that
would not work.  That is provable, since any Request-URI can have
some form of validator attached to it to make it unique for the
entity in the response, and then that URI would be included in the
Content-Location of the response.
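As a trivial illustration of the argument (the ";" tagging syntax is an
assumption for illustration, not a defined mechanism):

```python
# Illustration of the claim above: any Request-URI can be made unique
# per variant by attaching a validator, and the resulting URI sent back
# in the Content-Location of the response.  The syntax is an assumption.

def tag_variant(request_uri, validator):
    return f"{request_uri};{validator}"

def content_location(request_uri, validator):
    return {"Content-Location": tag_variant(request_uri, validator)}
```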

> After some discussion, Jeff proposed a "variant-ID" mechanism that
> would provide a compact (and optional) way of communicating between
> cache and server what variants the cache already held.  Someone
> else suggested that this also needed to include cache validators
> for each of the variants, so that the necessary response *would*
> be returned if the cached copy was no longer valid.

In that case, use Content-ID and stop inventing 8 new solutions just
to replace it.

> It was agreed that this scheme had the advantage (over URI-tagging)
> that there was little or no chance that the tags could leak out into
> other contexts (i.e., nobody would try to use them instead of a
> proper URL).

That is not an advantage.

> I described this in more detail today in
> 	http://www.roads.lut.ac.uk/lists/http-caching/0254.html
> 
> ACTION ITEM: Paul Leach willing to write up a proposed spec with
> Jeff's help.  Larry Masinter willing(?) to integrate this with other
> content-negotiation stuff.

We can discuss this at the LA IETF.  Larry, please put on the agenda
that I would like to discuss why we should use Content-ID for opaque
validation and variant identification if such things are necessary.

> ----------------------------------------------------------------
> Issue: Security
> 
> We started by listing five security-related issues:
> 
>    Authentication
>    Proxy-authentication
>    Privacy
>    Spoofing using location headers
>    Data integrity
> 
> We did not succeed in resolving all five of these.
> 
> We seem to believe that data integrity is an end-to-end issue;
> caches should not be checking or computing MD5 (or other)
> integrity checks, or changing any aspect of the requests
> or responses that would be covered by such checks.

That is silly. I agree that the cache should not interfere with
the response during the initial transaction, but it most certainly
should do an integrity check (if any are available) prior to storing
the response for later use, and should not store anything that failed
an integrity check.
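Concretely, a store-time check might look like this sketch (using
Content-MD5 as the example integrity header; the function name and
cache interface implied here are assumptions):

```python
# Sketch of a store-time integrity check: verify any available
# Content-MD5 before caching a response, and refuse to store one that
# fails.  Forwarding the response unmodified is a separate concern.

import base64
import hashlib

def ok_to_store(body: bytes, headers: dict) -> bool:
    content_md5 = headers.get("content-md5")
    if content_md5 is None:
        return True     # nothing to verify; storable as usual
    digest = base64.b64encode(hashlib.md5(body).digest()).decode()
    return digest == content_md5
```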

> Shel led a short discussion of the Location: header problem,
> which mostly boiled down to a plea not to do anything stupid
> in the protocol.  Shel Kaphan has summarized this in
> 	http://www.roads.lut.ac.uk/lists/http-caching/0252.html
> 
> ACTION ITEM: Shel Kaphan to write necessary paragraphs for the
> part of the HTTP/1.1 spec that covers Location:.

Keep in mind that it should be Content-Location for what is currently
being called "Location in a 2xx response".

 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92717-3425    fax:+1(714)824-4056
    http://www.ics.uci.edu/~fielding/

Received on Tuesday, 20 February 1996 22:52:00 UTC