Re: Content encoding problem... from Jeffrey Mogul on 1997-02-20 (ietf-http-wg@w3.org from January to March 1997)

From: Jeffrey Mogul <mogul@pa.dec.com>
Date: Wed, 19 Feb 97 16:34:02 PST
To: jg@zorch.w3.org
Cc: http-wg@cuckoo.hpl.hp.com
Message-Id: <9702200034.AA01526@acetes.pa.dec.com>
There seems to be some confusion surrounding the issue of end-to-end
data compression, which was probably partly created because Jim
forwarded a response to a message I sent, and the response didn't quote
my entire original message.  I'll try to clarify what the problems
are.  [Warning: long message follows.]

Jim and Henrik have argued that there is compelling evidence that
end-to-end data compression is sufficiently useful that we should not
wait for HTTP/2.0.  I agree that there is a good case to be made;
I'd prefer to discuss this offline with anyone who doesn't buy
Jim and Henrik's argument.

I also believe that HTTP/1.1 can offer exactly what is needed, once a
few *minor* problems are resolved.

First, for concreteness, suppose that we discover that a new
compression algorithm, say "zipflate", is better than either gzip or
compress.

There are three problems with RFC2068 that would prevent the most
efficient use of this compression algorithm, and that might result
in presenting users with bogus results:

	(1) The current specification of Accept-encoding *requires*
	(SHOULD-level, not MUST-level) a server to return an
	error response in a situation where this is probably
	not optimal.  This might lead to many extra round-trips,
	and might also lead to the destruction of otherwise
	useful proxy-cache entries.

	(2) The current specification of Accept-encoding *allows*
	a server to send a response using an encoding that the
	client software might not only not understand, but which
	it might improperly render to an unwitting user.

	(3) The current design allows an HTTP/1.0 cache to return
	an encoded response to an HTTP/1.0 client, in such a way
	as to cause the client to render garbage to an unwitting user.

The first two problems can be solved by a change to the specification
of Accept-Encoding.  The last problem can be solved by introducing a
new status code, analogous to the one used for Partial-content (e.g.,
byte-range) responses.  I'll elaborate below on each of these points.

The current wording in section 14.3 (Accept-Encoding) says:

   If no Accept-Encoding header is present in a request, the server MAY
   assume that the client will accept any content coding. If an Accept-
   Encoding header is present, and if the server cannot send a response
   which is acceptable according to the Accept-Encoding header, then the
   server SHOULD send an error response with the 406 (Not Acceptable)
   status code.

Here are some scenarios where this specification causes trouble:

Scenario #1:
	HTTP/1.0 Client sends no Accept-Encoding header.
	Server sends
		Content-encoding: zipflate
		Content-type: text/html
	Client renders garbage

Henrik's experiments apparently confirm that.

Scenario #2:
	HTTP/1.1 Client sends
		Accept-encoding: zipflate
	HTTP/1.1 Server without support for zipflate sends
		HTTP/1.1 406 Not Acceptable

In this case, it would almost certainly be more efficient for
the server to simply send the unencoded (identity) response,
if this is available, rather than forcing the client to try
again.  (See below for a proposal that allows the client to
explicitly say "send me nothing if you can't send me what I want".)

Jim's proposal is:

    If an Accept-Encoding header is present, and if the server cannot
    send a response which is acceptable according to the
    Accept-Encoding header, then the server SHOULD send a response
    using the default (identity) encoding; if the identity encoding
    is not available, then the server SHOULD send an error response 
    with the 406 (Not Acceptable) status code.

That solves the problem with scenario #2, but not with scenario #1.

I have three different proposals to solve these two problems,
in order of increasing distance from current practice (and in
order of increasing precision, I think).

The simplest change would be to say:

    If no Accept-Encoding header is present in the request, then
    the server SHOULD respond using one of
	o the default (identity) content-coding; or
	o the "compress" content-coding; or
	o the "gzip" content-coding.
    It MUST NOT respond using any other content-coding.  If none
    of these content-codings is available, the server SHOULD send
    an error response with the 406 (Not Acceptable) status code.

	Note: the use of unsolicited compressed encodings may
	lead to confusing errors in rendering the response, and
	should be done with caution.

    If an Accept-encoding header is present, and if the server cannot
    send a response which is acceptable according to the
    Accept-Encoding header, then the server SHOULD send a response
    using the default (identity) content-coding; it MUST NOT send a
    non-identity content-coding not listed in the Accept-encoding
    header.  If, in this case, the identity content-coding is not
    available, then the server SHOULD send an error response with the
    406 (Not Acceptable) status code.

Actually, because the HTTP/1.1 spec does not explicitly require
a client to support any of the non-identity content-codings, it
seems smarter to use something like the following wording instead:

    If no Accept-Encoding header is present in the request, then
    the server SHOULD respond using the default (identity) content-coding.
    It MUST NOT respond using any other content-coding.  If the
    identity content-coding is not available, the server SHOULD send
    an error response with the 406 (Not Acceptable) status code.

    If an Accept-encoding header is present, and if the server cannot
    send a response which is acceptable according to the
    Accept-Encoding header, then the server SHOULD send a response
    using the default (identity) content-coding; it MUST NOT send a
    non-identity content-coding not listed in the Accept-encoding
    header.  If, in this case, the identity content-coding is not
    available, then the server SHOULD send an error response with the
    406 (Not Acceptable) status code.
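
As a rough illustration of the wording above (not part of the proposed
text), the server-side choice could be sketched as follows.  The
function names and the (status, coding) return shape are assumptions
made for this sketch; the header is treated as a plain comma-separated
list of content-codings.

```python
from typing import List, Optional, Sequence, Tuple

IDENTITY = "identity"  # label used here for "no content-coding applied"


def parse_accept_encoding(header: Optional[str]) -> Optional[List[str]]:
    """Return the listed content-codings, or None if the header is absent."""
    if header is None:
        return None
    return [tok.strip().lower() for tok in header.split(",") if tok.strip()]


def choose_coding(accept_encoding: Optional[str],
                  available: Sequence[str]) -> Tuple[int, Optional[str]]:
    """Pick (status, content-coding) for a response.

    `available` lists the codings the server can produce for this
    resource, including "identity" when the unencoded form exists.
    """
    have = [c.lower() for c in available]
    accepted = parse_accept_encoding(accept_encoding)
    if accepted is not None:
        # Header present: only a listed coding may be used as-is.
        for coding in accepted:
            if coding in have:
                return 200, coding
    # No header, or nothing listed is available: fall back to identity.
    if IDENTITY in have:
        return 200, IDENTITY
    # Identity is not available either: 406 (Not Acceptable).
    return 406, None
```

With this sketch, a request carrying "Accept-encoding: zipflate" to a
server that only has identity and gzip gets the identity response in
one round trip instead of a 406.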

And, if we want to make it possible for a client to say "send me
a compressed encoding or send me nothing", then I'd propose this
pair of changes:

(1) in section 3.5 (Content Codings), add this after the item
for "deflate":

	identity	The default (identity) encoding; the use
			of no transformation whatsoever.  This
			content-coding is used only in the
			Accept-encoding header, and SHOULD NOT
			be used in the Content-Encoding header.

	An HTTP/1.1 client or server MAY support any of these
	content-codings, but SHOULD NOT assume (without explicit
	evidence) that any other client or server supports any
	content-coding besides "identity".

(2) the wording in section 14.3 would become:

    If no Accept-Encoding header is present in the request, then
    the server SHOULD respond using the default (identity) content-coding.
    It MUST NOT respond using any other content-coding.  If the
    identity content-coding is not available, the server SHOULD send
    an error response with the 406 (Not Acceptable) status code.

    If an Accept-encoding header is present, and if the server cannot
    send a response which is acceptable according to the
    Accept-Encoding header, then the server SHOULD send an error
    response with the 406 (Not Acceptable) status code.

	Note: a client willing to accept either a compressed
	or uncompressed response should send, for example,

	    Accept-encoding: identity,gzip

	to allow the server to generate a response without
	wasting a round-trip.
	
This should solve problems #1 and #2.
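
A minimal sketch of this stricter variant, in which "identity" is an
explicit token and a present header is taken literally.  The function
name is hypothetical, and letting the server's own preference order
decide among acceptable codings is an assumption of the sketch, not
something the proposed wording specifies.

```python
from typing import Optional, Sequence, Tuple


def choose_coding_strict(accept_encoding: Optional[str],
                         server_preference: Sequence[str]
                         ) -> Tuple[int, Optional[str]]:
    """Pick (status, content-coding) when "identity" must be listed.

    `server_preference` lists the codings available for the resource,
    most preferred first (an assumption made for this sketch).
    """
    if accept_encoding is None:
        # No header: only the identity content-coding may be sent.
        accepted = {"identity"}
    else:
        accepted = {tok.strip().lower()
                    for tok in accept_encoding.split(",") if tok.strip()}
    for coding in server_preference:
        if coding.lower() in accepted:
            return 200, coding.lower()
    # Nothing acceptable, and no silent fallback to identity: 406.
    return 406, None
```

Here "Accept-encoding: identity,gzip" lets a gzip-capable server answer
with gzip and any other server answer with identity, while
"Accept-encoding: zipflate" alone means "zipflate or nothing".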

====================

Now, on to problem #3.

Suppose one has this configuration:


                                           |--- HTTP/1.1 client A
                                           |
HTTP/1.1 server S ---- HTTP/1.0 proxy P ----
                              with cache   |
                                           |--- HTTP/1.0 client B

Now suppose that client A does
	GET http://S/foo.html HTTP/1.1
	Host: S
	Accept-Encoding: zipflate

via proxy P, which forwards it to server S, which responds with

	HTTP/1.1 200 OK
	Content-Encoding: zipflate
	Content-type: text/html
	Last-Modified: .....
	Expires: .....
	Cache-control: .....

Proxy P caches this response and forwards it to client A.  So far,
so good.

Soon thereafter (before the Expires time), client B decides to issue its
own request for the same URL:
	GET http://S/foo.html HTTP/1.0

Since HTTP/1.0 proxy P doesn't understand "Accept-Encoding", as far as
I can tell, it's likely to return the cached response to B.  But client
B's HTTP/1.0 browser won't know how to render it.  If that software
is smart, it might re-issue the request with a "Pragma: no-cache".
But I doubt that any existing browsers are this smart, with the
result that B's user (e.g., my mom) would be faced with a mysterious
error message (or a screen full of garbage).
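
The failure can be reproduced with a toy model.  This is only a
sketch: the class and function names are hypothetical, responses are
reduced to header dictionaries, and the cache is assumed to key its
entries on the URL alone, which is the behavior described above.

```python
class Http10Cache:
    """Toy proxy cache keyed by URL only; it ignores Accept-Encoding."""

    def __init__(self, origin):
        self.origin = origin   # callable standing in for server S
        self.store = {}        # url -> cached response headers

    def get(self, url, request_headers):
        if url not in self.store:
            # Cache miss: forward the request and keep the response.
            self.store[url] = self.origin(url, request_headers)
        # Cache hit: the request headers are never consulted.
        return self.store[url]


def origin_server(url, request_headers):
    """Stand-in for server S: encodes only when zipflate was requested."""
    accepted = request_headers.get("Accept-Encoding", "")
    if "zipflate" in accepted:
        return {"Content-Encoding": "zipflate", "Content-Type": "text/html"}
    return {"Content-Type": "text/html"}


proxy = Http10Cache(origin_server)
# Client A (HTTP/1.1) asks for zipflate and fills the cache.
to_a = proxy.get("http://S/foo.html", {"Host": "S",
                                       "Accept-Encoding": "zipflate"})
# Client B (HTTP/1.0) sends no Accept-Encoding but gets A's entry.
to_b = proxy.get("http://S/foo.html", {})
```

After these two requests, `to_b` carries "Content-Encoding: zipflate"
even though client B never offered to accept it.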

I suppose one could hope that HTTP/1.0 caches don't store responses
with a Content-encoding header, but I looked at the sources for
the CERN httpd, and it doesn't seem to pay any attention.

The HTTP/1.0 "specification" defines the "gzip" and "compress"
content-codings, but does not define "deflate", so it is reasonable
to assume that many (if not all) HTTP/1.0 clients and proxies do
not understand the full set of content-codings already specified
in HTTP/1.1, let alone anything new that comes along.

So I suspect that we will need to fix HTTP/1.1 to make it safe
to use "deflate", or other new content-codings, with HTTP/1.0
caches before any widespread deployment of these compression
algorithms could be contemplated.  

I propose that we add a new status code, analogous to 206 (Partial
Content), to be used on all HTTP/1.1 responses with a non-identity
content-coding.  For example, 207 (Encoded Content).  This would allow
HTTP/1.0 caches to forward, but not to cache, the response; it would
allow HTTP/1.1 implementations to do whatever is appropriate.  (I.e.,
an HTTP/1.1 cache would have to check the Content-Encoding against the
Accept-Encoding of a subsequent request.)

Here's some proposed wording:

10.2.8 207 Encoded Content

   The server has used a non-identity content-coding for the response.
   The request SHOULD (MUST?) have included an Accept-encoding field
   including the name of the content-coding used.  The response MUST
   include a Content-Encoding header specifying the content-coding used.

   A cache that does not support the Accept-encoding header
   MUST NOT cache a 207 (Encoded Content) response, except if the
   cache is able to convert it to the identity content-coding before
   using it in response to a subsequent request, and then only if
   the response does not contain the "no-transform" Cache-control
   directive.
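
The caching rule above, together with the HTTP/1.1 check of
Content-Encoding against Accept-Encoding, could be sketched like this.
The function and parameter names are hypothetical, the 207 status code
is the one proposed here rather than an existing one, and a request
with no Accept-Encoding is assumed to accept only identity.

```python
from typing import Optional


def may_store_207(understands_accept_encoding: bool,
                  can_decode_to_identity: bool,
                  cache_control: str = "") -> bool:
    """Decide whether a cache may store a 207 (Encoded Content) response."""
    directives = {d.strip().lower()
                  for d in cache_control.split(",") if d.strip()}
    if understands_accept_encoding:
        # Such a cache can check later requests before reusing the entry.
        return True
    # Otherwise it may store only if it can convert the entry to the
    # identity content-coding, and only if transformation is permitted.
    return can_decode_to_identity and "no-transform" not in directives


def may_reuse(cached_coding: Optional[str],
              accept_encoding: Optional[str]) -> bool:
    """Check a stored entry's coding against a later request."""
    if cached_coding is None:
        return True   # identity entry: always usable
    if accept_encoding is None:
        return False  # no header: only identity may be sent
    accepted = {tok.strip().lower()
                for tok in accept_encoding.split(",") if tok.strip()}
    return cached_coding.lower() in accepted
```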

This would prevent some HTTP/1.0 caches from storing "gzip"ed
or "compress"ed results, but it's not clear that there is much
of this happening today.  (Does anyone have proxy statistics that
show the fraction of cacheable responses that have Content-Encoding
headers?)

-Jeff
Received on Wednesday, 19 February 1997 16:46:21 UTC