Re: Content encoding problem...

Roy's messages have been helpful.  I certainly understood that
browsers are not the only clients involved, but I hadn't really
made another important distinction.  Roy hints at it, but I don't
think he's made it clear (perhaps because he won't agree with me
that it exists).

If you bear with me through another long message, I think we
can actually specify "Accept-Encoding" so that Roy and I are
both happy.

I think we can more or less agree on several things:

	(1) It's not good if any client tries to interpret
	the content of a response without realizing that it
	has been encoded (e.g., a browser rendering an HTML
	page, or an automated network manager that sets off
	alarm bells when something seems wrong).

	(2) It's also not good if a client wants to
	interpret (e.g., render) a response, but finds
	that it has been encoded in a way that the client
	doesn't understand, when the client *would* have
	been able to handle the identity encoding.
	
	(3) It's also not good if a server fails to send
	a response to a client because it's not sure if
	the client will be able to use it, and in fact
	all that the client wants to do is to make a copy
	of the server's bits.

Roy seems to grudgingly grant #1 and #2, when he writes:
    If a UA receives a response that includes a Content-Encoding value
    which it is incapable of decoding, then the UA should be smart
    enough to know that it cannot simply render the response.  There is
    no excuse for failing to anticipate extensions of an extensible
    field value.
There may be no excuse, but Henrik says that this happens, and we
need to face up to that.

I hadn't fully appreciated the issue behind #3, which Roy expresses as:

    For example, I have a save-URL-to-file program called lwpget.  It
    never sends Accept-Encoding, because there has never been any
    requirement that it should, and no HTTP/1.0 server ever needed it.
    Should lwpget be prevented from working because of the
    *possibility* that it might be a rendering engine that doesn't
    understand the encoding?

Roy and I also apparently agree that there is a distinction (which
has already been made in the past) between browser clients and
non-browser clients (such as lwpget or a mirroring system).  But
I think that the missing distinction is this one:

	Some clients interpret the bits of the response
	
	But some clients just copy the bits without interpreting them

An unknown (or unexpected) content-coding is a problem for
bit-interpreting clients (such as a browser), but it's not a problem
for bit-copying clients (such as a mirror or lwpget).

There's another distinction that we need to make:

	Some resources are "inherently content-coded"; they exist
		only in a form that requires decoding before
		most useful interpretations
	
	Some responses are "content-coded in transit"; a server
		or proxy has applied the encoding to a value
		that is also available as "plaintext"

Example of the first type:
    http://www.ics.uci.edu/pub/ietf/http/rfc1945.ps.gz

Example of the second type:
    http://www.ics.uci.edu/pub/ietf/http/rfc1945.html after the
	server (or some proxy) has passed it through gzip
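
To make the distinction concrete, the two responses might carry
headers roughly like these (the values are illustrative, not
something either of us has specified):

    For the first type:
        Content-Type: application/postscript
        Content-Encoding: x-gzip

    For the second type:
        Content-Type: text/html
        Content-Encoding: x-gzip

Note that nothing in the headers distinguishes the two cases; the
difference exists only in what the server has available, which is
why the rules below have to be stated in those terms.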

With these distinctions in mind, I can now state what I believe
are useful goals:

	(1) a bit-copying client wants to have the server's default
	representation of a resource, whether this is encoded or
	not.  E.g., if server X is mirroring the contents of server Y,
	then the result (response body) of retrieving
		http://X/foo
	should be the same as the result of retrieving
		http://Y/foo

	(2) a bit-interpreting client needs to have, ultimately,
	the unencoded representation of the resource.  For example,
	if my browser retrieves an HTML file, then at some point
	it has to have a non-compressed version of this file before
	it can render it.

Now, these two goals are not inconsistent with applying encodings
(such as compression) at various stages.  For example, when a
bit-copying client that understands gzip retrieves an HTML resource
from a server that understands gzip, we would probably prefer
that the bits transmitted over the wire between these two are sent
using gzip compression, even if the mirrored result is decompressed
before anyone else sees it.
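
For example, the exchange on the wire might look like this
(illustrative syntax only):

    GET /foo HTTP/1.1
    Host: Y
    Accept-Encoding: gzip

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Encoding: gzip

    [compressed body]

after which the mirror at X gunzips the body before storing it, so
that X's own clients see the same bits they would have gotten from Y.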

So here's what I think is the right solution (a rough sketch in
code follows the rules):

	(1) If there is only one representation available
	at the server, or if the server's "normal" representation
	is encoded, then the server should send that representation.

	(2) If there are multiple representations, and the client
	does not specify which one it prefers (i.e., the request
	does not include "Accept-Encoding"), then the server should
	send the least-encoded representation available.

	(3) If there are multiple representations, and the client
	specifies (using "Accept-Encoding") that it is willing
	to accept at least one of these, then the server should
	send the "best" of these acceptable representations.
	
	(4) If there are multiple representations, and the client
	specifies (using "Accept-Encoding") a set of encodings
	that it is willing to accept, but none of these encodings
	matches an available representation, then the server should
	return "None Acceptable".
	
I think these rules satisfy both Roy's stated requirements and
mine.  That is, all of the existing clients will continue to
get the responses they get today, because they don't send
"Accept-encoding".  In particular, mirroring clients work exactly
the way Roy wants (by rule #1), and servers that optionally
compress responses before sending them won't do this to unsuspecting
HTTP/1.0 browsers (by rule #2).  However, rule #3 allows HTTP/1.1
clients and servers to agree to use any encoding that they choose,
no matter what is listed in the HTTP/1.1 spec.  (Presumably, the
encoding name should be listed in the IANA registry.)

I think this is a codification of what Roy meant when he wrote:
    It is the responsibility of the origin server to prevent [a browser
    rendering garbage] from happening by accident.  It is not possible
    to prevent [it] from happening on purpose, because attempting to do
    so breaks my (2).
I'm interpreting the bracketed [it] to mean "sending the server's
normal representation of a resource".
    
Roy might object to my rule #4, based on this:
    HTTP/1.1 browsers will have "Save As..." functionality, and thus
    it isn't possible for an HTTP/1.1 application to exhaustively list
    all accepted content-codings in an Accept-Encoding field for every
    type of GET request it will perform.

If one wants to be as aggressive as possible about using compression
(or other encodings) in such cases, there is the potential for needing
one extra round trip.  That is, the client can either send no
Accept-Encoding at all, which (probably) will result in a
non-compressed transfer ... or the client can send an Accept-Encoding
field that lists a finite set of encodings it can handle, taking a
chance that none of these will be available at the server, and so
requiring one more round trip for the client to retry the request with
no Accept-Encoding header.
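
In code, that aggressive strategy is just a retry loop.  Here is a
sketch using Python's http.client; the value 406 is only a
placeholder, since "None Acceptable" has no assigned number yet:

    import http.client

    NONE_ACCEPTABLE = 406   # placeholder; "None Acceptable" has no number yet

    def fetch(host, path):
        conn = http.client.HTTPConnection(host)
        # First try: name only the encodings we can actually decode,
        # giving the server a chance to compress the transfer.
        conn.request("GET", path,
                     headers={"Accept-Encoding": "gzip, compress"})
        resp = conn.getresponse()
        body = resp.read()
        if resp.status == NONE_ACCEPTABLE:
            # None of our encodings was available.  Retry with no
            # Accept-Encoding at all, so that rule #2 applies; this is
            # the one extra round trip.
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()
        conn.close()
        return resp.status, body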

But this raises a further question, because what does "Save As"
really mean when the server has a choice of encodings?  Does the
client want to save the decoded contents, or one of the encoded
representations?  Does this depend on whether the server's default
representation is compressed, or if the compression was applied
in flight?  These seem like UA questions, not protocol questions.
For example, Netscape 3.0 knows enough to gunzip
    http://www.ics.uci.edu/pub/ietf/http/rfc1945.ps.gz
before invoking a Postscript previewer on it, but "Save As" stores
it as a compressed file.

Regarding my scenario with the HTTP/1.0 proxy cache and HTTP/1.0
client, I still think this requires the use of a special status
code to prevent accidents (unwitting rendering of garbage).  Roy
can hope that people will replace their HTTP/1.0 proxies and
HTTP/1.0 browsers because "comes a point when we must recognize
the limitations of older technology and move on", but wishing won't
make it so.  (And I could have argued that Roy's lwpget program,
and existing mirror clients, should be upgraded, but I don't think
we should be making anything that works today obsolete.)

At any rate, on this topic, Roy writes:
    this particular scenario only occurs if the URL in question has
    negotiated responses based on Accept-Encoding.  It is quite
    reasonable for the origin server to modify its negotiation
    algorithm based on the capabilities of the user agent, or even the
    fact that it was passed through a particular cache; I even
    described that in section 12.1.

I think it would be far simpler (and safer, because it's probably
impossible to enumerate the universe of User-Agent values) if the
server simply used my proposed 207 status code for "negotiated"
encodings.  I.e., if the server follows my rule #3, then it sets
207.  If the server is following rules #1 or #2, then there hasn't
really been a negotiation, and I suppose it makes sense to cache
the response.  Yes, in a world without any HTTP/1.0 proxy caches
one could rely on "Vary: Accept-encoding", but it's pointless to
expect all such caches to disappear any time soon.
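
The decision the server has to make is then simple enough to state
in a few lines of code (a sketch; 207 is my proposed "Encoded
Content" code, not anything registered):

    def response_status(chose_by_negotiation):
        # Rule #3 applied: the coding was negotiated via Accept-Encoding,
        # so return 207 and keep an HTTP/1.0 proxy cache from replaying
        # the encoded body to a client that never asked for it.
        if chose_by_negotiation:
            return 207   # proposed "Encoded Content"
        # Rules #1 and #2: no negotiation took place, so a plain 200 is
        # safe for any cache to store.
        return 200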

By the way, Roy, when you write, re: my proposed 207 (Encoded Content)
status code,
	it breaks the distinction between the response status and
	the payload content, which would be extremely depressing
	for the future evolution of HTTP.
I really have no idea what you mean by this.  Perhaps you could
elaborate?

-Jeff
