Re: HTTP/1.1 draft 12 aug 1996 and content encodings from Nicolai Langfeldt on 1996-09-15 (ietf-http-wg@w3.org from July to September 1996)

From: Nicolai Langfeldt <janl@ifi.uio.no>
Date: Sun, 15 Sep 1996 22:31:21 +0200
To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <199609152031.WAA20771@ifi.uio.no>
Koen Holtman <koen@win.tue.nl> keyed:
...
> An empty Accept-Encoding value indicates none are acceptable.
...

You are right.  The type of this paragraph is clearly too small for me
:-)

Larry Masinter <masinter@parc.xerox.com> keyed:
...
> > * Secondly about the usage of the Content-Encoding header.  I have
> > seen, in various places, that the correct Content-Encoding for a file
> > named index.html.gz should be 'gzip'.  At first glance this is
> > reasonable.  But in a content negotiation context it's confusing and
> > results in needless complexity.
...
> Your two scenarios are not mutually exclusive.
...

Are both scenarios correct/intended usage of the Content-encoding
header?

> If you have a file index.html.gz which is gzipped HTML, and you
> deliver the data without transformation, then the result should be
> labelled 
> 
> content-type: text/html
> content-encoding: gzip
> 
> no matter whether you're returning the data as a result of requesting
> "index.html.gz" or "index.html". The correct behavior in the client is
> to interpret the data returned according to the content- labelling of
> the result.

The decisions of an information presentation client like Netscape are
in this case easy.  For a retrieve-and-store client the correct
decoding and parsing process is likewise easily determined.

> If a client wants to save the results to a disk file, the client might
> want to make up a convenient file name; however, the file name it
> makes up need not look like the URL that was requested. Presumably,
> though, "save to file" would save the contents AFTER it was unencoded.
> IF you have local conventions that text/html files are saved with
> ".html" at the end, THEN you might want to process the URL in order to
> generate that as a sample file name.

In the case of an automated retrieve-and-store client we now have a
_hard_ problem.  I will attempt to explain.

Automatic copying through HTTP (as w3mir does) has some legitimate
uses:
- Fast private reading off disk.  This needs to preserve HTML link
  integrity.  The filenames and encodings used for this DO matter.
- Copying of entity hierarchies from one server to another.  Since
  servers use filenames to determine the content type and encoding of
  documents, this DOES matter.
- Priming cache servers.  There is no problem with that in this
  context.

It would be nice if clients made for this purpose were also served by
the HTTP specification, since I think this class of clients will be
more and more important.

Koen Holtman and Larry argue that knowledge of local filename
conventions should be used to determine what the Right Thing is.  This
is a complicating strategy, and IMHO a mess.  For a client like w3mir
you can manage to keep things simple, since it only needs special
knowledge about HTML files.  Whether something is HTML can easily be
determined, since HTML files start with <!DOCTYPE HTML...>, and if they
don't we can edit them after retrieval, before saving to disk.
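
A minimal sketch of that check, in Python.  The function name and the
choice to skip leading whitespace and ignore case are assumptions made
for illustration; this is not taken from w3mir.

```python
def looks_like_html(body: bytes) -> bool:
    """Return True if the body begins with an HTML DOCTYPE declaration."""
    # Assumption: leading whitespace is allowed and the match is
    # case-insensitive.  Only the first few bytes are inspected.
    head = body.lstrip()[:14].lower()
    return head.startswith(b"<!doctype html")
```

A body that fails this test would be the case where the client edits
the document before saving it.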

So I disagree with this, because it complicates things, and is
useless.  Consider the scenarios again:

1. > GET index.html.gz
   < Content-type: text/html
   < Content-encoding: gzip

   This is _only_ useful for simple decoding in the client.  But it
   requires knowledge of filename conventions if you want to determine,
   in an automated manner, whether the server has done content
   negotiation or not, and whether to save the document encoded or
   decoded to disk.

   The only purpose of this functionality is to save disk space and
   to be able to put <a href="index.html.gz"> in your docs.  But this
   usage essentially means that content-encoding negotiation (and
   associated headers) is unneeded, and a redundant part of the spec;
   given the existence of content-encoding negotiation this is a
   useless usage, serving only to complicate clients.  I will try to
   justify this claim in my discussion of scenario 2.

   In the majority of cases a client will only request a .gz file
   when a .gz file has been provided in the server namespace for fast
   transfer of PostScript or other documents of nontrivial size, like
   the HTTP spec.  I have _never_ seen this used for HTML (not that I
   surf particularly much).  So, easy decoding in this scenario would
   seem to be a non-issue.

   Additionally, the client _did_ ask for index.html.gz, and the server
   did _not_ apply any encoding the client did not ask for implicitly.

   In conclusion: Forbidding this use will _not_ break anything, and it
   will simplify some of the less complex clients.


2. > GET index.html
   < Content-type: text/html
   < Content-encoding: gzip
   
   This makes sense.  Here the file index.html.gz might exist on disk
   and be served, or the site might have plenty of CPU and little
   bandwidth, and prefer to gzip documents if the client can handle
   it.

   This means that there is no need to refer to index.html.gz as in
   scenario 1, because the server can mix and match encodings as
   needed based on CPU, disk space, bandwidth or whatever other
   considerations apply.  The only remaining reason to refer to
   .gz files is files like draft-ietf-http-v11-spec-07.ps.gz,
   which we in fact mostly want saved on disk or printed anyway,
   not decoded and shown on screen right away, since they're so large
   and impractical to read on a screen.

   Here the server _did_ apply an encoding not asked for implicitly,
   and the Content-encoding _does_ make 100% sense.  It does not
   complicate automatic retrieve-and-save clients either, because there
   is no need for knowledge of local filename conventions.  You just
   decode and save as the basename of the requested file.
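
If only scenario 2 were allowed, the client-side rule could be as
small as the following Python sketch.  The function name, the plain
dict of headers keyed by "Content-Encoding", and the "index.html"
fallback for URLs ending in "/" are all assumptions made for
illustration.

```python
import gzip
import posixpath
from urllib.parse import urlsplit


def save_name_and_body(url: str, headers: dict, body: bytes):
    """Undo a gzip Content-Encoding and name the file after the URL."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # The server applied the encoding for transfer only, so the
        # stored copy is the decoded entity.
        body = gzip.decompress(body)
    # No filename conventions are consulted: the basename of the
    # requested path is used as-is.  The fallback is hypothetical.
    name = posixpath.basename(urlsplit(url).path) or "index.html"
    return name, body
```

Under this rule a request for index.html always ends up on disk as
an uncompressed index.html, whatever encoding was used on the wire.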

Furthermore, allowing both scenarios complicates things too, because
determining whether you are faced with scenario 1 or 2 requires
knowledge of filename conventions and suitable heuristics.  It is
true, though, that this is the same knowledge needed to determine
whether it's correct to save index.html.gz encoded or decoded in
scenario 1.
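
To make the cost concrete, here is a rough Python sketch of the kind
of heuristic a client needs when both scenarios are allowed.  The
suffix table and the function name are hypothetical; a real client
would have to guess at each server's conventions, which is the
problem.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table of local filename conventions: suffix -> encoding.
ENCODING_SUFFIXES = {".gz": "gzip", ".Z": "compress"}


def classify(url: str, content_encoding: str) -> str:
    """Guess which scenario a response belongs to from the URL suffix."""
    if not content_encoding:
        return "unencoded"
    suffix = posixpath.splitext(urlsplit(url).path)[1]
    implied = ENCODING_SUFFIXES.get(suffix)
    if implied == content_encoding.strip().lower():
        # The name already implies the encoding (scenario 1).
        return "scenario 1"
    # The encoding was added by the server (scenario 2).
    return "scenario 2"
```

The table is the fragile part: it has to be kept in step with
whatever suffixes each server happens to use.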

So, to conclude, I think that:
- Scenario 1 serves no purpose and requires higher complexity for
  correct decisions in automated retrieve-and-store clients.
- Scenario 2 is the Right Thing, and should be the only allowed
  scenario.  Content-Encoding should _only_ be used when
  content-encoding negotiation has been done.

Regards,
  Nicolai Langfeldt
Received on Sunday, 15 September 1996 13:35:21 UTC