Re: Comments on the HTTP/1.0 draft. from Roy T. Fielding on 1994-12-02 (ietf-http-wg@w3.org from October to December 1994)

From: Roy T. Fielding <fielding@avron.ICS.UCI.EDU>
Date: Thu, 01 Dec 1994 17:58:13 -0800
To: Marc VanHeyningen <mvanheyn@cs.indiana.edu>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <9412011758.aa00432@paris.ics.uci.edu>

Marc VanHeyningen writes:
>[I wrote:]
>>The specified behavior will be "no canonical encoding of the object-body
>>is required before network transfer via HTTP, though gateways may need
>>to perform such canonical encoding before forwarding a message via a
>>different protocol.  However, servers may wish to perform such encoding
>>(e.g., to compensate for unusual document structures), and
>>may do so at their discretion."
> 
> I must not be understanding what you're saying correctly.  Why is
> canonical encoding unnecessary?  Do you really mean that any server,
> on any architecture, can (for example) transmit text files using
> whatever its local system convention for line breaks might happen to
> be (CR, LF, CRLF, whatever) without standardizing it?  How can we be
> passing local forms around between different machines and expect it to
> work reliably?

Yes.  Because (except in very few circumstances) it does work reliably.
I do not know of any server that does canonicalization.  Requiring ALL
servers to parse-and-replace, character-by-character, all text/* content
types is hideously inefficient and not appropriate for HTTP.  Instead,
that decision (of whether or not it's needed) should be left up to the
individual platform implementation.

> Yes, I know that pretty much all existing servers run under UNIX and
> just blindly send the UNIX line break without making any effort to
> normalize it, but the spec should document correct behavior, with
> existing behavior mentioned, as it currently is, in the appendix.  The
> current document is a little strange, in that the appendix recommends
> assuming any newline is a line break to tolerate bad servers/clients,
> but nowhere in the document does it seem to say what the *correct*
> behavior is, or why those programs are bad.  I believe strongly that
> the correct behavior is to send things only in canonical form.

The alternative is to specify that lines end in LF, and I don't like
that any better.  However, I agree that something should be said in the
spec regarding canonicalization.
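
To make the two positions concrete, here is a minimal sketch in Python of
both sides: a tolerant receiver that treats CR, LF, or CRLF as a line
break, and the character-level rewrite to CRLF that sender-side
canonicalization would require. The function names are hypothetical and
come from neither the draft nor any existing server.

```python
import re

# CRLF is listed first so that it matches as one break, not as CR then LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_lines(body: bytes) -> list[bytes]:
    """Tolerant receiver: accept any of CR, LF, or CRLF as a line break."""
    return _LINE_BREAK.split(body)


def to_crlf(body: bytes) -> bytes:
    """Sender-side canonicalization: rewrite every line break as CRLF.

    This is the scan over the whole body that a canonicalizing server
    would have to perform for each text/* object it sends.
    """
    return _LINE_BREAK.sub(b"\r\n", body)
```

The receiver-side split costs nothing extra for the server, whereas
to_crlf() must touch every byte of the body before it is sent.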

> Actually, after thinking about this a little more, I realized the MIME
> encoding model isn't adequate, because HTTP adds a new layer of
> encoding ("Content-Encoding: x-gzip" or the like) and the spec needs
> to explicitly state when that encoding gets done and undone relative
> to canonicalization (if we include that) and CTE.  I think the model
> should specify that content-encoding happens after canonicalization
> but before the CTE, if any (there should be none, of course, normally)
> is applied.

Yes, that should be clarified.
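
The ordering Marc proposes can be sketched as follows in Python. This is
only an illustration under assumptions: gzip stands in for "x-gzip",
base64 stands in for a non-identity CTE (which normally would not be
applied at all), the text is US-ASCII, and the function names are
hypothetical.

```python
import base64
import gzip


def encode_body(text: str) -> bytes:
    """Sender: canonicalize, then content-encode, then apply the CTE."""
    # 1. Canonicalization: normalize every line break to CRLF.
    canonical = (
        text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    ).encode("ascii")
    # 2. Content-Encoding (assumed here: gzip).
    compressed = gzip.compress(canonical)
    # 3. Content-Transfer-Encoding (assumed here: base64).
    return base64.b64encode(compressed)


def decode_body(wire: bytes) -> str:
    """Receiver: undo the layers in the reverse order."""
    compressed = base64.b64decode(wire)      # undo the CTE
    canonical = gzip.decompress(compressed)  # undo the Content-Encoding
    return canonical.decode("ascii")         # canonical form, CRLF breaks
```

The receiver ends up with the canonical form, not the sender's local
form, which is why the position of canonicalization in the chain has to
be stated explicitly.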

> (Actually, I wouldn't object to outright prohibiting any CTE other
> than clear ones like 7bit, 8bit and binary, but maybe there are
> reasons to allow q-p and base64.)

Clients may wish to support others in order to post newsgroup messages
through a proxy, but that is the only case I can think of.

>>> As near as I can tell, the spec constrains all header values to be
>>> US-ASCII, meaning nothing that is not US-ASCII may be contained in
>>> them.  We might consider permitting non-US-ASCII information in at
>>> least some headers, probably using RFC 1522's model.
>>
>>I'd rather not.  If there is a perceived need for non-US-ASCII information
>>in header field value text and comments (I don't see any), then I think
>>they should only be encoded by gateways during export.
> 
> I don't see an immediate overwhelming need, but it's there.  Plenty of
> people who have names with non-ASCII characters in them like to
> include those names in a From: header, for example.  It's not
> necessarily urgent, but I think it will get used anyway in areas like
> From: that are only used for logging and shouldn't break anything.
> It's not as though it would be legal in URLs or Content-Types or
> anything heavily interpreted, and I don't think it would break any
> existing software.
> 
> I don't see how having them only be encoded by gateways will suffice.
> How would one represent non-US-ASCII information in a header?
> Specifically, how would one indicate what character set is being
> employed, if not by using MIME part 2?  (Yes, it's kind of an ugly
> wheel, but it works and it's backward-compatible.)

I suppose it can be allowed for *text and *ctext.
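
For reference, the RFC 1522 model wraps non-US-ASCII text in an
"encoded-word" that names its character set, so the header itself stays
US-ASCII on the wire. A minimal sketch using the Python standard library's
email.header module follows; the sample name, the choice of ISO-8859-1,
and the helper names are assumptions made for illustration.

```python
from email.header import Header, decode_header


def encode_phrase(text: str, charset: str = "iso-8859-1") -> str:
    """Return `text` as an RFC 1522 style encoded-word (pure US-ASCII)."""
    return Header(text, charset).encode()


def decode_phrase(value: str) -> str:
    """Decode a header value that may contain encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unlabelled bytes are treated as US-ASCII.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


# Hypothetical display name containing one non-ASCII character.
encoded = encode_phrase("Jos\u00e9")
```

Software that does not understand encoded-words still sees only US-ASCII
characters in the field, which is the backward compatibility Marc refers
to.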

> Oh, minor nit:  In the date section, the grammar makes the
> Day-of-the-week component mandatory.  I believe it should be made
> optional, at least in 822/1123 style, since that's how it is in 822
> (not to mention there's no good reason for it to be there, since it
> doesn't provide any machine-useful information and won't normally be
> viewed directly by a human.)

I do not believe in optional portions of fixed-length fields -- they
make parsing things an absolute nightmare.  Besides, this format has
been in practice for a year now and appears to be the best for
interfacing with SMTP and NNTP gateways.
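
The parsing argument can be illustrated with a minimal sketch in Python.
It assumes the 822/1123-style layout with a mandatory day of the week and
a literal "GMT" zone, which makes every field sit at a fixed offset in a
29-character string. The function name and the sample value are
hypothetical.

```python
from datetime import datetime, timezone

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def parse_fixed_date(value: str) -> datetime:
    """Parse e.g. 'Thu, 01 Dec 1994 16:00:00 GMT' by fixed offsets."""
    # With the weekday mandatory, the string is always 29 characters long.
    if len(value) != 29 or value[3:5] != ", " or value[25:] != " GMT":
        raise ValueError("not in the fixed-length date format")
    if value[8:11] not in MONTHS:
        raise ValueError("unknown month name")
    return datetime(
        year=int(value[12:16]),
        month=MONTHS.index(value[8:11]) + 1,
        day=int(value[5:7]),
        hour=int(value[17:19]),
        minute=int(value[20:22]),
        second=int(value[23:25]),
        tzinfo=timezone.utc,
    )
```

If the day of the week were optional, the offsets would shift by five
characters and the parser would first have to work out which variant it
had been given.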


......Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                                     <fielding@ics.uci.edu>
                     <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>
Received on Thursday, 1 December 1994 18:06:41 UTC