Re: Comments on the HTTP/1.0 draft. from Marc VanHeyningen on 1994-11-30 (ietf-http-wg@w3.org from October to December 1994)

From: Marc VanHeyningen <mvanheyn@cs.indiana.edu>
Date: Wed, 30 Nov 1994 09:05:05 -0500
To: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <10635.786204305@moose.cs.indiana.edu>
Thus wrote: "Roy T. Fielding"
>Marc VanHeyningen writes:
>> Rather egregiously missing is a reference to transmitting network
>> objects in canonical form.  Section 3.2 should mention this; a
>> reference to the canonical encoding model in Appendix G of RFC 1521
>> (specifically step 2) probably should suffice.  The only place this is
>> hinted at is in the appendix section on tolerance of broken
>> implementations, but the spec should explicitly say what the proper
>> behavior is, just in case any servers ever actually do that. :-)
>
>The specified behavior will be "no canonical encoding of the object-body
>is required before network transfer via HTTP, though gateways may need
>to perform such canonical encoding before forwarding a message via a
>different protocol.  However, servers may wish to perform such encoding
>(i.e. to compensate for unusual document structures), and
>may do so at their discretion."

I must be misunderstanding what you're saying.  Why is
canonical encoding unnecessary?  Do you really mean that any server,
on any architecture, can (for example) transmit text files using
whatever its local system convention for line breaks might happen to
be (CR, LF, CRLF, whatever) without standardizing it?  How can we be
passing local forms around between different machines and expect it to
work reliably?

Yes, I know that pretty much all existing servers run under UNIX and
just blindly send the UNIX line break without making any effort to
normalize it, but the spec should document correct behavior, with
existing behavior mentioned, as it currently is, in the appendix.  The
current document is a little strange, in that the appendix recommends
assuming any newline is a line break to tolerate bad servers/clients,
but nowhere in the document does it seem to say what the *correct*
behavior is, or why those programs are bad.  I believe strongly that
the correct behavior is to send things only in canonical form.
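
To make "canonical form" concrete for text types, here is a minimal
sketch of the line-break step, assuming (as in step 2 of Appendix G of
RFC 1521) that the canonical line break is CRLF.  The function name is
made up for illustration and is not taken from any existing server.

```python
def to_canonical_text(data: bytes) -> bytes:
    """Rewrite local line breaks (bare CR, bare LF, or CRLF) as CRLF."""
    # Collapse every convention to LF first, so that an existing CRLF
    # pair is not turned into CR CR LF by the final replacement.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")
```

A server would apply this only to text types; binary objects have to
pass through untouched.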

Actually, after thinking about this a little more, I realized the MIME
encoding model isn't adequate, because HTTP adds a new layer of
encoding ("Content-Encoding: x-gzip" or the like) and the spec needs
to explicitly state when that encoding gets done and undone relative
to canonicalization (if we include that) and CTE.  I think the model
should specify that content-encoding happens after canonicalization
but before the CTE, if any (there should be none, of course, normally)
is applied.
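
The ordering proposed above could be sketched roughly as follows.  This
is only an illustration under my own assumptions: the function names
are hypothetical, gzip stands in for "x-gzip", and base64 is the only
non-identity CTE shown.

```python
import base64
import gzip


def canonicalize(text: bytes) -> bytes:
    """Step 1: local line breaks become CRLF."""
    unified = text.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")


def encode_body(local_text: bytes, cte: str = "binary") -> bytes:
    """Sender side: canonicalize, then content-encode, then apply the CTE."""
    body = canonicalize(local_text)      # step 1: canonical form
    body = gzip.compress(body)           # step 2: Content-Encoding
    if cte == "base64":                  # step 3: CTE (normally none)
        body = base64.encodebytes(body)
    return body


def decode_body(wire: bytes, cte: str = "binary") -> bytes:
    """Receiver side: undo the same steps in reverse order."""
    if cte == "base64":
        wire = base64.decodebytes(wire)  # undo the CTE first
    return gzip.decompress(wire)         # then undo the content-encoding
```

The receiver ends up with the canonical form, not the sender's local
form, which is the point of doing canonicalization first.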

(Actually, I wouldn't object to outright prohibiting any CTE other
than clear ones like 7bit, 8bit and binary, but maybe there are
reasons to allow q-p and base64.)

>> As near as I can tell, the spec constrains all header values to be
>> US-ASCII, meaning nothing that is not US-ASCII may be contained in
>> them.  We might consider permitting non-US-ASCII information in at
>> least some headers, probably using RFC 1522's model.
>
>I'd rather not.  If there is a perceived need for non-US-ASCII information
>in header field value text and comments (I don't see any), then I think
>they should only be encoded by gateways during export.

I don't see an immediate overwhelming need, but it's there.  Plenty of
people who have names with non-ASCII characters in them like to
include those names in a From: header, for example.  It's not
necessarily urgent, but I think it will get used anyway in areas like
From: that are only used for logging and shouldn't break anything.
It's not as though it would be legal in URLs or Content-Types or
anything heavily interpreted, and I don't think it would break any
existing software.

I don't see how having them only be encoded by gateways will suffice.
How would one represent non-US-ASCII information in a header?
Specifically, how would one indicate what character set is being
employed, if not by using MIME part 2?  (Yes, it's kind of an ugly
wheel, but it works and it's backward-compatible.)
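
For anyone who hasn't looked at MIME part 2, a rough sketch of what an
RFC 1522 encoded-word looks like when built by hand is below.  The
helper name and the sample name are invented for illustration, only the
"B" (base64) encoding is shown, and the sketch ignores the length limit
that RFC 1522 places on each encoded-word.

```python
import base64
from email.header import decode_header


def encoded_word(text: str, charset: str = "ISO-8859-1") -> str:
    """Build =?charset?B?base64-data?= for a header phrase."""
    raw = text.encode(charset)
    payload = base64.b64encode(raw).decode("ascii")
    return f"=?{charset}?B?{payload}?="


word = encoded_word("Bjørn")             # pure US-ASCII on the wire
raw, charset = decode_header(word)[0]    # the charset label travels with it
assert raw.decode(charset) == "Bjørn"
```

The result is plain US-ASCII, so software that doesn't understand it
just sees an odd-looking token, and the charset is labelled in the
header itself.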

Oh, minor nit:  In the date section, the grammar makes the
Day-of-the-week component mandatory.  I believe it should be made
optional, at least in 822/1123 style, since that's how it is in 822
(not to mention there's no good reason for it to be there, since it
doesn't provide any machine-useful information and won't normally be
viewed directly by a human.)
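
As a rough illustration of the optional form, a parser for the
822/1123 style might look like the sketch below.  The pattern is my own
approximation and not the draft's grammar; the function name is
hypothetical, and only the four-digit-year GMT form is handled.

```python
import re

# Day of the week and its trailing ", " are wrapped in an optional group.
DATE_1123 = re.compile(
    r"^(?:(?P<wkday>Mon|Tue|Wed|Thu|Fri|Sat|Sun), )?"
    r"(?P<day>\d{2}) "
    r"(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"(?P<year>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) GMT$"
)


def parse_date(value: str):
    """Return the matched fields as a dict, or None if the value is malformed."""
    match = DATE_1123.match(value)
    return match.groupdict() if match else None
```

Both "Wed, 30 Nov 1994 14:05:05 GMT" and "30 Nov 1994 14:05:05 GMT"
are accepted, and the parser never needs the weekday to recover the
date.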
--
Marc VanHeyningen  <URL:http://www.cs.indiana.edu/hyplan/mvanheyn.html>
Received on Wednesday, 30 November 1994 06:07:14 UTC