Closure on canonicalization, I hope from Marc VanHeyningen on 1994-12-20 (ietf-http-wg@w3.org from October to December 1994)

From: Marc VanHeyningen <mvanheyn@cs.indiana.edu>
Date: Tue, 20 Dec 1994 11:42:57 -0500
To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <6413.787941777@moose.cs.indiana.edu>
I'm sure nobody is surprised to hear me say that I believe the
treatment of canonicalization of objects in draft 01 of the spec
still needs some work.

Under the new spec, a server is compliant if it serves up ASCII
plaintext as having, say, CRs as line delimiters.  A client, however,
is compliant if it recognizes only CRLFs as line delimiters
(recognizing CRs is recommended for tolerance but not required).  So
two implementations may both be compliant with the spec and still fail
to interoperate properly.  I hope we can agree that is not OK.
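
To make the mismatch concrete, here is a minimal sketch in Python.  The
function name and the sample body are made up for illustration; this is
not taken from any real client.  It models a client that recognizes
only CRLF and feeds it a body from a server that uses bare CRs.

```python
def strict_crlf_lines(body: bytes) -> list[bytes]:
    """Hypothetical client that treats only CRLF as a line delimiter."""
    return body.split(b"\r\n")


# A body as a CR-only server might send it (illustrative data).
cr_only_body = b"first line\rsecond line\rthird line"

lines = strict_crlf_lines(cr_only_body)
# The client finds no CRLF at all, so the whole body comes back
# as a single "line" even though the server sent three.
print(len(lines))
```

Both sides would be compliant under the draft as I read it, yet the
client sees one line where the server meant three.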

Mind you, I don't even know what exactly the current spec is saying.
If we are effectively requiring all implementations to understand all
non-canonical forms that exist, I think we need to enumerate exactly
what they are.  Other than line breaks, I have no idea what
non-canonical representations clients (and servers, for that matter)
must understand.

In the interest not of philosophical purity (as Phil H-B might say,
screw philosophical purity) but making something clear and reasonable,
I'd suggest something like the following phrasing:
----
  Conversion to canonical form:

  Internet media types [cite 1590] are registered in a canonical form.
  In general, Object-Bodies transferred via HTTP must be represented
  in the appropriate canonical form prior to the application of
  Content-Encoding and/or Content-Transfer-Encoding, if any, and
  transmission.

  Object-Bodies with a Content-Type of text/*, however, may represent
  line breaks not only in the canonical form of CRLF, but also as CR
  or LF alone, used consistently within an Object-Body.  Conforming
  implementations *must* accept any of these three byte sequences as
  representing a single line break in text/* Object-Bodies.

  RATIONALE:  A handful of different local representations of textual
  files exists in current practice.  Conversion to canonical form can
  pose a significant performance loss, while understanding different
  line break representations is not an inordinate burden, nor an
  excessive requirement beyond current practice.
----
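
As a rough illustration of how small the burden on a receiver is,
here is a sketch in Python of accepting all three sequences.  The
function name is my own invention, and it assumes the body has
already been decoded from any Content-Encoding or
Content-Transfer-Encoding.

```python
import re

# CRLF is listed first so that it is consumed as one line break
# rather than as a CR followed by a separate LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_text_body(body: bytes) -> list[bytes]:
    """Split a text/* Object-Body on CRLF, bare CR, or bare LF."""
    return _LINE_BREAK.split(body)


for sample in (b"a\r\nb\r\nc", b"a\rb\rc", b"a\nb\nc"):
    print(split_text_body(sample))
```

All three samples split into the same three lines.  Note the sketch
does not check that a body uses one convention consistently; it
simply tolerates whatever arrives.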
Part of the question is whether it's implicit that all text/* types
are ASCII-based and that CR and LF are the appropriate interpretations
of octets 0D and 0A.  The newest draft of MIME calls for such, as does
section A.2 in the current HTTP draft, so I stuck to that path.

I'm not thrilled about requiring implementations to accept different
line break sequences, but we kind of have to either:

1. Require all implementations to canonicalize
2. Require all implementations to understand certain specific
   non-canonical forms (and say exactly what they are)
3. Require some kind of negotiation process by which servers and
   clients can indicate what variations each can understand

We seem to mostly have consensus that 1) is potentially too
expensive.  I think 3) is more complexity than we particularly want to
endure right now [it would lead to something like "Accept-Encoding:
unix-linebreaks" or maybe "Content-Transfer-Encoding:
8bit-sloppy-eol".]  That leaves 2), which is probably what will come
closest to satisfying folks.
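
For comparison, option 1) on the sending side would look something
like the following Python sketch.  The function name is hypothetical,
and the sketch assumes an ASCII-based text/* body held in memory.  The
point is only that every octet of every text body has to be examined
before transmission, which is where the expense comes from.

```python
import re


def canonicalize_text_body(body: bytes) -> bytes:
    """Rewrite every CRLF, bare CR, or bare LF as canonical CRLF."""
    # One full pass over the body; an existing CRLF is matched as a
    # unit, so it is left as a single CRLF rather than doubled.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", body)


print(canonicalize_text_body(b"unix\nmac\rdos\r\nend"))
```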

- Marc
--
Marc VanHeyningen  <URL:http://www.cs.indiana.edu/hyplan/mvanheyn.html>
Received on Tuesday, 20 December 1994 08:44:47 UTC