- From: Marc VanHeyningen <mvanheyn@cs.indiana.edu>
- Date: Tue, 20 Dec 1994 11:42:57 -0500
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
I'm sure nobody is surprised to hear me say that I believe the treatment of canonicalization of objects in draft 01 of the spec still needs some work.

Under the new spec, a server is compliant if it serves up ASCII plaintext as having, say, CRs as line delimiters. A client, however, is compliant if it recognizes only CRLFs as line delimiters (recognizing CRs is recommended for tolerance but not required). So different implementations may both be compliant with the spec but not interoperate properly. I hope we can agree that is not OK.

Mind you, I don't even know what exactly the current spec is saying. If we are effectively requiring all implementations to understand all non-canonical forms that exist, I think we need to enumerate exactly what they are. Other than line breaks, I have no idea what non-canonical representations clients (and servers, for that matter) must understand.

In the interest not of philosophical purity (as Phil H-B might say, screw philosophical purity) but of making something clear and reasonable, I'd suggest something like the following phrasing:

----
Conversion to canonical form:

Internet media types [cite 1590] are registered in a canonical form. In general, Object-Bodies transferred via HTTP must be represented in the appropriate canonical form prior to the application of Content-Encoding and/or Content-Transfer-Encoding, if any, and transmission.

Object-Bodies with a Content-Type of text/*, however, may represent line breaks not only in the canonical form of CRLF, but also as CR or LF alone, used consistently within an Object-Body. Conforming implementations *must* accept any of these three byte sequences as representing a single line break in text/* Object-Bodies.

RATIONALE: A handful of different local representations of textual files exists in current practice. Conversion to canonical form can pose a significant performance loss, while understanding different line break representations is not an inordinate burden, nor an excessive requirement beyond current practice.
----

Part of the question is whether it's implicit that all text/* types are ASCII-based and that CR and LF are the appropriate interpretations of octets 0D and 0A. The newest draft of MIME calls for such, as does section A.2 in the current HTTP draft, so I stuck to that path.

I'm not thrilled about requiring implementations to accept different line break sequences, but we kind of have to do one of the following:

1. Require all implementations to canonicalize
2. Require all implementations to understand certain specific non-canonical forms (and say exactly what they are)
3. Require some kind of negotiation process by which servers and clients can indicate what variations each can understand

We seem to mostly have consensus that 1) is potentially too expensive. I think 3) is more complexity than we particularly want to endure right now [it would lead to something like "Accept-Encoding: unix-linebreaks" or maybe "Content-Transfer-Encoding: 8bit-sloppy-eol".] That leaves 2), which is probably what will come closest to satisfying folks.

- Marc
--
Marc VanHeyningen <URL:http://www.cs.indiana.edu/hyplan/mvanheyn.html>
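As an illustration of option 2, here is a minimal sketch of how a receiver might treat CRLF, lone CR, and lone LF as a single line break in a text/* Object-Body. The function names `canonicalize_line_breaks` and `split_lines` are hypothetical and do not come from any draft or implementation; the sketch assumes an ASCII-based body in which octets 0D and 0A mean CR and LF.

```python
import re
from typing import List

# CRLF is listed first so that a CR LF pair is matched as one break,
# not as a lone CR followed by a lone LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def canonicalize_line_breaks(body: bytes) -> bytes:
    """Rewrite every accepted line break (CRLF, CR, or LF) as canonical CRLF."""
    return _LINE_BREAK.sub(b"\r\n", body)


def split_lines(body: bytes) -> List[bytes]:
    """Split a text body into lines without converting it first."""
    return _LINE_BREAK.split(body)
```

A receiver that only needs to read lines could call `split_lines` directly and skip the conversion cost. The sketch does not check that one convention is used consistently within a body, so a body mixing conventions (for example LF followed by CR) would be read as containing extra empty lines.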
Received on Tuesday, 20 December 1994 08:44:47 UTC