Re: Closure on canonicalization, I hope from Larry Masinter on 1994-12-23 (ietf-http-wg@w3.org from October to December 1994)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Fri, 23 Dec 1994 14:49:32 PST
To: mvanheyn@cs.indiana.edu
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <94Dec23.144939pst.2760@golden.parc.xerox.com>

As long as we're going to redefine the Internet media types to be "as
registered for MIME with some exceptions", I think we might as well
handle the character set issue as well as the EOL convention one.

That is: object types in HTTP denote the corresponding Internet Media
Type as defined in RFC 1521 and are registered via the procedure
outlined in RFC 1590. However, HTTP makes the following modifications
to the interpretation of "text/*" media types:

a) the default character set (if no charset parameter is supplied) is
   ISO-8859-1, not US-ASCII.

b) no requirement is placed that documents which *could* be
   interpreted as US-ASCII must be labelled such. (This was a MIME
   requirement, but shouldn't be for HTTP).  

c) the end of line convention depends on the character set. In
   particular, while US-ASCII text requires 'CR LF' as the end of
   line, the end of line in ISO-8859-1 may consist of
      a single CR
      a single LF
      a combined CR LF
   and receivers must be prepared to determine the end of line
   convention used in the text/* type.

d) character sets such as UNICODE (where the data is represented as 
   a sequence of pairs of octets representing the hex coding of
   the data) are allowed. (This is apparently not true in the latest
   MIME draft).

e) character sets must be registered for use within HTTP; in addition
   to the current information that is included in the character set
   registration by IANA, the registration must describe the end of
   line convention for the character set, and include information
   about how to map the character set definition into other character
   sets or icons. 
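
To make point (a) concrete, here is a rough sketch of how a receiver
might pick the character set for an object body. The function name
charset_for and the hand-rolled parameter parsing are only
illustrative assumptions, not part of any draft; a real parser would
need to handle quoting and whitespace more carefully.

```python
from typing import Optional


def charset_for(content_type: str) -> Optional[str]:
    """Return the charset a receiver should assume for a Content-Type value.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    text/* defaults to ISO-8859-1 (item a), and other types get None.
    """
    media_type, _, params = content_type.partition(";")
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            explicit = value.strip().strip('"').lower()
            if explicit:
                return explicit
    if media_type.strip().lower().startswith("text/"):
        # HTTP default proposed above, where MIME would assume US-ASCII.
        return "iso-8859-1"
    return None
```

With this, "text/html" yields iso-8859-1, while
"text/plain; charset=US-ASCII" yields us-ascii because the explicit
label is honoured.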

The HTTP standard should define use of character sets US-ASCII and
ISO-8859-1, at the least, and probably could include descriptions of
UNICODE-1-1 (two octets per character), UNICODE-1-1-UTF-8 and
UNICODE-1-1-UTF-7 as per RFC1641 and RFC1642.
     
I'm happier with a definition that defines how clients are supposed to
*interpret* the data than one that speaks of 'canonicalization' or
'conversion', or where it is somehow the transport (HTTP) that is
messing up. 
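
As an illustration of the "interpret" view, a receiver of ISO-8859-1
text could simply treat CR LF, a lone CR, or a lone LF as one line
break each. This is only a sketch under that assumption; split_lines
is a made-up name, and it tolerates mixed conventions within one
body, which is more lenient than a rule requiring consistent use.

```python
import re
from typing import List

# CR LF is listed first so the pair is consumed as ONE line break
# rather than being split into a CR break followed by an LF break.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_lines(body: bytes) -> List[bytes]:
    """Split a text/* body into lines, accepting CR LF, CR, or LF."""
    return _LINE_BREAK.split(body)
```

Nothing is rewritten on the wire here; the sender's bytes are left
alone and only the receiver's reading of them changes.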


> I'd suggest something like the following phrasing:
> ----
>   Conversion to canonical form:

>   Internet media types [cite 1590] are registered in a canonical form.
>   In general, Object-Bodies transferred via HTTP must be represented
>   in the appropriate canonical form prior to the application of
>   Content-Encoding and/or Content-Transfer-Encoding, if any, and
>   transmission.

>   Object-Bodies with a Content-Type of text/*, however, may represent
>   line breaks not only in the canonical form of CRLF, but also as CR
>   or LF alone, used consistently within an Object-Body.  Conforming
>   implementations *must* accept any of these three byte sequences as
>   representing a single line break in text/* Object-Bodies.

>   RATIONALE:  A handful of different local representations of textual
>   files exists in current practice.  Conversion to canonical form can
>   pose a significant performance loss, while understanding different
>   line break representations is not an inordinate burden, nor an
>   excessive requirement beyond current practice.
> ----
> Part of the question is whether it's implicit that all text/* types
> are ASCII-based and that CR and LF are the appropriate interpretations
> of octets 0D and 0A.  The newest draft of MIME calls for such, as does
> section A.2 in the current HTTP draft, so I stuck to that path.

> I'm not thrilled about requiring implementations to accept different
> line break sequences, but we kind of have to either:

> 1. Require all implementations to canonicalize
> 2. Require all implementations to understand certain specific
>    non-canonical forms (and say exactly what they are)
> 3. Require some kind of negotiation process by which servers and
>    clients can indicate what variations each can understand

> We seem to mostly have consensus that 1) is potentially too
> expensive.  I think 3) is more complexity than we particularly want to
> endure right now [it would lead to something like "Accept-Encoding:
> unix-linebreaks" or maybe "Content-Transfer-Encoding:
> 8bit-sloppy-eol".]  That leaves 2), which is probably what will come
> closest to satisfying folks.

> - Marc

Received on Friday, 23 December 1994 14:53:22 UTC