Re: INTEGOK (aka CONTENT-MD5) from Larry Masinter on 1996-04-08 (ietf-http-wg@w3.org from April to June 1996)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Mon, 8 Apr 1996 12:53:39 PDT
To: paulle@microsoft.com
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <96Apr8.125346pdt.2764@golden.parc.xerox.com>

Paul,

I'm sorry, but the wording is still not right, when you say:

>   Lastly,
>   the canonical form of text types in HTTP includes several
>   line break conventions, so conversion of all line breaks
>   to CR-LF is not required before computing or checking
>   the digest: any acceptable convention should be left
>   unaltered for inclusion in the digest.

The phrase "canonical form" is a well known technical term. It is used
in this context:

When you have a large set of items A, and an equivalence relation E
among those items, such that two items a and b are deemed to be
equivalent if E(a,b), it is possible to define a 'canonical form' C of
items in A such that if C(a) = c, then E(a,c). Given a canonical form
C, E(x,y) iff C(x) = C(y). That is, the "canonical form" is a unique
form of an object that lets you test equivalence by testing for
equality.

In the context of MIME types, we say that there are several forms of a
text document, namely: one with CRs for linebreaks, one with CRLF for
linebreaks, and one with LF for linebreaks, and we wish these to be
deemed to be equivalent. For this reason, MIME designates the form
with CRLF to be the canonical form, so that you can determine
equivalence of two text streams by converting them to the canonical
form.
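
A minimal sketch of that definition applied to line breaks, assuming
Python and byte strings. The names `canonicalize` and `equivalent` are
hypothetical helpers chosen for illustration and come from no
specification:

```python
import re

# Matches any of the three line break conventions. CRLF is listed first
# so that a CR LF pair is treated as one line break, not two.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")

def canonicalize(text: bytes) -> bytes:
    """C(x): rewrite every line break as CRLF."""
    return _LINE_BREAK.sub(b"\r\n", text)

def equivalent(x: bytes, y: bytes) -> bool:
    """E(x, y) holds iff C(x) == C(y)."""
    return canonicalize(x) == canonicalize(y)

cr_form = b"line one\rline two\r"
crlf_form = b"line one\r\nline two\r\n"
lf_form = b"line one\nline two\n"

# All three forms share one canonical form, so they are equivalent.
same = equivalent(cr_form, lf_form) and equivalent(lf_form, crlf_form)
```

Note that the canonical form of `crlf_form` is itself, which is what
makes CRLF usable as the unique representative of its equivalence class.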

At least in SMTP mail, text types are presumed to be transported in
canonical form, and MD5 digests are computed on canonical form. By
computing MD5 digests of the canonical form, you are assured that
equivalent text forms will have the same digest.
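
As a rough illustration of that guarantee, assuming Python's standard
`hashlib` and `base64` modules and a base64 encoding of the digest as
used for the Content-MD5 header value. The helpers `to_crlf` and
`canonical_digest` are hypothetical names:

```python
import base64
import hashlib
import re

def to_crlf(data: bytes) -> bytes:
    """Convert CR, LF, and CRLF line breaks to the canonical CRLF."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)

def canonical_digest(data: bytes) -> str:
    """Base64 of the MD5 digest, computed over the canonical form."""
    raw = hashlib.md5(to_crlf(data)).digest()
    return base64.b64encode(raw).decode("ascii")

variants = [b"one\rtwo\r", b"one\r\ntwo\r\n", b"one\ntwo\n"]

# Three equivalent text forms collapse to a single digest value.
digests = {canonical_digest(v) for v in variants}
```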

Now, we decided that we did not wish HTTP to require transformation of
text types into canonical form before transmission, and this is fine.
However, subsequently also allowing the message digest to be computed
on a non-canonical form means that equivalent text streams will have
different message digests. I can live with that decision too, if
that's really what people want. (Canonicalizing a text stream while
computing the digest doesn't seem like it is computationally onerous,
though.) It is, however, totally unacceptable to make some statement
that

   "the canonical form of text types in HTTP includes several
   line break conventions,"

because it either represents a misuse of the phrase "canonical form",
or else asserts that two text streams that differ only by their line
break convention should not be treated as equivalent.
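
To make the cost of the two choices concrete, here is a sketch, again
assuming Python, of canonicalizing a stream while the digest is being
computed, next to the digest of the unaltered bytes. `canonical_md5` and
`raw_md5` are hypothetical names, and the chunked input stands in for
whatever a real server or client would read from the wire:

```python
import hashlib
import re

def canonical_md5(chunks) -> str:
    """MD5 over the CRLF form of a byte stream delivered in chunks."""
    digest = hashlib.md5()
    pending_cr = False  # previous chunk ended with CR, already sent as CRLF
    for chunk in chunks:
        if not chunk:
            continue
        if pending_cr:
            # A CRLF pair was split across chunks: drop the leftover LF.
            if chunk.startswith(b"\n"):
                chunk = chunk[1:]
            pending_cr = False
            if not chunk:
                continue
        if chunk.endswith(b"\r"):
            pending_cr = True
        digest.update(re.sub(rb"\r\n|\r|\n", b"\r\n", chunk))
    return digest.hexdigest()

def raw_md5(chunks) -> str:
    """MD5 over the bytes exactly as transmitted."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

lf_stream = [b"one\n", b"two\n"]
crlf_stream = [b"one\r", b"\ntwo\r\n"]  # CRLF split across a chunk boundary

same_when_canonical = canonical_md5(lf_stream) == canonical_md5(crlf_stream)
same_when_raw = raw_md5(lf_stream) == raw_md5(crlf_stream)
```

Under the first function the two streams agree; under the second they
do not, which is the consequence of digesting a non-canonical form.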

Received on Monday, 8 April 1996 13:01:49 UTC