Re: MHTML/HTTP 1.1 Conflicts from Jim Gettys on 1998-01-26 (ietf-http-wg@w3.org from January to March 1998)

From: Jim Gettys <jg@pa.dec.com>
Date: Mon, 26 Jan 1998 09:18:14 -0800
To: Jacob Palme <jpalme@dsv.su.se>
Cc: Nick Shelness <shelness@lotus.com>, IETF working group on HTML in e-mail <mhtml@segate.sunet.se>, http-wg@cuckoo.hpl.hp.com, Nick_Shelness/SSW/Lotus@lotus.com
Message-Id: <9801261718.AA10739@pachyderm.pa.dec.com>
>  From: Jacob Palme <jpalme@dsv.su.se>
>  Date: Sat, 24 Jan 1998 19:00:45 +0100
>  To: jg@pa.dec.com (Jim Gettys)
>  Cc: Nick Shelness <shelness@lotus.com>,
>          IETF working group on HTML in e-mail <mhtml@SEGATE.SUNET.SE>,
>          http-wg@cuckoo.hpl.hp.com, Nick_Shelness/SSW/Lotus@lotus.com
>  Subject: Re: MHTML/HTTP 1.1 Conflicts
>  
>  > I believe that recommending things that decrease interactive HTTP
>  >performance
>  > is unlikely to fly.  Each line split is of order 1% of a HTTP request (~200
>  > bytes), which is already too verbose and hurting interactive feel and
>  > performance.  Naive implementations that tried to do this would be hurt
>  > the most...  The scars of the "Accept" disaster are still on everyone's
>  >back...
>  
>  I am reading the HTTP spec now, to check for possible problems with
>  MHTML. Since I have not read all of it yet, can you say if there is
>  anything at all in the HTTP spec which says anything about the format
>  of bodies, i.e. of what comes after the blank line which ends the
>  HTTP heading. If the HTTP spec just regards this as an arbitrary string
>  of octets, formatted according to its MIME content type, then there
>  will probably not be any risk of conflict between MHTML and HTTP.
>  

You've caught on to HTTP....  Great...

HTTP can transport absolutely arbitrary content.  There is NOTHING which
restricts the bytes that come after the heading.

With the possible exception of transforming caching proxies, nothing looks 
at the payload (entity, in HTTP speak).  This is close to a hypothetical 
case; while transforming caching proxies exist (e.g. AOL), I find it unlikely 
that they would look inside an MHTML message and attempt to transform the 
images they found. (and you could inhibit this, by having the HTTP origin 
server add a "no-transform" cache-control directive).

Otherwise, we couldn't transport the arbitrary datatypes (often binary) 
in use in the Web.

So as far as HTTP is concerned, MHTML is just another bag of bits to
be transported.


>  In particular, does specifications about header line length, header folding,
>  end-of-line characters, etc., in the HTTP spec clearly say that these
>  specs only apply to lines in the HTTP header! If it does not say so,
>  but this is the intention, you need only say this more clearly, and
>  all conflicts with MHTML will disappear.
>  

Hopefully, Roy's words will make this clear when he has a chance to
draft them this week.  This is why we've added the whole multipart thing
as a full fledged issue on our issues list; it wasn't clear to you,
and therefore the HTTP spec failed; by definition, then, the HTTP spec's
use and non-use of Multipart needs clarification.

>  A possible exception from what I write above is the message/http
>  body part. I have scanned through the HTTP 1.1 draft and found message/http
>  mentioned twice: Once in the description of Trace, and once in chapter
>  19.1 which specifies the message/http MIME part.
>  
>  Certainly, if a message/http body part is transported through e-mail,
>  it must follow e-mail rules for heading lines, line folding, line
>  breaks, etc. So possible chapter 19.1 should have added to it (to
>  avoid confusion) that if a message/http body part is transported
>  through HTTP (in the trace) then its use of line length, line folding
>  and line breaks can follow HTTP conventions, but if a message/http
>  body part is transported through e-mail, then it must adhere to
>  e-mail rules in this respect.
>  

Yes.  Exactly.  I suspect that is a worthwhile thing to add to our spec,
as it is the definition of the http message type.

>  If you do not specify Content-Base in the HTTP spec, but if Content-
>  Base is specified in MHTML, then there is of course a possibility
>  that software based on MHTML will put Content-Base headers into
>  HTTP headings (unless you explicitly forbid this).

Why would they do this?  Content-Base isn't going to have any meaning
to HTTP clients, in any case. It would be ignored by a receiving client.

>  
>  > I think the best we can do is point out to implementers who hope to
>  > share MHTML and HTTP code that MHTML has limitations in this area, so
>  > that they can write the code right the first time, rather than twice.
>  
>  It is e-mail, rather than MHTML, which has limitiations. You could
>  write something like this, perhaps as added text in chapter 3.7.1
>  of the HTTP spec?

Seems like a good suggestion as to where to put it.

>  
>     The same content may sometimes be sent through e-mail, sometimes
>     through http. E-mail has different rules than http regarding
>     line length (preferred less than 76 characters in headings,
>     long lines are more often folded, in particular long URLs
>     are sometimes folded by inserting LWS which must be removed
>     before using the URL, line breaks must be CRLF, not bare CR
>     or bare LF). If an object is retrieved through http and then
>     forwarded through e-mail, this may require conversion. Such
>     conversion may invalidate checksums used for digital seals,
>     digitals signatures, etc. This can be avoided if the resource
>     is formatted, also in its http version, according to e-mail
>     rules.

Yes.  If the text content to be transported via HTTP is in e-mail rules,
nothing should break.  As far as that goes, the bytes could be arbitrary,
and nothing would break.
>  
>  > We can't at this date even contemplate splitting long URL's; it would break
>  > huge numbers of implementations.  You need to get in your head that HTTP
>  > is a binary, 8 bit clean transport (streaming RPC system) of arbitrary
>  > datatypes; it uses MIME like message syntax, but isn't really MIME.
>  
>  Certainly not in HTTP headings. But what about headings inside multipart
>  bodies, transported through HTTP?
>  

Again, with a few specific exceptions like range requests and multipart
form data, HTTP doesn't use multipart for anything.  HTTP isn't going
to look at the headings inside of multipart bodies, with the
noted exceptions of multipart types defined by and used by HTTP itself.

>  > The long line problem really doesn't apply to HTTP at all.  The HTTP
>  >protocol
>  > (as opposed to "smart" proxies speaking HTTP that might attempt to transform
>  > content, of which there are a few) does not look inside the payload (even
>  > these proxies would be unlikely to try to transform things inside of a MHTML
>  > message).  If the content is multipart, most HTTP implementations won't
>  > look at the content at all, or any of your headers (with this possible
>  > exception of transforming proxies, though I expect it is unlikely they
>  > will bother).
>  
>  Is there no user requirement among http users to be able to retrieve
>  resources through http and forward them through e-mail? If there is such
>  a user requirement, and if there is another user requirement that
>  security checksums should work accross such forwarding, then you do
>  have a problem with long lines, even if I can understand that you would
>  much prefer that there was no such problem.

The reality of the web is that there is tons of content without any line 
length enforced already exisiting...   It just isn't possible to get the 
hundreds of millions of documents reformatted at this date.  Think of generic 
HTML on the Web as just another binary data type rather than as text, and 
you'll see how to deal with it.  To forward this HTML through email safely, 
you are certainly have to encode it before transmission. If the bags of 
bits get messed with, the security checksums will clearly fail.

What this means, is that there are two possible cases:
	1) the HTML to be mailed could be reformatted to email rules, if no 
	security measures have been applied...  
	2) the HTML gets encoded and transmitted, and treated as a bag
	of bits, if it has any security wrappers.

Easier might be just to say that all HTML forwarded via email must be
encoded before transfer (at the cose of some bytes on the wire, but
mail isn't real time), rather than having two cases to handle.  Any HTML
renderer people use is likely to be willing to accept HTTP relaxed rules
for line length, line termination and URL length.

The world is a jungle out there...  Ugh...

>  
>  > >  I think it would be very nice if the HTTP spec referenced the MHTML
>  > >  spec saying something like this:
>  > >
>  > >     When aggregates of linked objects (for example HTML plus in-line
>  >images)
>  > >     are transported, they should be formatted using the MHTML standard.
>  > >
>  >
>  > This is a simple plug for MHTML, and doesn't address the issue for which
>  > it is intended (to let implementers of 2068 be aware that there are two
>  > definitions for Content-Base out there; the 2068 definition and the MHTML
>  > definition).  I don't think that the Microsoft Word world would necessarily
>  > appreciate that language, for example.  It is inappropriate for a protocol
>  > specification.
>  
>  The importance is that there should not be two, possibly conflicting
>  standards for the format of body contents (what comes after the end
>  of the HTTP heading). By referencing MHTML and MIME, you just say that
>  the format of body contents is not specifed by the HTTP specs but by
>  the MIME and MHTML specs. This is already implied in the HTTP 1.1 spec
>  in the following quote:
>  
>     3.7.1	Canonicalization and Text Defaults
>  
>     Internet media types are registered with a canonical form. In
>     general, an entity-body transferred via HTTP messages MUST be
>     represented in the appropriate canonical form prior to its
>     transmission; the exception is "text" types, as defined in
>     the next paragraph.
>  
>     When in canonical form, media subtypes of the "text" type use
>     CRLF as the text line break. HTTP relaxes this requirement and
>     allows the transport of text media with plain CR or LF alone
>     representing a line break when it is done consistently for an
>     entire entity-body. HTTP applications MUST accept CRLF, bare CR,
>     and bare LF as being representative of a line break in text
>     media received via HTTP. In addition, if the text is
>     represented in a character set that does not use octets 13 and
>     10 for CR and LF respectively, as is the case for some
>     multi-byte character sets, HTTP allows the use of whatever
>     octet sequences are defined by that character set to represent
>     the equivalent of CR and LF for line breaks. This flexibility
>     regarding line breaks applies only to text media in the
>     entity-body; a bare CR or LF MUST NOT be substituted for
>     CRLF within any of the HTTP control structures (such as
>     header fields and multipart boundaries).
>  

Yes, but as I note above, HTTP has relaxed these requirements (or rather, 
the reality of what people have put up as content on the Web has forced 
us to acknowledge the relaxation of the MIME text requirements, at least 
as used by HTTP).  

>  > What I need is to know how to reference Content-Base in MHTML, so that when
>  > I delete Content-Base from the HTTP spec, I can put a reference to the MHTML
>  > spec as well as 2068 (needs to go in the "Changes from RFC 2068" section
>  > I have written for the next HTTP draft).  E.g. title of the RFC, authors,
>  > what section of MHTML is appropriate, etc.
>  
>  You might also mention that Content-Location is used both in MHTML and
>  in HTTP in similar ways, but that the exact usage should be taken from
>  the appropriate standard.
>  
>  The present official version of MHTML is in RFC 2110:
>  
>  2110 MIME E-mail Encapsulation of Aggregate Documents, such as HTML
>       (MHTML). J. Palme, A. Hopmann. March 1997. (Format: TXT=41961 bytes)
>       (Status: PROPOSED STANDARD)
>  
>  We will submit a new, revised version soon, whether it will be before
>  or after the HTTP 1.1 draft standard is difficult to say. Best, of course,
>  is if you reference the new version. 

> There are no big differences, but
>  there are differences. For example, in RFC 2110, Content-Base is only
>  valid for its immediate content, not for subparts. In the new spec,
>  Content-Base can give a base also for subparts (unless these have
>  their own base indication). The description of how relative URLs
>  are handled is also slightly different. Most of these changes have
>  been made in order to better agree with the new RFC 1808.
>  
>  The present reference to the new version of RFC 2110 is
>  MIME Encapsulation of Aggregate Documents, such as HTML (MHTML).
>  J. Palme, A. Hopmann, N. Shellness, November 1997.
>  
>  If you reference that document, the RFC editor will change the
>  reference to the RFC number of our new version of the MHTML
>  proposed standard.
>  
>  MTHML is also based on two other proposed standards, but I do not think
>  you need reference them. They are:
>  
>  2112 The MIME Multipart/Related Content-type. E. Levinson. February
>       1997. (Format: TXT=17052 bytes) (Obsoletes RFC1872) (Status: PROPOSED
>       STANDARD)
>  
>  2111 Content-ID and Message-ID Uniform Resource Locators. E. Levinson.
>       February 1997. (Format: TXT=9099 bytes) (Status: PROPOSED STANDARD)
>  
>  --- --- ---

Thanks.  The reference won't be normative in the HTTP case, so I don't think
it matters too much.

The RFC editor generally updates references to the latest document
as part of his/her work.

--
Jim Gettys
Industry Standards and Consortia
Digital Equipment Corporation
Visting Scientist, World Wide Web Consortium, M.I.T.
http://www.w3.org/People/Gettys/
jg@w3.org, jg@pa.dec.com
Received on Monday, 26 January 1998 09:21:37 UTC