Re: Unknown text/* subtypes [i20] from Frank Ellermann on 2008-02-12 (ietf-http-wg@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Tue, 12 Feb 2008 18:23:57 +0100
To: ietf-http-wg@w3.org
Message-ID: <foskkk$l4a$1@ger.gmane.org>

Julian Reschke wrote:

>| When in canonical form, media subtypes of the "text" type use
>| CRLF as the text line break. HTTP relaxes this requirement and
>| allows the transport of text media with plain CR or LF alone 
>| representing a line break when it is done consistently for an
>| entire entity-body. 

I'm not sure about this; it was found to be strange enough for a
dishonourable note in the future net-utf8 RFC.  I think what is
really going on is something like this:

| HTTP does not depend on this canonical line break in "text" types,
| and therefore does not require it in the content.

>| HTTP applications MUST accept CRLF, bare CR, and bare LF as being 
>| representative of a line break in text media received via HTTP.
>| In addition, if the text is represented in a character set that
>| does not use octets 13 and 10 for CR and LF respectively, as is
>| the case for some multi-byte character sets, HTTP allows the use
>| of whatever octet sequences are defined by that character set to 
>| represent the equivalent of CR and LF for line breaks.

I think that's beside the point.  AFAIK XML permits U+0085 NEL, and
text/xml exists (but maybe I confused XML 1.1 with XML 1.0 here).
Whether the charset uses octets 0D and 0A for U+000D and U+000A does
not necessarily affect octet 85 used as U+0085 in some legacy charsets.

HTTP does not really "allow" whatever represents a line break; it
simply does not "care" (within bodies or chunks).  How applications
interpret content is their business.  As far as HTTP is concerned,
applications cannot trust that text/* comes with a canonical CRLF.

What really matters for HTTP is the header (and anything else not
belonging to the content).  And *there* CRLF is of course REQUIRED.
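
The split between header and content can be sketched roughly as follows.
This is a minimal Python illustration, not a real HTTP parser; the
function names split_header_block and split_body_lines are made up for
the example, and the sample message deliberately mixes all three line
break styles in the body only to exercise the splitter:

```python
import re

# In a body, CRLF, bare CR, and bare LF each count as one line break.
# CRLF is listed first so it is not read as two separate breaks.
_BODY_BREAK = re.compile(r"\r\n|\r|\n")


def split_body_lines(text: str) -> list[str]:
    """Split decoded text/* content on CRLF, bare CR, or bare LF."""
    return _BODY_BREAK.split(text)


def split_header_block(raw: bytes) -> tuple[list[bytes], bytes]:
    """Split a raw message into header lines and body at the first CRLF CRLF."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no CRLF CRLF terminator after the header section")
    return head.split(b"\r\n"), body


message = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"one\ntwo\rthree\r\nfour"
)
headers, body = split_header_block(message)
lines = split_body_lines(body.decode("ascii"))
print(headers)
print(lines)
```

The header side insists on CRLF, while the body side takes whatever line
break arrives and leaves its interpretation to the application.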

> HTTP/1.1 recipients MUST respect the charset label provided by
> the sender

Please justify this MUST strictly following RFC 2119, or replace
it with a SHOULD.  Many HTTP servers (even including IANA and W3C)
get some content types, and where applicable their charsets, wrong.

> those user agents that have a provision to "guess" a charset 
> MUST use the charset from the content-type field

Please also justify this MUST using RFC 2119 terms.  Let's face
it: many HTTP servers are liars in practice, and do not deserve
too much respect.
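
One possible reading of a SHOULD here, sketched in Python under stated
assumptions: the labelled charset is tried first, and a recipient falls
back only when the label is unknown or the octets do not fit it.  The
function name decode_text_body and the UTF-8 fallback are hypothetical
choices for this example, not anything the drafts specify:

```python
from email.message import Message


def decode_text_body(
    content_type: str, body: bytes, fallback: str = "utf-8"
) -> tuple[str, str]:
    """Decode body with the labelled charset if it works, else a fallback.

    Returns (text, charset_actually_used).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    labelled = msg.get_content_charset()  # lower-cased charset, or None
    for candidate in (labelled, fallback):
        if candidate is None:
            continue
        try:
            return body.decode(candidate), candidate
        except (LookupError, UnicodeDecodeError):
            continue  # unknown label, or octets that do not fit it
    # Last resort: ISO-8859-1 maps every octet, so this cannot fail.
    return body.decode("latin-1"), "latin-1"


honest = decode_text_body("text/plain; charset=iso-8859-1", b"caf\xe9")
liar = decode_text_body("text/plain; charset=utf-8", b"caf\xe9")
print(honest, liar)
```

This only catches blatant mismatches.  A wrong label that still happens
to decode without error, such as ISO-8859-1 on UTF-8 content, goes
through unnoticed.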

 Frank

Received on Tuesday, 12 February 2008 17:22:51 UTC