Re: text/* types and charset defaults from Julian Reschke on 2008-01-20 (ietf-http-wg@w3.org from January to March 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sun, 20 Jan 2008 12:28:41 +0100
To: 'HTTP Working Group' <ietf-http-wg@w3.org>
Message-ID: <47933069.6090908@gmx.de>

Larry Masinter wrote:
> "If we couldn't fix it then, why do you imagine you can fix it now?"

Depends on the definition of "fixing" :-)

Given almost 10 additional years of experience, and observing what 
software actually does today, we really have sufficient reason to 
improve what the spec says.

> I'm arguing for documenting current practice, making some recommendations
> for safe behavior, and moving on.

+1

> We certainly *wanted* to change the default charset for HTTP when working on
> 2616 but couldn't find a way around the impasse between backward
> compatibility, client sniffing, server misconfiguration et al.

So let's assume we remove 
<http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p3-payload-01.html#rfc.section.2.3.1.p.4>:

"The "charset" parameter is used with some media types to define the 
character set (Section 2.1) of the data. When no explicit charset 
parameter is provided by the sender, media subtypes of the "text" type 
are defined to have a default charset value of "ISO-8859-1" when 
received via HTTP. Data in character sets other than "ISO-8859-1" or its 
subsets MUST be labeled with an appropriate charset value. See Section 
2.1.1 for compatibility problems."

Would that break anything in practice today?
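
To make the difference concrete, here is a minimal Python sketch of what
a receiver would see with and without that paragraph. The function name
declared_charset and the legacy_default flag are made up for
illustration and do not come from the draft. The sketch only reads the
label that was sent; it does no sniffing.

```python
from email.message import Message


def declared_charset(content_type, legacy_default=False):
    """Return the charset a Content-Type value declares, or None.

    legacy_default=True models the paragraph quoted above: text/*
    without an explicit charset is treated as ISO-8859-1.
    legacy_default=False models the proposal: no label, no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # get_content_charset() returns the charset parameter in lower
    # case, or None when the parameter is absent.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset

    if legacy_default and msg.get_content_maintype() == "text":
        return "iso-8859-1"

    # Unknown: the caller decides what to do next.
    return None


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/plain", legacy_default=True))
print(declared_charset("text/plain"))
```

With the paragraph removed, the last call reports "unknown" instead of
claiming ISO-8859-1, which is closer to what the sender actually said.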

> I think the main thing to do is to document the actual situation
> sufficiently such that new HTTP implementations don't break things:
> 
> 1) servers (senders): don't make up a charset if you don't know what it is
> (this is a good rule for any kind of descriptive information, isn't it?). 

Right.

> 2) clients (receivers): servers (senders) are unfortunately often
> misconfigured and will label things with the wrong charset. (This is often
> because lots of software uses 'mime type' when what's wanted is usually
> 'content type' and the parameters get lost). But guessing blindly and
> ignoring what the server sent seems like a bad idea, and even has security
> implications. So "beware".

Right; maybe also point to <http://www.w3.org/2001/tag/doc/mime-respect>.

> 3) everybody: even if you agree about charset, accept other end-of-line
> terminations (not just CRLF which MIME required.)

<http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p3-payload-01.html#rfc.section.2.3.1.p.2> 
currently says:

"When in canonical form, media subtypes of the "text" type use CRLF as 
the text line break. HTTP relaxes this requirement and allows the 
transport of text media with plain CR or LF alone representing a line 
break when it is done consistently for an entire entity-body. HTTP 
applications MUST accept CRLF, bare CR, and bare LF as being 
representative of a line break in text media received via HTTP. In 
addition, if the text is represented in a character set that does not 
use octets 13 and 10 for CR and LF respectively, as is the case for some 
multi-byte character sets, HTTP allows the use of whatever octet 
sequences are defined by that character set to represent the equivalent 
of CR and LF for line breaks. This flexibility regarding line breaks 
applies only to text media in the entity-body; a bare CR or LF MUST NOT 
be substituted for CRLF within any of the HTTP control structures (such 
as header fields and multipart boundaries)."

Does this need fixing?
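
For reference, the "MUST accept CRLF, bare CR, and bare LF" part is
cheap for a receiver to implement. A rough Python sketch follows,
assuming a charset in which CR and LF really are octets 13 and 10
(so not the multi-byte case the paragraph mentions). The name
split_text_lines is hypothetical.

```python
import re

# CRLF is listed first so that it counts as one line break rather
# than a bare CR followed by a bare LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_text_lines(body):
    """Split a text entity-body on CRLF, bare CR, or bare LF.

    Assumes CR and LF are encoded as octets 13 and 10. A trailing
    line break yields an empty final element.
    """
    return _LINE_BREAK.split(body)


print(split_text_lines(b"one\r\ntwo\rthree\nfour"))
```

This applies to the entity-body only; header fields and multipart
boundaries still need CRLF, as the quoted text says.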

> (It's really receiver & sender, not client & server, since the rules should
> apply for file upload as well as anything else.)

Correct.

So my proposal would be:

- drop paragraph 4 (ISO-8859-1),

- add a note covering Larry's points 1) and 2), and

- mention this is a normative change in 
<http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p3-payload-01.html#changes.from.rfc.2616>.

BR, Julian
Received on Sunday, 20 January 2008 11:29:04 UTC