- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 13 Feb 2008 19:39:57 -0800
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Julian Reschke <julian.reschke@gmx.de>, HTTP Working Group <ietf-http-wg@w3.org>
On Feb 12, 2008, at 8:59 PM, Mark Nottingham wrote:

> Roy, if you disagree with consensus on this issue, please suggest
> specific text to replace Julian's work.

It isn't consensus until the people who have to change their implementations agree to do so. The change was applied in a way that I did not anticipate, which made it a new requirement on previously conforming implementations rather than a relaxation of the existing requirements. The issue did not require that much.

http://www3.tools.ietf.org/wg/httpbis/trac/ticket/20

Here is the change that Julian made according to the issue:

http://www3.tools.ietf.org/wg/httpbis/trac/changeset/209

[2.1.1 is deleted; the last para of 2.3.1 is replaced with

   HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document.
]

Here is what it said in p3 before that change:

   2.1.1. Missing Charset

   Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and SHOULD do so when it is known that it will not confuse the recipient.

   Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See Section 2.3.1.

   ...

   2.3.1. Canonicalization and Text Defaults

   Internet media types are registered with a canonical form.
   An entity-body transferred via HTTP messages MUST be represented in the appropriate canonical form prior to its transmission except for "text" types, as defined in the next paragraph.

   When in canonical form, media subtypes of the "text" type use CRLF as the text line break. HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body. HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP. In addition, if the text is represented in a character set that does not use octets 13 and 10 for CR and LF respectively, as is the case for some multi-byte character sets, HTTP allows the use of whatever octet sequences are defined by that character set to represent the equivalent of CR and LF for line breaks. This flexibility regarding line breaks applies only to text media in the entity-body; a bare CR or LF MUST NOT be substituted for CRLF within any of the HTTP control structures (such as header fields and multipart boundaries).

   If an entity-body is encoded with a content-coding, the underlying data MUST be in a form defined above prior to being encoded.

   The "charset" parameter is used with some media types to define the character set (Section 2.1) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See Section 2.1.1 for compatibility problems.

================

And here is what I suggest for a rewrite, merging both of the above sections under Media Types and inverting the "fantasy island" requirements of the original text to what is permitted in HTTP beyond the registration defaults of MIME.

   2.3.1. Canonicalization and Text Media Types

   Internet media types are registered with a canonical form and defaults for the optional parameter values. An ideal HTTP entity-body would contain data formatted strictly according to that canonical form. However, HTTP does not require the sender to verify that an entity-body is in canonical form prior to transfer. Instead, an HTTP recipient MUST be prepared to accept and properly interpret several variances in the format of textual types, as described below, and treat other variances as errors.

   The "charset" parameter (Section 2.1) is used with some media types to indicate the character encoding of the data. When a media type is registered with a default charset value of "US-ASCII", it MAY be used to label data transmitted via HTTP in the "iso-8859-1" charset (a superset of US-ASCII) without including an explicit charset parameter on the media type. In addition, when a media type registered with a default charset value of "US-ASCII" is received via HTTP without a charset parameter or with a charset value of "iso-8859-1", the recipient MAY inspect the data for indications of a different character encoding and interpret the data accordingly if the encoding is a superset of US-ASCII or if the encoding can be determined within the first 16 octets of data and interpreted consistently thereafter.

   Note: The first variance is due to a significant portion of early HTTP user agents not parsing media type parameters and instead relying on a then-common default encoding of iso-8859-1. As a result, early server implementations avoided the use of charset parameters and user agents evolved to "sniff" for new character encodings as the Web expanded beyond iso-8859-1 content.
   The second variance is due to a certain popular user agent that employed an unsafe encoding detection and switching algorithm within documents that might contain user-provided data (see Section security.sniffing), the most common workaround for which is to supply a specific charset parameter even when the actual character encoding is unknown.

   When in canonical form, media subtypes of the "text" type use CRLF as the text line break. However, it is also commonplace for such types to be transmitted in HTTP with CR or LF alone indicating a line break and occasional for such types to be transmitted with a character encoding that requires some other set of octet sequence(s) to indicate a line break. HTTP recipients MUST accept and properly interpret CRLF, bare CR, and bare LF as indicating a line break when encountered within an entity-body received via HTTP that is labeled as a text type and provided in a character encoding that allows CRLF to indicate a line break.

   Note: Line breaks are specified in MIME with the expectation that they are enforced during email message composition, when it is scalable to ensure that every octet is placed in canonical form, and with the anticipation that a message may be transmitted or processed using line-oriented protocols. HTTP message generation, in contrast, is usually performed at high speed, encloses data that cannot be modified without also altering its metadata, and is processed using length-delimited protocols.

=====================

> In the future, when you don't agree with emerging consensus, I'd
> appreciate it if you tell us as soon as is practical.

This is as soon as practical. The last discussion of it took place the day before I got hit by the bronchitis fever, and I did disagree with the proposal at that time.

....Roy
Received on Thursday, 14 February 2008 03:39:38 UTC
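
For readers who want to see the recipient-side rules in concrete form, here is a minimal Python sketch of two behaviors quoted in the message: the ISO-8859-1 default for "text" types that arrive without a charset parameter (from the earlier p3 wording), and acceptance of CRLF, bare CR, and bare LF as line breaks. The function names `effective_charset` and `split_text_lines` are hypothetical and come from neither draft. The sketch makes no attempt at the optional content inspection ("sniffing") that the proposed rewrite would permit.

```python
import re
from email.message import Message
from typing import List, Optional

# CRLF is listed first so that "\r\n" counts as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def effective_charset(content_type: str) -> Optional[str]:
    """Return the charset a recipient would assume for a Content-Type value.

    Assumption: an explicit charset label always wins; a "text" type
    without a label falls back to iso-8859-1; other types get no default.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased value, or None if absent
    if charset is not None:
        return charset
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None


def split_text_lines(body: bytes, charset: str = "iso-8859-1") -> List[str]:
    """Decode a text entity-body and split it on CRLF, bare CR, or bare LF."""
    return _LINE_BREAK.split(body.decode(charset))
```

For example, `split_text_lines(b"a\r\nb\rc\nd")` yields four lines regardless of which line-break convention the sender mixed in. Decoding before splitting sidesteps the case of encodings that do not use octets 13 and 10 for CR and LF.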