- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 13 Feb 2008 19:39:57 -0800
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Julian Reschke <julian.reschke@gmx.de>, HTTP Working Group <ietf-http-wg@w3.org>
On Feb 12, 2008, at 8:59 PM, Mark Nottingham wrote:
> Roy, if you disagree with consensus on this issue, please suggest
> specific text to replace Julian's work.
It isn't consensus until the people who have to change their
implementations agree to do so. The change was applied in a way
that I did not anticipate, which made it a new requirement on
previously conforming implementations rather than a relaxation
of the existing requirements. The issue did not require that much.
http://www3.tools.ietf.org/wg/httpbis/trac/ticket/20
Here is the change that Julian made according to the issue:
http://www3.tools.ietf.org/wg/httpbis/trac/changeset/209
[2.1.1 is deleted; the last para of 2.3.1 is replaced with
HTTP/1.1 recipients MUST respect the charset label provided by the
sender; and those user agents that have a provision to "guess" a
charset MUST use the charset from the content-type field if they
support that charset, rather than the recipient's preference, when
initially displaying a document.
]
Here is what it said in p3 before that change:
2.1.1. Missing Charset
Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess."
Senders wishing to defeat this behavior MAY include a charset
parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
SHOULD do so when it is known that it will not confuse the
recipient.
Unfortunately, some older HTTP/1.0 clients did not deal properly
with an explicit charset parameter. HTTP/1.1 recipients MUST respect
the charset label provided by the sender; and those user agents that
have a provision to "guess" a charset MUST use the charset from the
content-type field if they support that charset, rather than the
recipient's preference, when initially displaying a document. See
Section 2.3.1.
...
2.3.1. Canonicalization and Text Defaults
Internet media types are registered with a canonical form. An
entity-body transferred via HTTP messages MUST be represented in the
appropriate canonical form prior to its transmission except for
"text" types, as defined in the next paragraph.
When in canonical form, media subtypes of the "text" type use CRLF as
the text line break. HTTP relaxes this requirement and allows the
transport of text media with plain CR or LF alone representing a line
break when it is done consistently for an entire entity-body. HTTP
applications MUST accept CRLF, bare CR, and bare LF as being
representative of a line break in text media received via HTTP. In
addition, if the text is represented in a character set that does not
use octets 13 and 10 for CR and LF respectively, as is the case for
some multi-byte character sets, HTTP allows the use of whatever octet
sequences are defined by that character set to represent the
equivalent of CR and LF for line breaks. This flexibility regarding
line breaks applies only to text media in the entity-body; a bare CR
or LF MUST NOT be substituted for CRLF within any of the HTTP control
structures (such as header fields and multipart boundaries).
If an entity-body is encoded with a content-coding, the underlying
data MUST be in a form defined above prior to being encoded.
The "charset" parameter is used with some media types to define the
character set (Section 2.1) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
Section 2.1.1 for compatibility problems.
================
And here is what I suggest for a rewrite, merging both of the above
sections under Media Types and inverting the "fantasy island"
requirements of the original text to what is permitted in HTTP
beyond the registration defaults of MIME.
2.3.1. Canonicalization and Text Media Types
Internet media types are registered with a canonical form and
defaults for the optional parameter values. An ideal HTTP
entity-body would contain data formatted strictly according to that
canonical form. However, HTTP does not require the sender to verify
that an entity-body is in canonical form prior to transfer. Instead,
an HTTP recipient MUST be prepared to accept and properly interpret
several variances in the format of textual types, as described below,
and treat other variances as errors.
The "charset" parameter (Section 2.1) is used with some media types
to indicate the character encoding of the data. When a media type is
registered with a default charset value of "US-ASCII", it MAY be used
to label data transmitted via HTTP in the "iso-8859-1" charset (a
superset of US-ASCII) without including an explicit charset parameter
on the media type. In addition, when a media type registered with a
default charset value of "US-ASCII" is received via HTTP without a
charset parameter or with a charset value of "iso-8859-1", the
recipient MAY inspect the data for indications of a different
character encoding and interpret the data accordingly if the encoding
is a superset of US-ASCII or if the encoding can be determined within
the first 16 octets of data and interpreted consistently thereafter.
Note: The first variance is due to a significant portion of early
HTTP user agents not parsing media type parameters and instead
relying on a then-common default encoding of iso-8859-1. As a
result, early server implementations avoided the use of charset
parameters and user agents evolved to "sniff" for new character
encodings as the Web expanded beyond iso-8859-1 content. The
second variance is due to a certain popular user agent that
employed an unsafe encoding detection and switching algorithm
within documents that might contain user-provided data (see
Section security.sniffing), the most common workaround for which
is to supply a specific charset parameter even when the actual
character encoding is unknown.
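
To make the second variance concrete, here is a minimal Python sketch of
how a recipient might pick an encoding under the proposed text. It is an
illustration only, not part of the proposal: the function name
choose_charset is hypothetical, and the use of byte-order marks as the
"indication of a different character encoding" is an assumption, since
the text above does not say how detection is to be done. An explicit
label other than iso-8859-1 is returned untouched, and only the first 16
octets are examined.

```python
import codecs
from email.message import Message

# Byte-order marks to look for (an assumed detection method). The 4-octet
# UTF-32 marks come first so that UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def choose_charset(content_type: str, body: bytes) -> str:
    """Return the encoding a recipient could use for a text entity-body."""
    msg = Message()
    msg["Content-Type"] = content_type
    label = msg.get_content_charset()  # lowercased label, or None if absent

    # An explicit label other than iso-8859-1 is used as given.
    if label is not None and label != "iso-8859-1":
        return label

    # No label, or iso-8859-1: inspection is permitted, but this sketch
    # looks only at the first 16 octets of the data.
    head = body[:16]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name

    # Nothing detected: fall back to the HTTP default.
    return "iso-8859-1"
```

For example, choose_charset("text/plain", b"hello") falls back to
"iso-8859-1", while a body that begins with the UTF-8 byte-order mark is
reported as "utf-8" when the charset parameter is absent or is
iso-8859-1.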
When in canonical form, media subtypes of the "text" type use CRLF as
the text line break. However, it is also commonplace for such types
to be transmitted in HTTP with CR or LF alone indicating a line
break and occasional for such types to be transmitted with a
character encoding that requires some other set of octet sequence(s)
to indicate a line break. HTTP recipients MUST accept and properly
interpret CRLF, bare CR, and bare LF as indicating a line break when
encountered within an entity-body received via HTTP that is labeled
as a text type and provided in a character encoding that allows CRLF
to indicate a line break.
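
The recipient-side requirement in the paragraph above can be sketched in
a few lines of Python. This is an illustration under stated assumptions,
not proposed spec text: the name split_text_lines is hypothetical, and
the sketch assumes the body is already known to be a text type in an
encoding where CR and LF are octets 13 and 10 (for example US-ASCII,
iso-8859-1, or UTF-8).

```python
import re
from typing import List

# CRLF is listed first so that a CR immediately followed by LF counts as
# one line break rather than two.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_text_lines(body: bytes) -> List[bytes]:
    """Split a text entity-body on CRLF, bare CR, or bare LF."""
    return _LINE_BREAK.split(body)
```

A body that mixes all three conventions, such as b"a\r\nb\rc\nd", splits
into the four lines a, b, c, and d. The same treatment must not be
applied to header fields or multipart boundaries, where CRLF remains
required.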
Note: Line breaks are specified in MIME with the expectation that
they are enforced during email message composition, when it is
scalable to ensure that every octet is placed in canonical form,
and with the anticipation that a message may be transmitted or
processed using line-oriented protocols. HTTP message generation,
in contrast, is usually performed at high speed, encloses data
that cannot be modified without also altering its metadata, and
is processed using length-delimited protocols.
=====================
> In the future, when you don't agree with emerging consensus, I'd
> appreciate it if you tell us as soon as is practical.
This is as soon as practical. The last discussion of it took place
the day before I got hit by the bronchitis fever, and I did disagree
with the proposal at that time.
....Roy
Received on Thursday, 14 February 2008 03:39:38 UTC