Re: Unknown text/* subtypes [i20] from Julian Reschke on 2008-02-14 (ietf-http-wg@w3.org from January to March 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 14 Feb 2008 14:48:20 +0100
To: "Roy T. Fielding" <fielding@gbiv.com>
CC: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <47B446A4.70108@gmx.de>

Roy T. Fielding wrote:
> 
> On Feb 12, 2008, at 8:59 PM, Mark Nottingham wrote:
>> Roy, if you disagree with consensus on this issue, please suggest 
>> specific text to replace Julian's work.
> 
> It isn't consensus until the people who have to change their
> implementations agree to do so.  The change was applied in a way
> that I did not anticipate, which made it a new requirement on
> previously conforming implementations rather than a relaxation
> of the existing requirements.  The issue did not require that much.

And I don't think we realized that we might be *adding* requirements; the 
goals were (IMHO) to reduce inconsistency between specs and to allow what 
UAs actually do today. It seems the conclusion is that in this case these 
goals are contradictory.

>    http://www3.tools.ietf.org/wg/httpbis/trac/ticket/20
> 
> Here is the change that Julian made according to the issue:
> 
>    http://www3.tools.ietf.org/wg/httpbis/trac/changeset/209

...which was what the issue resolution asked for; so the mistake 
happened earlier.

In the meantime I have backed out the change 
(<http://www3.tools.ietf.org/wg/httpbis/trac/changeset/211>), and I 
propose re-opening the issue.

> ...
> And here is what I suggest for a rewrite, merging both of the above
> sections under Media Types and inverting the "fantasy island"
> requirements of the original text to what is permitted in HTTP
> beyond the registration defaults of MIME.
> 
> 2.3.1.  Canonicalization and Text Media Types
> 
>    Internet media types are registered with a canonical form and
>    defaults for the optional parameter values.  An ideal HTTP
>    entity-body would contain data formatted strictly according to that
>    canonical form.  However, HTTP does not require the sender to verify
>    that an entity-body is in canonical form prior to transfer.  Instead,
>    an HTTP recipient MUST be prepared to accept and properly interpret
>    several variances in the format of textual types, as described below,
>    and treat other variances as errors.

Good introduction.

>    The "charset" parameter (Section 2.1) is used with some media types
>    to indicate the character encoding of the data.  When a media type is
>    registered with a default charset value of "US-ASCII", it MAY be used
>    to label data transmitted via HTTP in the "iso-8859-1" charset (a
>    superset of US-ASCII) without including an explicit charset parameter
>    on the media type.  In addition, when a media type registered with a
>    default charset value of "US-ASCII" is received via HTTP without a
>    charset parameter or with a charset value of "iso-8859-1", the
>    recipient MAY inspect the data for indications of a different
>    character encoding and interpret the data accordingly if the encoding
>    is a superset of US-ASCII or if the encoding can be determined within
>    the first 16 octets of data and interpreted consistently thereafter.

Q: so if a text type defines a different default, such as "UTF-8", we 
disallow defaulting to ISO-8859-1 and sniffing?
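
To make the question concrete, here is how I read the charset rules
above, as a minimal Python sketch. The function name and return shape
are hypothetical (they are not from the draft), and the further
conditions on sniffing (encoding is a superset of US-ASCII, or can be
determined within the first 16 octets) are not modeled:

```python
from typing import Optional, Tuple


def effective_charset(registered_default: str,
                      charset_param: Optional[str]) -> Tuple[str, bool]:
    """Return (charset to assume, whether sniffing is permitted).

    Hypothetical helper: one reading of the proposed 2.3.1 text,
    not an implementation taken from any user agent.
    """
    default_is_ascii = registered_default.lower() == "us-ascii"

    if charset_param is None:
        if default_is_ascii:
            # HTTP relaxation: unlabeled data MAY be iso-8859-1,
            # and the recipient MAY inspect it for another encoding.
            return "iso-8859-1", True
        # Any other registered default: the text grants no exception.
        return registered_default.lower(), False

    declared = charset_param.lower()
    # An explicit iso-8859-1 label on a US-ASCII-default type
    # still leaves the recipient free to sniff.
    may_sniff = default_is_ascii and declared == "iso-8859-1"
    return declared, may_sniff
```

Under this reading, a type registered with a default of "UTF-8" gets
neither the ISO-8859-1 default nor permission to sniff.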

>       Note: The first variance is due to a significant portion of early
>       HTTP user agents not parsing media type parameters and instead
>       relying on a then-common default encoding of iso-8859-1.  As a
>       result, early server implementations avoided the use of charset
>       parameters and user agents evolved to "sniff" for new character
>       encodings as the Web expanded beyond iso-8859-1 content.  The
>       second variance is due to a certain popular user agent that
>       employed an unsafe encoding detection and switching algorithm
>       within documents that might contain user-provided data (see
>       Section security.sniffing), the most common workaround for which
>       is to supply a specific charset parameter even when the actual
>       character encoding is unknown.

Q: so specifying ISO-8859-1 was a countermeasure to broken content 
sniffing? Is that version of that UA still common enough to justify this 
exception?

>    When in canonical form, media subtypes of the "text" type use CRLF as
>    the text line break.  However, it is also commonplace for such types
>    to be transmitted in HTTP with CR or LF alone indicating a line
>    break and occasional for such types to be transmitted with a
>    character encoding that requires some other set of octet sequence(s)
>    to indicate a line break.  HTTP recipients MUST accept and properly
>    interpret CRLF, bare CR, and bare LF as indicating a line break when
>    encountered within an entity-body received via HTTP that is labeled
>    as a text type and provided in a character encoding that allows CRLF
>    to indicate a line break.
> 
>       Note: Line breaks are specified in MIME with the expectation that
>       they are enforced during email message composition, when it is
>       scalable to ensure that every octet is placed in canonical form,
>       and with the anticipation that a message may be transmitted or
>       processed using line-oriented protocols.  HTTP message generation,
>       in contrast, is usually performed at high speed, encloses data
>       that cannot be modified without also altering its metadata, and
>       is processed using length-delimited protocols.

Ok.
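
For illustration, the recipient-side requirement above could be met
with something as small as the following Python sketch. The function
name is hypothetical, and it assumes a character encoding in which CR
and LF are the single octets 0x0D and 0x0A (that is, one "that allows
CRLF to indicate a line break"):

```python
import re
from typing import List

# CRLF is listed first so that it is consumed as one line break
# rather than as a bare CR followed by a bare LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def split_text_lines(body: bytes) -> List[bytes]:
    """Split a text/* entity-body on CRLF, bare CR, or bare LF."""
    return _LINE_BREAK.split(body)
```

Encodings that use other octet sequences for line breaks (UTF-16, for
example) would need to be decoded first; this sketch does not cover
them.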

Sounds good to me, except for:

- Q: do we need to state that we overrule RFC3023 (WRT default charset 
for text/xml)?

- we still need the security consideration WRT sniffing.

> ...

BR, Julian
Received on Thursday, 14 February 2008 13:48:38 UTC