Re: Unknown text/* subtypes [i20] from Roy T. Fielding on 2008-02-14 (ietf-http-wg@w3.org from January to March 2008)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Thu, 14 Feb 2008 11:16:34 -0800
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <45165C9F-FFDC-4525-A77D-FFC8AAE2A404@gbiv.com>
On Feb 14, 2008, at 5:48 AM, Julian Reschke wrote:
> Roy T. Fielding wrote:
>> On Feb 12, 2008, at 8:59 PM, Mark Nottingham wrote:
>>> Roy, if you disagree with consensus on this issue, please suggest  
>>> specific text to replace Julian's work.
>> It isn't consensus until the people who have to change their
>> implementations agree to do so.  The change was applied in a way
>> that I did not anticipate, which made it a new requirement on
>> previously conforming implementations rather than a relaxation
>> of the existing requirements.  The issue did not require that much.
>
> And I don't think we realized that we may be *adding* requirements;  
> the goals were (IMHO) to reduce inconsistency of specs and to allow  
> what UAs indeed do today. It seems the conclusion is that in this  
> case these goals are contradictory.
>
>>    http://www3.tools.ietf.org/wg/httpbis/trac/ticket/20
>> Here is the change that Julian made according to the issue:
>>    http://www3.tools.ietf.org/wg/httpbis/trac/changeset/209
>
> ...which was what the issue resolution asked for; so the mistake  
> happened earlier.

I didn't mean to imply otherwise -- I seriously have no recollection, nor
saved messages (I normally save any message that calls for a change), of
the specific diff instructions included in that issue and that you applied.
I didn't even know the issue was considered as having a specific resolution
until after the changeset was received.  [We really should have a separate
status for that in trac.]

All that must have occurred during my personal black-out period.  I don't
have any problem with applying the change as suggested -- I have a problem
with any claim to calling it consensus (even an emerging consensus is a bit
of a stretch).  Silence is not consent.  Even had I been aware of the diff
instruction, the effect of the result of the change is not visible until
we look at the remaining words.  We removed all of the exceptions and then
were left with MUST be MIME.

> In the meantime I have backed out the change
> (<http://www3.tools.ietf.org/wg/httpbis/trac/changeset/211>), and I'd
> propose to re-open the issue.

Thanks.

>> ...
>> And here is what I suggest for a rewrite, merging both of the above
>> sections under Media Types and inverting the "fantasy island"
>> requirements of the original text to what is permitted in HTTP
>> beyond the registration defaults of MIME.
>>
>> 2.3.1.  Canonicalization and Text Media Types
>>
>>    Internet media types are registered with a canonical form and
>>    defaults for the optional parameter values.  An ideal HTTP
>>    entity-body would contain data formatted strictly according to that
>>    canonical form.  However, HTTP does not require the sender to verify
>>    that an entity-body is in canonical form prior to transfer.  Instead,
>>    an HTTP recipient MUST be prepared to accept and properly interpret
>>    several variances in the format of textual types, as described below,
>>    and treat other variances as errors.
>
> Good introduction.
>
>>    The "charset" parameter (Section 2.1) is used with some media types
>>    to indicate the character encoding of the data.  When a media type is
>>    registered with a default charset value of "US-ASCII", it MAY be used
>>    to label data transmitted via HTTP in the "iso-8859-1" charset (a
>>    superset of US-ASCII) without including an explicit charset parameter
>>    on the media type.  In addition, when a media type registered with a
>>    default charset value of "US-ASCII" is received via HTTP without a
>>    charset parameter or with a charset value of "iso-8859-1", the
>>    recipient MAY inspect the data for indications of a different
>>    character encoding and interpret the data accordingly if the encoding
>>    is a superset of US-ASCII or if the encoding can be determined within
>>    the first 16 octets of data and interpreted consistently thereafter.
>
> Q: so if a text type defines a different default, such as "UTF-8",  
> we disallow defaulting to ISO-8859-1 and sniffing?

Yes.  Error recovery is a different issue (more text on my to-do list).
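
To make the charset rule above concrete, here is a rough sketch in Python
of how a recipient might pick an encoding under the proposed text.  It is
only an illustration under stated assumptions: the function names, the
`registered_default` argument, and the decision to sniff for nothing more
than a byte-order mark within the first 16 octets are all hypothetical and
are not specified by the draft.

```python
import codecs

# Byte-order marks looked for in the first 16 octets (assumed sniffing
# strategy; the draft text does not say how to sniff). Longer marks first,
# because the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(body):
    """Return an encoding name if a known BOM starts the body, else None."""
    head = body[:16]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def effective_charset(registered_default, charset_param, body, sniff=True):
    """Choose the charset used to interpret an entity-body (hypothetical)."""
    default = registered_default.lower()
    param = charset_param.lower() if charset_param else None

    # A type registered with some other default (e.g. UTF-8) gets neither
    # the iso-8859-1 fallback nor sniffing.
    if default != "us-ascii":
        return param or default

    # An explicit charset other than iso-8859-1 is taken at face value.
    if param not in (None, "iso-8859-1"):
        return param

    # No parameter, or iso-8859-1: the recipient MAY sniff.
    if sniff:
        detected = sniff_bom(body)
        if detected:
            return detected
    return "iso-8859-1"
```

For example, `effective_charset("US-ASCII", None, body)` falls back to
iso-8859-1 unless a byte-order mark is found, while a type registered with
a UTF-8 default is never sniffed.
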

>>       Note: The first variance is due to a significant portion of early
>>       HTTP user agents not parsing media type parameters and instead
>>       relying on a then-common default encoding of iso-8859-1.  As a
>>       result, early server implementations avoided the use of charset
>>       parameters and user agents evolved to "sniff" for new character
>>       encodings as the Web expanded beyond iso-8859-1 content.  The
>>       second variance is due to a certain popular user agent that
>>       employed an unsafe encoding detection and switching algorithm
>>       within documents that might contain user-provided data (see
>>       Section security.sniffing), the most common workaround for which
>>       is to supply a specific charset parameter even when the actual
>>       character encoding is unknown.
>
> Q: so specifying ISO-8859-1 was a countermeasure to broken content  
> sniffing? Is that version of that UA still common enough to have  
> this exception?

MSIE 3-6?  Yes.  Apache received three more XSS reports last month, all due
to the same broken browser.  They resurfaced because I removed the default
charset setting last year.  I wouldn't be surprised if some other browsers
have since copied the same buggy behavior (you know how that works), but we
only get reports on one that I can recall.

>>    When in canonical form, media subtypes of the "text" type use CRLF as
>>    the text line break.  However, it is also commonplace for such types
>>    to be transmitted in HTTP with CR or LF alone indicating a line
>>    break and occasional for such types to be transmitted with a
>>    character encoding that requires some other set of octet sequence(s)
>>    to indicate a line break.  HTTP recipients MUST accept and properly
>>    interpret CRLF, bare CR, and bare LF as indicating a line break when
>>    encountered within an entity-body received via HTTP that is labeled
>>    as a text type and provided in a character encoding that allows CRLF
>>    to indicate a line break.
>>
>>       Note: Line breaks are specified in MIME with the expectation that
>>       they are enforced during email message composition, when it is
>>       scalable to ensure that every octet is placed in canonical form,
>>       and with the anticipation that a message may be transmitted or
>>       processed using line-oriented protocols.  HTTP message generation,
>>       in contrast, is usually performed at high speed, encloses data
>>       that cannot be modified without also altering its metadata, and
>>       is processed using length-delimited protocols.
>
> Ok.
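
As an illustration of the line-break rule proposed above (not code from
this thread), a recipient could decode the body first and then treat CRLF,
bare CR, and bare LF alike.  The function name and the iso-8859-1 default
are assumptions made for this sketch; decoding before splitting is one way
to stay within encodings that can represent CR and LF at all.

```python
import re

# CRLF must be tried before bare CR or bare LF so that it counts as a
# single line break rather than two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_text_lines(body, charset="iso-8859-1"):
    """Decode a text/* entity-body and split it on any accepted line break.

    Hypothetical helper: `charset` is whatever the recipient has already
    decided to use for this body.
    """
    text = body.decode(charset)
    return _LINE_BREAK.split(text)
```

A body that mixes all three conventions, such as `b"a\r\nb\rc\nd"`, comes
back as four lines, and the bytes on the wire are never rewritten.
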
>
> Sounds good to me, except for:
>
> - Q: do we need to state that we overrule RFC3023 (WRT default  
> charset for text/xml?)

No.  RFC 3023 defines a couple media types for email and is therefore
incapable of going beyond MIME's requirements on text/* (even though
most MIME clients nowadays do their own sniffing).

> - we still need the security consideration WRT sniffing.

Yep, ran out of time.

....Roy
Received on Thursday, 14 February 2008 19:16:47 UTC