Re: Unknown text/* subtypes [i20]

On Feb 12, 2008, at 8:59 PM, Mark Nottingham wrote:
> Roy, if you disagree with consensus on this issue, please suggest  
> specific text to replace Julian's work.

It isn't consensus until the people who have to change their
implementations agree to do so.  The change was applied in a way
that I did not anticipate, which made it a new requirement on
previously conforming implementations rather than a relaxation
of the existing requirements.  The issue did not require that much.

    http://www3.tools.ietf.org/wg/httpbis/trac/ticket/20

Here is the change that Julian made according to the issue:

    http://www3.tools.ietf.org/wg/httpbis/trac/changeset/209

[2.1.1 is deleted; the last para of 2.3.1 is replaced with

    HTTP/1.1 recipients MUST respect the charset label provided by the
    sender; and those user agents that have a provision to "guess" a
    charset MUST use the charset from the content-type field if they
    support that charset, rather than the recipient's preference, when
    initially displaying a document.
]

Here is what it said in p3 before that change:

2.1.1.  Missing Charset

    Some HTTP/1.0 software has interpreted a Content-Type header without
    charset parameter incorrectly to mean "recipient should guess."
    Senders wishing to defeat this behavior MAY include a charset
    parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
    SHOULD do so when it is known that it will not confuse the
    recipient.

    Unfortunately, some older HTTP/1.0 clients did not deal properly with
    an explicit charset parameter.  HTTP/1.1 recipients MUST respect the
    charset label provided by the sender; and those user agents that have
    a provision to "guess" a charset MUST use the charset from the
    content-type field if they support that charset, rather than the
    recipient's preference, when initially displaying a document.  See
    Section 2.3.1.
...

2.3.1.  Canonicalization and Text Defaults

    Internet media types are registered with a canonical form.  An
    entity-body transferred via HTTP messages MUST be represented in the
    appropriate canonical form prior to its transmission except for
    "text" types, as defined in the next paragraph.

    When in canonical form, media subtypes of the "text" type use CRLF as
    the text line break.  HTTP relaxes this requirement and allows the
    transport of text media with plain CR or LF alone representing a line
    break when it is done consistently for an entire entity-body.  HTTP
    applications MUST accept CRLF, bare CR, and bare LF as being
    representative of a line break in text media received via HTTP.  In
    addition, if the text is represented in a character set that does not
    use octets 13 and 10 for CR and LF respectively, as is the case for
    some multi-byte character sets, HTTP allows the use of whatever octet
    sequences are defined by that character set to represent the
    equivalent of CR and LF for line breaks.  This flexibility regarding
    line breaks applies only to text media in the entity-body; a bare CR
    or LF MUST NOT be substituted for CRLF within any of the HTTP control
    structures (such as header fields and multipart boundaries).

    If an entity-body is encoded with a content-coding, the underlying
    data MUST be in a form defined above prior to being encoded.

    The "charset" parameter is used with some media types to define the
    character set (Section 2.1) of the data.  When no explicit charset
    parameter is provided by the sender, media subtypes of the "text"
    type are defined to have a default charset value of "ISO-8859-1" when
    received via HTTP.  Data in character sets other than "ISO-8859-1" or
    its subsets MUST be labeled with an appropriate charset value.  See
    Section 2.1.1 for compatibility problems.

================

And here is what I suggest for a rewrite, merging both of the above
sections under Media Types and inverting the "fantasy island"
requirements of the original text to what is permitted in HTTP
beyond the registration defaults of MIME.

2.3.1.  Canonicalization and Text Media Types

    Internet media types are registered with a canonical form and
    defaults for the optional parameter values.  An ideal HTTP
    entity-body would contain data formatted strictly according to that
    canonical form.  However, HTTP does not require the sender to verify
    that an entity-body is in canonical form prior to transfer.  Instead,
    an HTTP recipient MUST be prepared to accept and properly interpret
    several variances in the format of textual types, as described below,
    and treat other variances as errors.

    The "charset" parameter (Section 2.1) is used with some media types
    to indicate the character encoding of the data.  When a media type is
    registered with a default charset value of "US-ASCII", it MAY be used
    to label data transmitted via HTTP in the "iso-8859-1" charset (a
    superset of US-ASCII) without including an explicit charset parameter
    on the media type.  In addition, when a media type registered with a
    default charset value of "US-ASCII" is received via HTTP without a
    charset parameter or with a charset value of "iso-8859-1", the
    recipient MAY inspect the data for indications of a different
    character encoding and interpret the data accordingly if the encoding
    is a superset of US-ASCII or if the encoding can be determined within
    the first 16 octets of data and interpreted consistently thereafter.

       Note: The first variance is due to a significant portion of early
       HTTP user agents not parsing media type parameters and instead
       relying on a then-common default encoding of iso-8859-1.  As a
       result, early server implementations avoided the use of charset
       parameters and user agents evolved to "sniff" for new character
       encodings as the Web expanded beyond iso-8859-1 content.  The
       second variance is due to a certain popular user agent that
       employed an unsafe encoding detection and switching algorithm
       within documents that might contain user-provided data (see
       Section security.sniffing), the most common workaround for which
       is to supply a specific charset parameter even when the actual
       character encoding is unknown.
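
To make the charset paragraph above concrete, here is a rough sketch of how
a recipient might pick an encoding under the proposed text.  It is only an
illustration: the function names are made up, and the detector is assumed
to be a simple byte-order-mark check within the first 16 octets, which the
proposed text does not mandate.

```python
import codecs

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def charset_param(content_type):
    """Return the lowercased charset parameter, or None if absent."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    return None


def sniff_prefix(body):
    """Hypothetical detector: look for a BOM in the first 16 octets only."""
    head = body[:16]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def effective_charset(content_type, body, registered_default="us-ascii"):
    """Choose an encoding for a received entity-body (illustrative only)."""
    label = charset_param(content_type)
    # Sniffing is permitted only when the type's registered default is
    # US-ASCII and the label is missing or says iso-8859-1.
    if registered_default == "us-ascii" and label in (None, "iso-8859-1"):
        return sniff_prefix(body) or "iso-8859-1"
    # Any other explicit label is taken at face value.
    return label or registered_default
```

For example, effective_charset("text/html", b"hello") falls back to
"iso-8859-1", while a body labeled Shift_JIS keeps that label even if it
happens to begin with a UTF-8 byte-order mark.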

    When in canonical form, media subtypes of the "text" type use CRLF as
    the text line break.  However, it is also commonplace for such types
    to be transmitted in HTTP with CR or LF alone indicating a line
    break and occasional for such types to be transmitted with a
    character encoding that requires some other set of octet sequence(s)
    to indicate a line break.  HTTP recipients MUST accept and properly
    interpret CRLF, bare CR, and bare LF as indicating a line break when
    encountered within an entity-body received via HTTP that is labeled
    as a text type and provided in a character encoding that allows CRLF
    to indicate a line break.

       Note: Line breaks are specified in MIME with the expectation that
       they are enforced during email message composition, when it is
       scalable to ensure that every octet is placed in canonical form,
       and with the anticipation that a message may be transmitted or
       processed using line-oriented protocols.  HTTP message generation,
       in contrast, is usually performed at high speed, encloses data
       that cannot be modified without also altering its metadata, and
       is processed using length-delimited protocols.
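
The line-break requirement above could be met by a recipient along these
lines.  This is a minimal sketch, assuming the body is in an encoding where
octets 13 and 10 mean CR and LF; the function name and the short list of
such encodings are illustrative, not part of the proposed text.

```python
# Encodings in which octets 13 and 10 are CR and LF (illustrative subset).
ASCII_COMPATIBLE = {"us-ascii", "iso-8859-1", "utf-8"}


def split_text_lines(body, charset="iso-8859-1"):
    """Split a text entity-body on CRLF, bare CR, or bare LF."""
    if charset.lower() not in ASCII_COMPATIBLE:
        # Other encodings need their own line-break octet sequences.
        raise ValueError("charset does not use octets 13/10 for CR/LF")
    # bytes.splitlines() treats b"\r\n", b"\r", and b"\n" as line breaks.
    return body.splitlines()
```

A body that mixes conventions, such as b"a\r\nb\rc\nd", comes back as four
lines without the sender having had to canonicalize anything.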

=====================

> In the future, when you don't agree with emerging consensus, I'd  
> appreciate it if you tell us as soon as is practical.

This is as soon as practical.  The last discussion of it took place
the day before I got hit by the bronchitis fever, and I did disagree
with the proposal at that time.

....Roy

Received on Thursday, 14 February 2008 03:39:38 UTC