Re: Unknown text/* subtypes [i20]

My .02 -

Overall, I'm not intensely happy with this, but it does seem like the  
most practical way forward.

My biggest concern is that it imposes some fairly wide-reaching
MUST-level requirements. Would downgrading them to SHOULD be workable?
Can we refine their targets, e.g., instead of applying them to all
recipients, target them at user-agents (and maybe origin servers too)
as recipients?

We also still need the security considerations text regarding UTF-7,
correct?

A few more editorial comments inline -

On 14/02/2008, at 2:39 PM, Roy T. Fielding wrote:
>
> 2.3.1.  Canonicalization and Text Media Types
>
>   Internet media types are registered with a canonical form and
>   defaults for the optional parameter values.  An ideal HTTP
>   entity-body would contain data formatted strictly according to that
>   canonical form.  However, HTTP does not require the sender to verify
>   that an entity-body is in canonical form prior to transfer.  Instead,
>   an HTTP recipient MUST be prepared to accept and properly interpret
>   several variances in the format of textual types, as described below,
>   and treat other variances as errors.

This is a MUST, but the requirements about encoding below are MAYs,  
which is a bit odd...

>   The "charset" parameter (Section 2.1) is used with some media types
>   to indicate the character encoding of the data.  When a media type is
>   registered with a default charset value of "US-ASCII", it MAY be used
>   to label data transmitted via HTTP in the "iso-8859-1" charset (a
>   superset of US-ASCII) without including an explicit charset parameter
>   on the media type.

This sentence doesn't read well; what is 'it'? 'label' is also not
quite right; I suggest 'indicate'. Also, who does the MAY apply to?

>  In addition, when a media type registered with a
>   default charset value of "US-ASCII" is received via HTTP without a
>   charset parameter or with a charset value of "iso-8859-1", the
>   recipient MAY inspect the data for indications of a different
>   character encoding and interpret the data accordingly if the encoding
>   is a superset of US-ASCII or if the encoding can be determined within
>   the first 16 octets of data and interpreted consistently thereafter.

This sentence is also very difficult to parse. It may help to insert
another MAY between 'and' and 'interpret'.
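
To make sure I'm reading the intent correctly, here is a rough sketch of
what I think a recipient would do under this paragraph. It is only my
interpretation, not proposed spec text: the function name
effective_charset and the choice to look only for byte order marks are
my own assumptions, and a real user agent would likely inspect more
than that.

```python
import codecs

# Byte order marks a recipient might look for. UTF-32 is listed before
# UTF-16 because the UTF-32-LE BOM begins with the same two octets as
# the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def effective_charset(charset_param, body):
    """Pick the charset for a text/* body (hypothetical helper).

    charset_param is the charset parameter from Content-Type, or None
    if it was absent. body is the entity-body as bytes.
    """
    # Any explicit charset other than iso-8859-1 is taken at face value.
    if charset_param is not None and charset_param.lower() != "iso-8859-1":
        return charset_param.lower()

    # Absent or iso-8859-1: the recipient MAY inspect the data, but only
    # the first 16 octets are considered here.
    head = body[:16]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name

    # Nothing detected: fall back to the HTTP default.
    return "iso-8859-1"
```

If that matches the intent, then the two MAYs (inspect, then interpret)
really are separate decisions, which is why I'd like them spelled out
separately.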

>      Note: The first variance is due to a significant portion of early
>      HTTP user agents not parsing media type parameters and instead
>      relying on a then-common default encoding of iso-8859-1.  As a
>      result, early server implementations avoided the use of charset
>      parameters and user agents evolved to "sniff" for new character
>      encodings as the Web expanded beyond iso-8859-1 content.  The
>      second variance is due to a certain popular user agent that
>      employed an unsafe encoding detection and switching algorithm
>      within documents that might contain user-provided data (see
>      Section security.sniffing), the most common workaround for which
>      is to supply a specific charset parameter even when the actual
>      character encoding is unknown.
>
>   When in canonical form, media subtypes of the "text" type use CRLF as
>   the text line break.  However, it is also commonplace for such types
>   to be transmitted in HTTP with CR or LF alone indicating a line
>   break and occasional for such types to be transmitted with a
>   character encoding that requires some other set of octet sequence(s)
>   to indicate a line break.  HTTP recipients MUST accept and properly
>   interpret CRLF, bare CR, and bare LF as indicating a line break when
>   encountered within an entity-body received via HTTP that is labeled
>   as a text type and provided in a character encoding that allows CRLF
>   to indicate a line break.
>
>      Note: Line breaks are specified in MIME with the expectation that
>      they are enforced during email message composition, when it is
>      scalable to ensure that every octet is placed in canonical form,
>      and with the anticipation that a message may be transmitted or
>      processed using line-oriented protocols.  HTTP message generation,
>      in contrast, is usually performed at high speed, encloses data
>      that cannot be modified without also altering its metadata, and
>      is processed using length-delimited protocols.
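
For what it's worth, the line break requirement above seems cheap for a
recipient to meet when the encoding is an ASCII superset. A minimal
sketch, assuming the body has already been read in full and its charset
is known (the name split_text_lines is mine, purely for illustration):

```python
import re

# CRLF must be tried before bare CR so that a CRLF pair counts as one
# line break rather than two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_text_lines(body, charset="iso-8859-1"):
    """Split a text/* entity-body into lines (hypothetical helper).

    Treats CRLF, bare CR, and bare LF each as a single line break.
    """
    text = body.decode(charset)
    return _LINE_BREAK.split(text)
```

For example, split_text_lines(b"a\r\nb\rc\nd") gives four lines, one
per letter, and an empty line between two consecutive CRLFs is
preserved.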


--
Mark Nottingham     http://www.mnot.net/

Received on Tuesday, 11 March 2008 05:51:08 UTC