
Re: Unknown text/* subtypes [i20]

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 14 Feb 2008 18:18:03 +0100
To: ietf-http-wg@w3.org
Message-ID: <fp1t12$n0e$1@ger.gmane.org>

Roy T. Fielding wrote:

 [p3 version] 
>| Some HTTP/1.0 software has interpreted a Content-Type header without
>| charset parameter incorrectly to mean "recipient should guess."
>| Senders wishing to defeat this behavior MAY include a charset
>| parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
>| SHOULD do so when it is known that it will not confuse the  
>| recipient.

When senders wish to defeat that s/behavior/misbehavior/ they
s/MAY/have to/ include a charset parameter.  It is not an
OPTION once the decision to defeat it has been made.

The optional part could be s/Senders wishing/Senders MAY wish/,
but a recommendation is clearer:  "Senders SHOULD avoid this
misbehavior by including a charset parameter." (period)
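For illustration, an explicitly labeled response could look like the
following sketch (the dict-of-headers shape and the sample body are
mine, not from any draft):

```python
# Sketch: a sender labeling text content explicitly with a charset
# parameter, rather than relying on the historical ISO-8859-1 default.
body = "Grüße".encode("iso-8859-1")
headers = {
    "Content-Type": "text/plain; charset=ISO-8859-1",
    "Content-Length": str(len(body)),
}
```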

Strike the "even when" Latin-1 blurb, as that's precisely the
cause of this historical mess.  Explicit Latin-1 really *is*
Latin-1, and where that is not the case (e.g. windows-1252)
it is broken.  Specifying varying degrees of brokenness is a
dubious idea, maybe put it in a note:

"Note:  Historically Latin-1 ([ISO-8859-1]) was a predominant
charset, and some senders explicitly announced this charset
even when it was incorrect."  The conclusions for readers and
implementors are obvious, there be dragons.

>| and SHOULD do so when it is known that it will not confuse
>| the recipient.

Strike that, nobody knows what confuses recipients.  Some old
Mosaic browsers have to bite the dust; a reasonable line is "if
a UA cannot handle Host: header fields it is hopeless".

>| Unfortunately, some older HTTP/1.0 clients did not deal
>| properly with an explicit charset parameter.

Yes, but it doesn't affect what servers do in this millennium,
unless you want it as justification for an explicit "MAY omit
Latin-1" in 2616bis.  Going back to the start, this could be
the excuse for senders to violate (see above) "Senders SHOULD
avoid this misbehavior by including a charset parameter."

Noting the good excuses to violate a SHOULD makes sense.  But
in this case, limited to HTTP/1.0 and some "hopeless" browsers,
I think 2616bis can get away without convoluted explanations.

>| HTTP/1.1 recipients MUST respect the charset label provided
>| by the sender; and those user agents that have a provision to
>| "guess" a charset MUST use the charset from the content-type
>| field if they support that charset, rather than the
>| recipient's preference, when initially displaying a document.

I don't see remotely why recipients "MUST" do this, it is just
information, with a proposed historical note explaining a case
where the information "Latin-1" could be wrong.

How clients display documents is not the business of HTTP, that
task is minimally two protocol layers above HTTP.  Let's strike
this paragraph, the word *display* triggered my bogon detector.

>| Internet media types are registered with a canonical form.  An
>| entity-body transferred via HTTP messages MUST be represented
>| in the appropriate canonical form prior to its transmission
>| except for "text" types, as defined in the next paragraph.

s/except for/with the possible exception of/  After all, HTTP
still "allows" the use of CRLF for canonical line ends in text/*.

>| HTTP relaxes this requirement and allows the transport of
>| text media with plain CR or LF alone representing a line
>| break when it is done consistently for an entire entity-body.

s/relaxes/does not depend on/ and then just say "with other
line break indicators including but not limited to bare LF".

The "bare CR" case is for a now historical platform, and IMO
2616bis doesn't need to talk about "bare CR" explicitly (?)  

>| HTTP applications MUST accept CRLF, bare CR, and bare LF
>| as being representative of a line break in text media
>| received via HTTP.

I think what you really want is "MUST NOT modify other line
break conventions on the fly", as opposed to non-binary FTP.
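A minimal sketch of what "accept all three conventions without
modifying the body" could mean in practice (the helper name and the
split-based approach are mine):

```python
import re

# Sketch: recognize CRLF, bare CR, and bare LF as line breaks when
# *interpreting* text media, while leaving the entity-body octets
# themselves untouched (no on-the-fly conversion, unlike ASCII-mode FTP).
def split_text_lines(body: bytes):
    # Match \r\n first so a CRLF pair is not counted as CR plus LF.
    return re.split(rb"\r\n|\r|\n", body)

lines = split_text_lines(b"one\r\ntwo\rthree\nfour")
```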

>| if the text is represented in a character set that does
>| not use octets 13 and 10 for CR and LF respectively, as
>| is the case for some multi-byte character sets

What about NL in text/xml Latin-1, a charset offering CRLF?
Or FWIW in UTF-1?  A note about one representative case
where octet 0x0A does not mean LF should suffice to make
the point, proposal:

"Note that octet 10 (decimal) does not necessarily mean LF
 (u+000A) in various charsets, e.g., u+010A in UTF-16."

Trim the rest, keeping something in the direction of:

>| This flexibility regarding line breaks applies only to
>| text media in the entity-body; a bare CR or LF MUST NOT
>| be substituted for CRLF within any of the HTTP control
>| structures (such as header fields and multipart boundaries).

Back to the subject of this thread:

>| When no explicit charset parameter is provided by the
>| sender, media subtypes of the "text" type are defined to
>| have a default charset value of "ISO-8859-1" when received
>| via HTTP.

s/are defined to have a default/used to have a default/.  Add:
"This HTTP/1.0 workaround for historic browsers choking on
 an explicit charset ISO-8859-1 is no longer needed; senders
 SHOULD (see 2.1.1) label ISO-8859-1 explicitly."
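The historical default rule being discussed can be sketched as a
recipient-side helper (the function name and parsing are illustrative,
not from 2616 or 2616bis):

```python
# Sketch: effective charset of a text/* body under the historical
# HTTP rule: use the explicit parameter if present, else ISO-8859-1.
def effective_charset(content_type: str) -> str:
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip('"')
    return "ISO-8859-1"  # historical HTTP/1.x default for text/*
```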

>| Data in character sets other than "ISO-8859-1" or its 
>| subsets MUST be labeled with an appropriate charset value.
>| See Section 2.1.1 for compatibility problems.

ACK.  Potential issues in your version:

>: When a media type is registered with a default charset value
>: of "US-ASCII", it MAY be used to label data transmitted via
>: HTTP in the "iso-8859-1" charset (a superset of US-ASCII)
>: without including an explicit charset parameter on the media
>: type.

For 2616bis that should not be a valid option (MAY), it should
be a *violation* of a new SHOULD for the stated historical reason.
Going from MAY to SHOULD NOT is possible, nothing breaks.

>: In addition, when a media type registered with a default
>: charset value of "US-ASCII" is received via HTTP without a
>: charset parameter or with a charset value of "iso-8859-1",
>: the recipient MAY inspect the data for indications of a
>: different character encoding

That is convoluted.  Certainly it "MAY" try to determine the
charset by sniffing if there is no charset, arguably it "must"
(lower case) do this for the (non-HTTP) purpose of displaying
a document.  And it "MAY" do this whenever it wishes; the case
of an erroneous iso-8859-1 IMO does not justify an HTTP "MAY".
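About the only sniffing that seems uncontroversial is BOM detection;
a minimal sketch under that assumption (the function name and the
restriction to three BOMs are mine, and anything beyond this is
content sniffing, off topic for HTTP):

```python
# Sketch: charset detection limited to Unicode byte order marks.
# Returns the detected encoding name, or None if no BOM is present.
def sniff_bom(body: bytes):
    boms = [
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if body.startswith(bom):
            return name
    return None
```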

As far as HTTP is concerned an explicit charset means what it
says, including charset="iso-8859-1".  Where that is incorrect
it is an ordinary bug on the side of the sender.  Limit this
oddity to a note (as proposed above).

>: if the encoding can be determined within the first 16 octets
>: of data and interpreted consistently thereafter.

Please no arbitrary magic numbers like "16" in a standard, let
alone in a standard where the complete "sniffing" business is
off topic.

>: Note: The first variance is due to a significant portion of
>: early HTTP user agents not parsing media type parameters and
>: instead relying on a then-common default encoding of iso-8859-1.
>: As a result, early server implementations avoided the use of
>: charset parameters and user agents evolved to "sniff" for new
>: character encodings as the Web expanded beyond iso-8859-1
>: content.

Yes, and (as you noted in another article) servers have no time
for any sniffing on their side for dynamic content.  But that
does not justify a "variance" going as far as an option (MAY),
violating a SHOULD NOT is good enough for this historical case.

I don't see why 2616bis should try to overrule text/xml defaults
with a MAY, as HTTP certainly does not try to tell clients what,
say, an image/x-icon might be, and how to display it.

>: The second variance is due to a certain popular user agent that
>: employed an unsafe encoding detection and switching algorithm
>: within documents that might contain user-provided data (see
>: Section security.sniffing), the most common workaround for
>: which is to supply a specific charset parameter even when the
>: actual character encoding is unknown.

No.  Plausible reasons why servers might intentionally lie with
"iso-8859-1" do not belong in an Internet standard.  If a UA is
broken it needs to be fixed.  Servers could also try their luck
with the registered "unknown-8bit" instead of lying, this is out
of scope for HTTP.

Received on Thursday, 14 February 2008 17:16:38 GMT
