
Re: Unknown text/* subtypes [i20]

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 14 Feb 2008 18:18:03 +0100
To: ietf-http-wg@w3.org
Message-ID: <fp1t12$n0e$1@ger.gmane.org>

Roy T. Fielding wrote:

 [p3 version] 
>| Some HTTP/1.0 software has interpreted a Content-Type header without
>| charset parameter incorrectly to mean "recipient should guess."
>| Senders wishing to defeat this behavior MAY include a charset
>| parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
>| SHOULD do so when it is known that it will not confuse the  
>| recipient.

When senders wish to defeat that s/behavior/misbehavior/ they
s/MAY/have to/ include a charset parameter.  It is not an
OPTION after a decision to try it.

The optional part could be s/Senders wishing/Senders MAY wish/,
but a recommendation is clearer:  "Senders SHOULD avoid this
misbehavior by including a charset parameter." (period)

Strike the "even when" Latin-1 blurb, as that's precisely the
cause of this historical mess.  Explicit Latin-1 really *is*
Latin-1, and where that is not the case (e.g. windows-1252)
it is broken.  Specifying varying degrees of brokenness is a
dubious idea, maybe put it in a note:

"Note:  Historically Latin-1 ([ISO-8859-1]) was a predominant
charset, and some senders explicitly announced this charset
even when it was incorrect."  The conclusions for readers and
implementors are obvious, there be dragons.
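To make the recommended sender behavior concrete, here is a tiny Python sketch (the helper name `make_content_type` is mine, not from any spec or library): senders label text/* explicitly instead of leaving recipients to guess.

```python
# Sketch only: always attach an explicit charset to text/* media,
# rather than relying on the recipient's default (or its guessing).
def make_content_type(subtype: str, charset: str = "ISO-8859-1") -> str:
    """Build a text/* Content-Type header value with an explicit charset."""
    return f"text/{subtype}; charset={charset}"

print(make_content_type("plain"))          # text/plain; charset=ISO-8859-1
print(make_content_type("html", "UTF-8"))  # text/html; charset=UTF-8
```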

>| and SHOULD do so when it is known that it will not confuse
>| the recipient.

Strike that, nobody knows what confuses recipients.  Some old
Mosaic browsers have to bite the bullet; a reasonable line is
"if a UA cannot do Host: header fields it is hopeless".

>| Unfortunately, some older HTTP/1.0 clients did not deal
>| properly with an explicit charset parameter.

Yes, but it doesn't affect what servers do in this millennium,
unless you want it as justification for an explicit "MAY omit
Latin-1" in 2616bis.  Going back to the start, this could be
the excuse for senders to violate (see above) "Senders SHOULD
avoid this misbehavior by including a charset parameter."

Noting the good excuses to violate a SHOULD makes sense.  But
in this case, limited to HTTP/1.0 and some "hopeless" browsers,
I think 2616bis can get away without convoluted explanations.

>| HTTP/1.1 recipients MUST respect the charset label provided
>| by the sender; and those user agents that have a provision to
>| "guess" a charset MUST use the charset from the content-type
>| field if they support that charset, rather than the
>| recipient's preference, when initially displaying a document.

I don't see remotely why recipients "MUST" do this, it is just
information, with a proposed historical note explaining a case
where the information "Latin-1" could be wrong.

How clients display documents is not the business of HTTP, that
task is minimally two protocol layers above HTTP.  Let's strike
this paragraph, the word *display* triggered my bogon detector.

>| Internet media types are registered with a canonical form.  An
>| entity-body transferred via HTTP messages MUST be represented
>| in the appropriate canonical form prior to its transmission
>| except for "text" types, as defined in the next paragraph.

s/except for/with the possible exception of/  After all, HTTP
still "allows" the use of CRLF for canonical line ends in text/*.

>| HTTP relaxes this requirement and allows the transport of
>| text media with plain CR or LF alone representing a line
>| break when it is done consistently for an entire entity-body.

s/relaxes/does not depend on/ and then just say "with other
line break indicators including but not limited to bare LF".

The "bare CR" case is for a now-historical platform (classic
Mac OS), and IMO 2616bis doesn't need to talk about "bare CR"
explicitly (?)

>| HTTP applications MUST accept CRLF, bare CR, and bare LF
>| as being representative of a line break in text media
>| received via HTTP.

I think what you really want is "MUST NOT modify other line
break conventions on the fly", as opposed to non-binary FTP.
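A sketch of that reading-side tolerance in Python (my own helper, not any standard API): treat CRLF, bare CR, and bare LF all as line breaks when parsing, while leaving the entity-body itself untouched rather than rewriting it on the fly.

```python
import re

# Recognize any of the three conventions when *reading* text media;
# the entity-body itself is never modified.  Note the alternation
# order: CRLF must be tried before bare CR.
LINE_BREAK = re.compile(rb"\r\n|\r|\n")

def split_text_lines(body: bytes) -> list[bytes]:
    """Split a text/* entity-body on CRLF, bare CR, or bare LF."""
    return LINE_BREAK.split(body)

assert split_text_lines(b"a\r\nb\rc\nd") == [b"a", b"b", b"c", b"d"]
```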

>| if the text is represented in a character set that does
>| not use octets 13 and 10 for CR and LF respectively, as
>| is the case for some multi-byte character sets

What about NL in text/xml Latin-1, a charset offering CRLF ?
Or FWIW in UTF-1 ?  A note about one representative case
where octet 0A does not mean LF should suffice to make
the point, proposal:

"Note that octet 10 (decimal) does not necessarily mean LF
 (u+000A) in various charsets; e.g., it can be the second
 octet of u+010A in UTF-16."
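The point of the note can be checked mechanically; a small Python demonstration (illustrative only, the specific code point is arbitrary) shows the LF octet occurring inside the UTF-16 encoding of a character that is not a line break at all:

```python
# The byte 0x0A is only LF in charsets that encode U+000A as a single
# octet.  In UTF-16BE the character U+010A is the byte pair 01 0A, so
# a naive octet scan for 0x0A finds a "line break" that is not there.
encoded = "\u010a".encode("utf-16-be")
assert encoded == b"\x01\x0a"
assert 0x0A in encoded        # the LF octet appears in the encoding...
assert "\n" not in "\u010a"   # ...but the text contains no LF
```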

Trim the rest, keeping something in the direction of:

>| This flexibility regarding line breaks applies only to
>| text media in the entity-body; a bare CR or LF MUST NOT
>| be substituted for CRLF within any of the HTTP control
>| structures (such as header fields and multipart boundaries).
[...]

Back to the subject of this thread:

>| When no explicit charset parameter is provided by the
>| sender, media subtypes of the "text" type are defined to
>| have a default charset value of "ISO-8859-1" when received
>| via HTTP.

s/are defined to have a default/used to have a default/.  Add:
"This HTTP/1.0 workaround for historic browsers choking on
 an explicit charset ISO-8859-1 is no longer needed; senders
 SHOULD (see 2.1.1) label ISO-8859-1 explicitly."
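Receiver-side, the old default amounts to something like this Python sketch (the function name is mine; it leans on the stdlib email parser for the parameter): honor an explicit charset, and fall back to ISO-8859-1 only when the text/* label carries none.

```python
from email.message import Message

# Sketch of the receiver-side default for text/* received via HTTP:
# an explicit charset parameter wins; absence falls back to Latin-1.
def effective_charset(content_type: str) -> str:
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset", failobj="ISO-8859-1")

assert effective_charset("text/plain") == "ISO-8859-1"
assert effective_charset("text/plain; charset=UTF-8") == "UTF-8"
```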

>| Data in character sets other than "ISO-8859-1" or its 
>| subsets MUST be labeled with an appropriate charset value.
>| See Section 2.1.1 for compatibility problems.

ACK.  Potential issues in your version:

>: When a media type is registered with a default charset value
>: of "US-ASCII", it MAY be used to label data transmitted via
>: HTTP in the "iso-8859-1" charset (a superset of US-ASCII)
>: without including an explicit charset parameter on the media
>: type.

For 2616bis that should be no valid option (MAY), it should be
a *violation* of a new SHOULD for the stated historical reason.
Going from MAY to SHOULD NOT is possible, nothing breaks.

>: In addition, when a media type registered with a default
>: charset value of "US-ASCII" is received via HTTP without a
>: charset parameter or with a charset value of "iso-8859-1",
>: the recipient MAY inspect the data for indications of a
>: different character encoding
[...]

That is convoluted.  Certainly it "MAY" try to determine the
charset by sniffing if there is no charset, arguably it "must"
(lower case) do this for the (non-HTTP) purpose of displaying
a document.  And it "MAY" do this whenever it wishes; the case
of an erroneous iso-8859-1 IMO does not justify an HTTP "MAY".

As far as HTTP is concerned an explicit charset means what it
says, including charset="iso-8859-1".  Where that is incorrect
it is an ordinary bug on the side of the sender.  Limit this
oddity to a note (as proposed above).

>: if the encoding can be determined within the first 16 octets
>: of data and interpreted consistently thereafter.

Please no arbitrary magic numbers like "16" in a standard, let
alone in a standard where the complete "sniffing" business is
off topic.
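For the record, about the only well-defined signal within the first few octets is a Unicode byte order mark; everything beyond that is heuristic guessing.  A hedged Python sketch (helper name and table are mine):

```python
import codecs

# BOM detection is the one deterministic "first few octets" check;
# it says nothing about the vast majority of unlabeled content.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def charset_from_bom(data: bytes):
    """Return the charset named by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

assert charset_from_bom(b"\xef\xbb\xbfhello") == "UTF-8"
assert charset_from_bom(b"plain ascii") is None
```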

>: Note: The first variance is due to a significant portion of
>: early HTTP user agents not parsing media type parameters and
>: instead relying on a then-common default encoding of iso-8859-1.
>: As a result, early server implementations avoided the use of
>: charset parameters and user agents evolved to "sniff" for new
>: character encodings as the Web expanded beyond iso-8859-1
>: content.

Yes, and (as you noted in another article) servers have no time
for any sniffing on their side for dynamic content.  But that
does not justify a "variance" going as far as an option (MAY);
violating a SHOULD NOT is good enough for this historical case.

I don't see why 2616bis should try to overrule text/xml defaults
with a MAY, as HTTP certainly does not try to tell clients what,
say, image/x-icon might be, and how to display it.

>: The second variance is due to a certain popular user agent that
>: employed an unsafe encoding detection and switching algorithm
>: within documents that might contain user-provided data (see
>: Section security.sniffing), the most common workaround for
>: which is to supply a specific charset parameter even when the
>: actual character encoding is unknown.

No.  Plausible reasons why servers might intentionally lie with
"iso-8859-1" do not belong in an Internet standard.  If a UA is
broken it needs to be fixed.  Servers could also try their luck
with the registered "unknown-8bit" instead of lying; this is out
of scope for HTTP.

 Frank
Received on Thursday, 14 February 2008 17:16:38 GMT
