- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Thu, 14 Feb 2008 18:18:03 +0100
- To: ietf-http-wg@w3.org
Roy T. Fielding wrote:

[p3 version]
>| Some HTTP/1.0 software has interpreted a Content-Type header without
>| charset parameter incorrectly to mean "recipient should guess."
>| Senders wishing to defeat this behavior MAY include a charset
>| parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
>| SHOULD do so when it is known that it will not confuse the
>| recipient.

When senders wish to defeat that s/behavior/misbehavior/ they
s/MAY/have to/ include a charset parameter. It is not an OPTION
after a decision to try it. The optional part could be
s/Senders wishing/Senders MAY wish/, but a recommendation is
clearer:

"Senders SHOULD avoid this misbehavior by including a charset
parameter."

(period) Strike the "even when" Latin-1 blurb, as that's precisely
the cause of this historical mess. Explicit Latin-1 really *is*
Latin-1, and where that is not the case (e.g. windows-1252) it is
broken.

Specifying varying degrees of brokenness is a dubious idea, maybe
put it in a note:

"Note: Historically Latin-1 ([ISO-8859-1]) was a predominant
charset, and some senders explicitly announced this charset
even when it was incorrect."

The conclusions for readers and implementors are obvious: there
be dragons.

>| and SHOULD do so when it is known that it will not confuse
>| the recipient.

Strike that, nobody knows what confuses recipients. Some old
Mosaic browsers have to bite, a reasonable line is "if a UA
cannot do Host: header fields it is hopeless".

>| Unfortunately, some older HTTP/1.0 clients did not deal
>| properly with an explicit charset parameter.

Yes, but it doesn't affect what servers do in this millennium,
unless you want it as justification for an explicit "MAY omit
Latin-1" in 2616bis. Going back to the start, this could be the
excuse where senders violate (see above) "Senders SHOULD avoid
this misbehavior by including a charset parameter."

Noting the good excuses to violate a SHOULD makes sense. But in
this case, limited to HTTP/1.0 and some "hopeless" browsers, I
think 2616bis can get away without convoluted explanations.

>| HTTP/1.1 recipients MUST respect the charset label provided
>| by the sender; and those user agents that have a provision to
>| "guess" a charset MUST use the charset from the content-type
>| field if they support that charset, rather than the
>| recipient's preference, when initially displaying a document.

I don't see remotely why recipients "MUST" do this, it is just
information, with a proposed historical note explaining a case
where the information "Latin-1" could be wrong.

How clients display documents is not the business of HTTP, that
task is minimally two protocol layers above HTTP. Let's strike
this paragraph, the word *display* triggered my bogon detector.

>| Internet media types are registered with a canonical form. An
>| entity-body transferred via HTTP messages MUST be represented
>| in the appropriate canonical form prior to its transmission
>| except for "text" types, as defined in the next paragraph.

s/except for/with the possible exception of/

After all, HTTP still "allows" the use of CRLF for canonical line
ends in text/*.

>| HTTP relaxes this requirement and allows the transport of
>| text media with plain CR or LF alone representing a line
>| break when it is done consistently for an entire entity-body.

s/relaxes/does not depend on/ and then just say "with other line
break indicators including but not limited to bare LF". The "bare
CR" case is for a now historical platform, and IMO 2616bis doesn't
need to talk about "bare CR" explicitly (?)

>| HTTP applications MUST accept CRLF, bare CR, and bare LF
>| as being representative of a line break in text media
>| received via HTTP.

I think what you really want is "MUST NOT modify other line break
conventions on the fly", as opposed to non-binary FTP.
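A minimal Python sketch of the two readings discussed above, assuming the entity-body has already been decoded to text: a recipient can recognise CRLF, bare CR, and bare LF as line breaks for its own purposes without rewriting the body it was given. The name `split_lines` is hypothetical and not taken from any draft.

```python
import re

# CRLF must come first in the alternation so that it counts as one break,
# not as a bare CR followed by a bare LF.
LINE_BREAK = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split decoded text on CRLF, bare CR, or bare LF (hypothetical helper).

    The input string is not modified, so the sender's original line break
    convention is still available to whoever stores or forwards the body.
    """
    return LINE_BREAK.split(text)

body = "first\r\nsecond\rthird\nfourth"
lines = split_lines(body)
```

Here `lines` holds the four logical lines while `body` keeps its mixed line breaks untouched, which is the "accept, but do not convert on the fly" behaviour.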
>| if the text is represented in a character set that does
>| not use octets 13 and 10 for CR and LF respectively, as
>| is the case for some multi-byte character sets

What about NL in text/xml Latin-1, a charset offering CRLF?
Or FWIW in UTF-1? A note about one representative case where
the octet 0A does not mean LF should suffice to make the point,
proposal:

"Note that octet 10 (decimal) does not necessarily mean LF
(u+000A) in various charsets, e.g., u+010A in UTF-16."

Trim the rest, keeping something in the direction of:

>| This flexibility regarding line breaks applies only to
>| text media in the entity-body; a bare CR or LF MUST NOT
>| be substituted for CRLF within any of the HTTP control
>| structures (such as header fields and multipart boundaries).
[...]

Back to the subject of this thread:

>| When no explicit charset parameter is provided by the
>| sender, media subtypes of the "text" type are defined to
>| have a default charset value of "ISO-8859-1" when received
>| via HTTP.

s/are defined to have a default/used to have a default/. Add:

"This HTTP/1.0 work around for historic browsers choking on an
explicit charset ISO-8859-1 is no longer needed, senders SHOULD
(see 2.1.1) label ISO-8859-1 explicitly."

>| Data in character sets other than "ISO-8859-1" or its
>| subsets MUST be labeled with an appropriate charset value.
>| See Section 2.1.1 for compatibility problems.

ACK. Potential issues in your version:

>: When a media type is registered with a default charset value
>: of "US-ASCII", it MAY be used to label data transmitted via
>: HTTP in the "iso-8859-1" charset (a superset of US-ASCII)
>: without including an explicit charset parameter on the media
>: type.

For 2616bis that should be no valid option (MAY), it should be a
*violation* of a new SHOULD for the stated historical reason.
Going from MAY to SHOULD NOT is possible, nothing breaks.
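The proposed note about octet 10 can be checked with a short Python sketch. It assumes big-endian UTF-16 without a byte order mark; the variable names are only illustrative.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) encoded as big-endian
# UTF-16 is the octet pair 0x01 0x0A.
encoded = "\u010a".encode("utf-16-be")

# The raw octets contain 0x0A (decimal 10) ...
contains_octet_10 = b"\x0a" in encoded

# ... but the decoded text contains no LF (U+000A) at all.
decoded_has_lf = "\n" in encoded.decode("utf-16-be")
```

Scanning the raw octets for 10 would therefore report a line break that the text does not contain, which is why line break handling has to happen after the charset is known.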
>: In addition, when a media type registered with a default
>: charset value of "US-ASCII" is received via HTTP without a
>: charset parameter or with a charset value of "iso-8859-1",
>: the recipient MAY inspect the data for indications of a
>: different character encoding [...]

That is convoluted. Certainly it "MAY" try to determine the
charset by sniffing if there is no charset, arguably it "must"
(lower case) do this for the (non-HTTP) purpose of displaying a
document. And it "MAY" do this whenever it wishes, the case of
an erroneous iso-8859-1 IMO does not justify an HTTP "MAY".

As far as HTTP is concerned, an explicit charset means what it
says, including charset="iso-8859-1". Where that is incorrect, it
is an ordinary bug on the side of the sender. Limit this oddity
to a note (as proposed above).

>: if the encoding can be determined within the first 16 octets
>: of data and interpreted consistently thereafter.

Please no arbitrary magic numbers like "16" in a standard, let
alone in a standard where the complete "sniffing" business is
off topic.

>: Note: The first variance is due to a significant portion of
>: early HTTP user agents not parsing media type parameters and
>: instead relying on a then-common default encoding of iso-8859-1.
>: As a result, early server implementations avoided the use of
>: charset parameters and user agents evolved to "sniff" for new
>: character encodings as the Web expanded beyond iso-8859-1
>: content.

Yes, and (as you noted in another article) servers have no time
for any sniffing on their side for dynamic content. But that
does not justify a "variance" going as far as an option (MAY),
violating a SHOULD NOT is good enough for this historical case.

I don't see why 2616bis should try to overrule text/xml defaults
with a MAY, as HTTP certainly does not try to tell clients what,
say, an image/x-icon might be, and how to display it.
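A minimal Python sketch of "an explicit charset means what it says", assuming the standard library `email.message.Message` parser is an acceptable stand-in for parsing an HTTP Content-Type value. The helper name `explicit_charset` is hypothetical.

```python
from email.message import Message

def explicit_charset(content_type):
    """Return the charset label the sender gave, lowercased, or None.

    Hypothetical helper: it only reports the label. Whether a recipient
    sniffs when the result is None is outside what this sketch decides.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

labelled = explicit_charset("text/html; charset=ISO-8859-1")
unlabelled = explicit_charset("text/html")
```

With this split, a label of iso-8859-1 is reported as given and a missing label is reported as missing; any guessing stays a separate, non-HTTP concern of the recipient.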
>: The second variance is due to a certain popular user agent that
>: employed an unsafe encoding detection and switching algorithm
>: within documents that might contain user-provided data (see
>: Section security.sniffing), the most common workaround for
>: which is to supply a specific charset parameter even when the
>: actual character encoding is unknown.

No. Plausible reasons why servers might intentionally lie with
"iso-8859-1" do not belong in an Internet standard. If a UA is
broken, it needs to be fixed. Servers could also try their luck
with the registered "unknown-8bit" instead of lying, this is out
of scope for HTTP.

Frank
Received on Thursday, 14 February 2008 17:16:38 UTC