- From: Roy T. Fielding <fielding@apache.org>
- Date: Wed, 9 Jul 2003 22:03:54 +0200
- To: "Ian B. Jacobs" <ij@w3.org>
- Cc: www-tag@w3.org
Sorry for the late comments, but the following is incorrect:

> For this reason, servers should only supply a character encoding header when there is complete certainty as to the encoding in use.

First, character encoding is a parameter of the media type value that is provided by the MIME and HTTP Content-Type header field. There is no character encoding header.

Second, the charset parameter is usually supplied by the server if the security checks it places on the content are dependent upon the configured character encoding for that content. This strategy exists because of security flaws in deployed browsers that allow auto-selection of character encoding to change the interpretation of certain fields from raw data to executable content. So, contrary to this finding, such servers must provide a default charset parameter to work around security flaws and, in particular, the boneheaded way that browsers try to autoselect character encoding, which is not recommended by HTTP.

> Otherwise, an error will cause a perfectly usable representation to be rejected by an architecturally sound client. Section 7.1 of [RFC3023] states:

It isn't a perfectly usable representation. It is a configuration mismatch between the intended charset and the actual charset. This is no different from the error regarding mislabeling the media type -- the correct action is for the client to refuse to render the content unless the workaround is approved by the user. Otherwise, the content will remain mislabeled.

> The use of the charset parameter is STRONGLY RECOMMENDED, since this information can be used by XML processors to determine authoritatively the charset of the XML MIME entity.
>
> However, a receiving application can, with very high reliability, determine the character encoding of an XML document by reading it

Sorry, that is completely false. Folks should read the number of security vulnerabilities caused by such thinking before declaring that it is the case.
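As a rough illustration of the first point above (charset is a parameter inside the Content-Type value, not a header of its own), here is a minimal Python sketch. The helper names parse_content_type and decode_body are hypothetical and not taken from any specification. The sketch assumes the standard library's email.message.Message for parameter parsing, and a recipient that decodes strictly with the declared charset instead of guessing from the bytes.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type field value into (media type, charset parameter).

    Hypothetical helper: leans on the stdlib MIME parser, which lowercases
    both the media type and the charset value.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a body using only the declared charset (no sniffing).

    Hypothetical helper: raises instead of guessing when the label is
    missing or does not match the bytes.
    """
    _media_type, charset = parse_content_type(content_type)
    if charset is None:
        raise ValueError("no charset parameter in Content-Type value")
    # errors="strict" makes a label/content mismatch fail loudly.
    return body.decode(charset, errors="strict")


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=ISO-8859-1"))
    print(decode_body(b"caf\xe9", "text/plain; charset=iso-8859-1"))
    try:
        # Same bytes, wrong label: the mismatch surfaces as an error.
        decode_body(b"caf\xe9", "text/plain; charset=utf-8")
    except UnicodeDecodeError as exc:
        print("mislabeled content:", exc.reason)
```

In this sketch a mislabeled body raises an error rather than being silently reinterpreted, which loosely mirrors the refuse-to-render stance described above. How a real client would surface that to the user is left open.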
The purpose of the charset parameter is to reduce the complexity of implementations so that they don't need to read the content character-by-character to determine the character encoding. The only time it should not be provided is when the content contains multiple character encodings, and even then there should be a standard way of indicating that as part of the media type value.

BTW, on a related point, I will note that the W3C working groups responsible for all of the exceptions requested on this point have still failed to register their media types with IANA. I just spent an hour digging through the W3C site to pick up some of these types for the Apache configuration file, since I am tired of waiting for the appropriate authors. People claiming that the registration process is slow should be ashamed of themselves -- there are dozens of new types since the last update with far less applicability and deployment. The only organization that seems incapable of registering deployed types is the W3C. Whatever the problem is, it sure as heck isn't the IANA process.

Finally, referring to representation metadata as "MIME headers" is only applicable to e-mail. They are called different things in HTTP and NNTP, even though they share the same field names. In general, they should be referred to as metadata that defines how the content is to be interpreted by the recipient. Where appropriate, the specific field name "Content-Type" should be described in quotes, and its value should always be described as an Internet media type (MIME type is a term that was deprecated eight years ago).

....Roy
Received on Wednesday, 9 July 2003 16:03:54 UTC