- From: Yutaka Oiwa <y.oiwa@aist.go.jp>
- Date: Wed, 23 Jan 2008 11:26:15 +0900
- To: Mark Nottingham <mnot@mnot.net>
- CC: Julian Reschke <julian.reschke@gmx.de>, "'HTTP Working Group'" <ietf-http-wg@w3.org>
> To be clear, we're talking about removing
> <http://tools.ietf.org/id/draft-ietf-httpbis-p3-payload-01.txt>, section
> 2.3.1, the entire forth paragraph (i.e., the last one in that section).
> This includes removing both the defaulting and the MUST-level
> requirement for labeling text/* in a charset other than ISO-8859-1.

In general, I agree with dropping the "ISO-8859-1" default for text/*
content types. For "text/html", however, I have a specific concern.

As many other people have mentioned, many current browsers ignore the
HTTP/1.1 specification and implement charset auto-detection and
<meta http-equiv> tag detection. This has caused several cross-site
scripting vulnerabilities. The attack works by inserting an ASCII byte
sequence that looks like a UTF-7 escaped string at an early point in the
document (the part browsers examine for character set detection), and
then inserting a UTF-7-encoded <script> tag into the document.

The most effective countermeasure to this attack is to declare the
charset in the HTTP header. However, there are some issues with that:

 * It is not always possible to declare charsets in HTTP headers,
   especially for static content.

 * Charsets are a somewhat "open" standard, at least from the viewpoint
   of the HTTP WG and W3C. It is not possible to ban future problematic
   charsets (like UTF-7) from being defined.

 * Charset auto-detection and <meta http-equiv="content-type"> charset
   detection interfere with each other. However, it is almost impossible
   to specify the detailed behavior of charset detection algorithms.

 * Existing ASCII-based applications should be kept safe for backward
   compatibility, at least at the specification level.

There are a number of ways to solve this. My current preference is to add
the following restrictions on charset auto-detection:

 * If a charset is declared in the header, it MUST be honored.
   (The current requirement in 2.1.1 may be copied.)

 * If a charset is not declared in the header, clients MAY guess the
   charset of the payload by any means (e.g. by examining the payload
   octets, using special attributes defined for the content type, or
   using client-defined defaults).
   However, if the payload is composed solely of octets representing
   ASCII printable characters and HTML-defined control characters
   (CR, LF, HT, VT and SP), it MUST be treated as if it is in ASCII or
   an equivalent character set.
   If the payload contains other octets, the behavior of clients is
   implementation-dependent.

Under this rule, the client is not allowed to guess a charset that is
not upward-compatible with ASCII (such as UTF-7).

The true intention of this rule is to make detection of <meta> tags much
more reliable. If UTF-7 and future ASCII-incompatible charsets are
excluded, Web authors can put a <meta> declaration at the very top of an
HTML document and expect that the browser will respect it (as required
by the W3C spec). We could go further and make such detection mandatory,
but I feel that would be overkill for HTTP.

(I have reduced ISO-8859-1 backward compatibility to the
implementation-defined level. I once wrote a proposal that included full
ISO-8859-1 compatibility, but it became too complicated and unrealistic.
I hope this does not cause any real problems.)

-- 
Yutaka OIWA, Ph.D.                                       Research Scientist
                            Research Center for Information Security (RCIS)
    National Institute of Advanced Industrial Science and Technology (AIST)
                      Mail addresses: <y.oiwa@aist.go.jp>, <yutaka@oiwa.jp>
OpenPGP: id[995DD3E1] fp[3C21 17D0 D953 77D3 02D7 4FEC 4754 40C1 995D D3E1]
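
For illustration only, the two proposed restrictions could be sketched
roughly as follows. This is a minimal sketch, not proposed specification
text: the names effective_charset and SAFE_OCTETS are hypothetical, the
"US-ASCII" label is an assumed stand-in for "ASCII or an equivalent
character set", and returning None is just one way to model the
"implementation-dependent" case.

```python
from typing import Optional

# Octets the proposal treats as plain ASCII: printable characters
# 0x21-0x7E plus the listed controls CR, LF, HT, VT and SP.
# (Hypothetical name, introduced only for this sketch.)
SAFE_OCTETS = frozenset(range(0x21, 0x7F)) | {0x0D, 0x0A, 0x09, 0x0B, 0x20}


def effective_charset(header_charset: Optional[str],
                      payload: bytes) -> Optional[str]:
    """Return the charset a client must use, or None if it may guess."""
    # Rule 1: a charset declared in the HTTP header MUST be honored.
    if header_charset:
        return header_charset
    # Rule 2: a payload made only of ASCII printable and listed control
    # octets MUST be treated as ASCII, so it can never be sniffed as
    # UTF-7 or another ASCII-incompatible charset.
    if all(octet in SAFE_OCTETS for octet in payload):
        return "US-ASCII"
    # Otherwise the behavior is implementation-dependent: the client
    # is free to apply its own detection.
    return None
```

Under this sketch, a payload such as b"+ADw-script+AD4-" (the UTF-7
escapes for "<script>") consists only of ASCII printable octets, so
without a header charset it has to be read as plain ASCII text and the
escaped tag is never decoded.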
Received on Wednesday, 23 January 2008 02:26:30 UTC