security impact of dropping charset default [Re: text/* types and charset defaults [i20]]

> To be clear, we're talking about removing 
> <http://tools.ietf.org/id/draft-ietf-httpbis-p3-payload-01.txt>, section 
> 2.3.1, the entire fourth paragraph (i.e., the last one in that section). 
> This includes removing both the defaulting and the MUST-level 
> requirement for labeling text/* in a charset other than ISO-8859-1.

In general, I agree with dropping the "ISO-8859-1" default for text/*
content types. For "text/html", however, I have a specific concern.

As many other people have mentioned, many current browsers ignore the
HTTP/1.1 specification and implement charset auto-detection and
<meta http-equiv> tag detection.  This has caused several cross-site
scripting vulnerabilities.

The attack works by inserting an ASCII byte sequence that looks like a
UTF-7-escaped string at some early point in the document (the part
browsers examine for character set detection), and then inserting a
UTF-7-encoded <script> tag into the document.
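
To illustrate why such a tag slips past ASCII-oriented filtering, here
is a minimal Python sketch. The payload string is a hypothetical example
written by hand in UTF-7 (it is not taken from any real attack), and the
use of Python's built-in "utf-7" codec is only a stand-in for whatever
decoder a browser would apply after guessing the charset.

```python
# Hypothetical attacker-controlled text, written by hand in UTF-7.
# "+ADw-" and "+AD4-" are the UTF-7 escapes for "<" and ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Every octet is printable ASCII, so a filter that looks for a literal
# "<" octet finds nothing to escape.
only_ascii = all(0x20 <= octet <= 0x7E for octet in payload)
contains_angle_bracket = b"<" in payload

# Read as ASCII, the payload is harmless text.
as_ascii = payload.decode("ascii")

# Read as UTF-7 (what a client does if it guesses that charset),
# the same octets become markup.
as_utf7 = payload.decode("utf-7")

print(only_ascii, contains_angle_bracket)
print(as_ascii)
print(as_utf7)
```

The same octets yield two different documents depending only on which
charset the client picks, which is why the choice cannot safely be left
to guessing.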

The most effective countermeasure to this attack is to declare the
charset in the HTTP header.  However, there are some issues with that:

  * It is not always possible to declare charsets in HTTP headers,
    especially for static content.

  * Charsets are a somewhat "open" standard, at least from the viewpoint
    of the HTTP WG and W3C. It is not possible to ban future problematic
    charsets (like UTF-7) from being defined.

  * Charset auto-detection and <meta http-equiv="content-type"> charset
    detection interfere with each other. However, it is almost
    impossible to specify the detailed behavior of charset detection
    algorithms.

  * Existing ASCII-based applications should be kept safe for backward
    compatibility, at least at the specification level.

There are a number of ways to solve this, and my current preference is
to add the following restrictions regarding charset auto-detection:

  * If a charset is declared in the header, it MUST be honored (the
    current requirement in 2.1.1 may be copied).

  * If a charset is not declared in the header, clients MAY guess the
    charset of the payload by any means (e.g. by examining the payload
    octets, using special attributes defined for content-types, or
    using client-defined defaults).  However, if the payload is
    composed solely of octets representing ASCII printable characters
    and HTML-defined control characters (CR, LF, HT, VT and SP), it MUST
    be treated as if it were in ASCII or an equivalent character set. If
    the payload contains other octets, the behavior of clients is
    implementation-dependent.
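
The two restrictions above can be sketched as a small decision
function. This is only an illustration of the proposed rule, not
specification text: the names is_ascii_only and choose_charset, the
"US-ASCII" label, and the guess callback (standing in for whatever
implementation-dependent detection a client uses) are all assumptions
made for the example.

```python
from typing import Callable, Optional

# Octets named in the proposal: ASCII printable characters (0x21-0x7E)
# plus the control characters CR, LF, HT, VT and SP.
_ASCII_SAFE = frozenset(range(0x21, 0x7F)) | {0x0D, 0x0A, 0x09, 0x0B, 0x20}


def is_ascii_only(payload: bytes) -> bool:
    """True if every octet is one of the octets listed in the proposal."""
    return all(octet in _ASCII_SAFE for octet in payload)


def choose_charset(declared: Optional[str], payload: bytes,
                   guess: Callable[[bytes], str]) -> str:
    """Pick a charset following the two proposed restrictions."""
    # Rule 1: a charset declared in the header MUST be honored.
    if declared is not None:
        return declared
    # Rule 2: an all-ASCII payload MUST be treated as ASCII, so the
    # client never gets the chance to guess UTF-7 for it.
    if is_ascii_only(payload):
        return "US-ASCII"
    # Otherwise the behavior is implementation-dependent.
    return guess(payload)
```

For example, choose_charset(None, b"+ADw-script+AD4-", guess) returns
"US-ASCII" whatever guess would have said, while a payload containing
an octet such as 0xE9 is still handed to guess.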

Under the above specification, the client is not allowed to guess a
charset that is not a superset of ASCII (such as UTF-7).  The real
intention of this specification is to make detection of <meta> tags much
more reliable.  If UTF-7 and future ASCII-incompatible charsets are
excluded, Web authors can put a <meta> declaration at the very top of an
HTML document and expect that the browser will respect it (as required
by the W3C spec).  We could go further and make such detection
mandatory, but I feel that is overkill for HTTP.

(I have demoted ISO-8859-1 backward compatibility to the
  implementation-defined level.  I once wrote a proposal that included
  full ISO-8859-1 compatibility, but it became far too complicated
  and unrealistic.  I hope this does not cause any real problems.)

-- 
Yutaka OIWA, Ph.D.                                       Research Scientist
                             Research Center for Information Security (RCIS)
     National Institute of Advanced Industrial Science and Technology (AIST)
                       Mail addresses: <y.oiwa@aist.go.jp>, <yutaka@oiwa.jp>
OpenPGP: id[995DD3E1] fp[3C21 17D0 D953 77D3 02D7 4FEC 4754 40C1 995D D3E1]

Received on Wednesday, 23 January 2008 02:26:30 UTC