Re: Unknown text/* subtypes [i20] from Robert Sayre on 2008-02-13 (ietf-http-wg@w3.org from January to March 2008)

From: Robert Sayre <rsayre@mozilla.com>
Date: Wed, 13 Feb 2008 17:01:04 -0500
To: Roy T. Fielding <fielding@gbiv.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>, Julian Reschke <julian.reschke@gmx.de>, Geoffrey Sneddon <foolistbar@googlemail.com>, Mark Nottingham <mnot@mnot.net>
Message-Id: <D7031C3C-A5AE-45CF-A34B-B88DB3F7963E@mozilla.com>

On Feb 12, 2008, at 4:12 PM, Roy T. Fielding wrote:
>
> the Web consists of dozens of different charsets,
> most of which are left unlabeled because there is no commonly accepted
> way of indicating charsets in filename metadata (and no real need to
> anyway, since user agents will either sniff the content anyway or just
> assume everything is in the fixed local charset known by the tool).
>

Fully agree.


> Servers, OTOH, send text/* content with the assumption that it will be
> treated as iso-8859-1 (or at least some safe superset of US-ASCII).

Somewhat disagree. I think many servers assume that UAs will sniff
and deal with the issue for them.
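
To make the two assumptions concrete, here is a minimal sketch of a
user agent choosing a charset for a text/* response. It is only an
illustration: the function names, the BOM-only sniffing step, and the
fallback order are assumptions, not the behaviour of any particular
browser or text from the spec.

```python
import codecs

# Byte-order marks checked by the (deliberately tiny) sniffing step.
# Real user agents sniff far more than this; this list is an assumption.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def charset_param(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()
    return None


def pick_charset(content_type, body, default="iso-8859-1"):
    """Hypothetical order: explicit label, then BOM sniff, then default."""
    label = charset_param(content_type)
    if label:
        return label          # the server labeled the content
    for bom, name in BOMS:
        if body.startswith(bom):
            return name       # unlabeled: the UA sniffs
    return default            # unlabeled, nothing sniffed: fixed default
```

A server that relies on the last branch is assuming iso-8859-1; one that
relies on the middle branch is assuming the UA will sniff.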

>
> Servers don't sniff content because they can't -- it is impossible to
> look at every byte of a page while handling 7,000 reqs/sec, let alone
> the 20,000 reqs/sec that a decently tuned server can handle.  In
> addition, some servers (particularly when serving dynamic content)
> will add a charset parameter to unlabeled text/html content based upon
> how they have been configured to scan for cross-site scripting.  They
> do so specifically because of known bugs in browsers that sniff the
> content for bizarre charsets that bypass the resource's security
> assumptions and cause the browser's user to fall victim to stupid XSS
> attacks.

I know of some cases of this attack, but I would appreciate more detailed
references if you have them.
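
For readers following along, the server-side labeling described in the
quoted paragraph could look roughly like the sketch below. The function
name and the "configured_charset" setting are hypothetical; the point is
only that an explicit charset parameter leaves the UA nothing to sniff.

```python
def label_text_type(content_type, configured_charset="utf-8"):
    """Add a charset parameter to an unlabeled text/* Content-Type.

    configured_charset is a hypothetical server setting: the charset the
    server's XSS filtering was configured to assume.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if not media_type.startswith("text/"):
        return content_type   # not text/*: leave untouched
    params = [p.strip() for p in content_type.split(";")[1:]]
    if any(p.lower().startswith("charset=") for p in params):
        return content_type   # already labeled: leave untouched
    return content_type.rstrip("; ") + "; charset=" + configured_charset
```

For example, label_text_type("text/html") gives
"text/html; charset=utf-8", while an already labeled or non-text type is
returned unchanged.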

>
> That allows HTTP/1.1 compliant serving today to remain compliant
> after the change, and addresses all of the interoperability issues
> in regard to mislabeled content without ignoring the fact that the
> main reason they are mislabeled today is to work around existing
> bugs.  For all other cases, the charset can and should be labeled
> correctly.

I agree with your conclusion, but I'm fuzzy on the spec text it would
lead to. Do you have specific wording in mind?

- Rob

Received on Wednesday, 13 February 2008 22:01:25 UTC