Re: Unknown text/* subtypes [i20]

On Feb 12, 2008, at 4:12 PM, Roy T. Fielding wrote:
>
> the Web consists of dozens of different charsets,
> most of which are left unlabeled because there is no commonly accepted
> way of indicating charsets in filename metadata (and no real need to
> anyway, since user agents will either sniff the content anyway or just
> assume everything is in the fixed local charset known by the tool).
>

Fully agree.


> Servers, OTOH, send text/* content with the assumption that it will be
> treated as iso-8859-1 (or at least some safe superset of US-ASCII).

Somewhat disagree. I think many servers assume that UAs will sniff
and deal with the issue for them.
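
To be concrete about what I mean by "sniff": the heuristics I've seen
amount to checking for a BOM, then scanning the first KB or so for a
meta declaration, then falling back to a local default. A rough sketch
(mine, not any particular UA's actual algorithm):

    import re

    def sniff_charset(body: bytes) -> str:
        # A byte-order mark is unambiguous, so check it first.
        if body.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'
        if body.startswith(b'\xff\xfe'):
            return 'utf-16-le'
        if body.startswith(b'\xfe\xff'):
            return 'utf-16-be'
        # Otherwise look for an HTML meta declaration near the top.
        m = re.search(rb'<meta[^>]+charset=["\']?([\w.:-]+)', body[:1024], re.I)
        if m:
            return m.group(1).decode('ascii', 'replace').lower()
        # Failing that, fall back to the local default Roy mentions.
        return 'iso-8859-1'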

>
> Servers don't sniff content because they can't -- it is impossible to
> look at every byte of a page while handling 7,000 reqs/sec, let alone
> the 20,000 reqs/sec that a decently tuned server can handle.  In
> addition, some servers (particularly when serving dynamic content)
> will add a charset parameter to unlabeled text/html content based
> upon how they have been configured to scan for cross-site scripting.
> They do so specifically because of known bugs in browsers that sniff
> the content for bizarre charsets that bypass the resource's security
> assumptions and cause the browser's user to fall victim to stupid
> XSS attacks.

I know of some instances of this attack, but I would appreciate more
detailed references if you have them.
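
The instance I have in mind is the UTF-7 one: if a response carries no
charset parameter, a browser that sniffs (IE is the usual example) can
be coaxed into decoding attacker-supplied ASCII as UTF-7, so markup
survives the usual escaping. Illustrative only:

    # bytes that pass an ASCII-only HTML escaper untouched...
    payload = b'+ADw-script+AD4-alert(1)+ADw-/script+AD4-'
    # ...but that decode to live markup if the browser guesses UTF-7
    print(payload.decode('utf-7'))   # -> <script>alert(1)</script>

which is presumably why the servers you describe pin the charset
rather than leave it to the UA.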

>
> That allows HTTP/1.1 compliant serving today to remain compliant
> after the change, and addresses all of the interoperability issues
> in regard to mislabeled content without ignoring the fact that the
> main reason they are mislabeled today is to work around existing
> bugs.  For all other cases, the charset can and should be labeled
> correctly.

I agree with your conclusion, but I'm fuzzy on the spec text it would
lead to. Do you have specific wording in mind?
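
The "labeled correctly" case at least seems easy to state by example;
I assume it reduces to the sender being explicit, e.g. one Apache
directive:

    AddDefaultCharset UTF-8

which yields

    Content-Type: text/html; charset=UTF-8

It's the unlabeled text/* case where I'd like to see the actual
wording.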

- Rob

Received on Wednesday, 13 February 2008 22:01:25 UTC