Re: NEW ISSUE: content sniffing from Adam Barth on 2009-04-02 (ietf-http-wg@w3.org from April to June 2009)

From: Adam Barth <w3c@adambarth.com>
Date: Thu, 2 Apr 2009 16:22:14 -0700
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <7789133a0904021622m76d37882o6c4c39d25590b1a5@mail.gmail.com>

On Thu, Apr 2, 2009 at 2:32 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> Maybe the implementors of Imageshop will read the thread and
> understand that the media type of a message is not the same
> thing as the data format of a message.  The media type (what was
> called MIME type ages ago) is a processing instruction supplied
> by the sender.  The media type cannot be discerned by looking at
> the bits.  The data format can sometimes be discerned by looking
> at the bits, which is a reasonable fallback behavior depending on
> the context in which the request was made.

You're ignoring the reality of existing Web content.  To interoperate
with existing Web content, a user agent must consider both the
Content-Type headers and the content when determining the media type
contained in a response.  To claim otherwise is fantasy.

> It is impossible to sniff for a media type because any given
> data format matches at least two or more media types.

This is true in general.  However, existing Web content assumes that
user agents will override their specified media types in certain
cases.  For example, suppose a user agent receives the following HTTP
response from the Web.

Content-Type: */*

GIF89a....

If this user agent wishes to interoperate with this server, the user
agent should use the media type image/gif when processing this
response.
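
As a minimal sketch of that kind of override (an illustration only, not
the algorithm in draft-abarth-mime-sniff; the function name, the
signature table, and the set of placeholder types below are all
assumptions made for the example):

```python
# Illustrative sketch only: NOT the algorithm from draft-abarth-mime-sniff.
# The signature table and the set of "placeholder" types are assumptions.

# Byte prefixes ("magic numbers") for a few well-known image formats.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]

# Declared types that carry no usable information (assumed list).
PLACEHOLDER_TYPES = {"", "*/*", "unknown/unknown", "application/unknown"}


def effective_media_type(content_type, body):
    """Return the media type to use for a response.

    The declared Content-Type wins unless it is missing or a placeholder,
    in which case the first bytes of the body are consulted.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    declared = (content_type or "").split(";", 1)[0].strip().lower()
    if declared not in PLACEHOLDER_TYPES:
        return declared
    for magic, media_type in SIGNATURES:
        if body.startswith(magic):
            return media_type
    # Nothing recognized: fall back to a generic binary type.
    return "application/octet-stream"


if __name__ == "__main__":
    print(effective_media_type("*/*", b"GIF89a...."))
```

With the response above, the sketch selects image/gif, while a
response that declares a concrete type such as text/html is left
alone.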

> None of those
> variables have anything to do with HTTP.  HTTP is responsible
> for communicating the sender's intentions.

Currently, the HTTP spec ignores reality and forbids user agents from
interoperating with existing Web content.

> Then fix the content metadata.  No other solution will work, period.

I agree that servers should fix their metadata.  However, not all
servers will.  If a user agent wishes to interoperate with these
servers (as many do), then we should be helpful and explain how to do
so in a reliable way.

> We would be better off if none of them sniffed.  That is the most
> interoperable solution.

Forbidding sniffing prevents user agents from interoperating with
existing Web content.

>> I'm not proposing the spec describe the error handling quirks of
>> browsers.  I'm proposing that the spec contain enough detail that
>> implementors of future user agents (if they are so inclined) can
>> determine the MIME type of HTTP responses from the Web.
>
> It already does contain everything that can be truly said about
> determining the media type.

The spec could provide a sniffing algorithm that allows user agents to
interoperate with existing Web content.

> The only thing it doesn't define is
> what the recipient should do when it detects an error or when
> no type is supplied, and the reason for that is because the behavior
> is different for every single type of recipient.

The type of recipient isn't really relevant.  What's relevant is how
existing servers expect their existing content to be interpreted.  A
server on the Web that specifies a Content-Type of "*/*" and a payload
that begins "GIF89a" expects this response to be treated as image/gif.

> The only reason
> that the HTML5 folks can pretend to answer that question is because
> they currently ignore the needs of all recipients other than the
> big general-purpose browsers.

Imageshop is not a "big general-purpose browser."

> IETF concerns >> WHATWG concerns.

I don't think it's helpful to frame this discussion in terms of
identity politics.

>> Sadly, such user agents will not be as popular as those that just work.
>
> That is a matter of opinion.  I have seen no evidence to suggest
> that MSIE bugs actually helped it in competing with other browsers.

If you asked them, I'm sure implementors of non-MSIE user agents would
tell you that sniffing is required to avoid losing users.  For example,
I recently received a bug report from a user who was unable to buy
something at BestBuy.com because Chrome did not sniff aggressively
enough in some corner case.  Had we not fixed this issue, that user
would have simply switched to another browser that worked.

> I have seen plenty of evidence that MSIE is upgraded or uninstalled
> on an institutional basis when its bugs create a liability.  The
> same will hold true for other browsers.

Compatibility is widely documented to be one of the most important
factors (if not THE most important factor) in determining whether a
user will adopt a new browser.

>> Be that as it may, as an implementor of a new user agent that would
>> like to interoperate with the Web, I would like to know how to
>> determine the MIME type of existing Web content.
>
> Read the Content-Type header field and behave accordingly. If it is
> obviously in error, then work around that error while informing
> the user.

In order to interoperate correctly, I need to know HOW to work around
the error.  If we don't specify an algorithm that works, I'll have to
reverse engineer other implementations.  This will lead to further
compatibility and security problems.

>> You're entitled to that opinion, but I don't see content sniffing
>> going away anytime soon.
>
> Time will tell.  I only document the technical solutions that
> actually work.

The algorithm described in draft-abarth-mime-sniff is a technical
solution that actually works.  If a user agent implements that
algorithm, the user agent can determine the media type of HTTP
responses and interoperate with existing Web content.

Adam
Received on Thursday, 2 April 2009 23:23:06 UTC