Re: Content Sniffing impact on HTTPbis - #155 from Adam Barth on 2009-06-05 (ietf-http-wg@w3.org from April to June 2009)

From: Adam Barth <w3c@adambarth.com>
Date: Fri, 5 Jun 2009 12:55:56 -0700
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <7789133a0906051255l2257282eta9dffd930129a651@mail.gmail.com>

On Fri, Jun 5, 2009 at 9:14 AM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> I've already mentioned the encoding extraction algorithm, but to add
> some others: in draft-abarth-mime-sniff-01 section 3 step 3's special
> handling of very particular sequences,

For better or for worse, this is the way browsers work.  I believe
this is related to historical behaviors of Apache and other HTTP
servers.  Removing these rules causes binary spew to fill up the
content area.

> the handling of unregistered and malformed values in step 5,

HTTP responses commonly contain these values and depend on user agents
sniffing the actual media type.  Removing the special handling of these
values causes compatibility problems.

> the special handling of XML types in step 6,

I believe this is to avoid step 7 applying to SVG images.

> the relevance of the implementation supporting particular types
> in step 7.

Would you prefer this step applied to all media types that begin with
"image/"?  That might let us remove step 6 as well.

> In section 4 why implementations may decide to pick any number of bytes
> between 0 and 512,

This is to avoid breaking sites that use comet
<http://en.wikipedia.org/wiki/Comet_(programming)>.

> why step 3 only applies when you have at least three
> bytes and then only compares two bytes,

We could change this to be slightly tighter, but it's a bit pedantic.

> why the UTF-32 BOM is not being detected,

We measured and determined that it was not needed for compatibility.
In general, we tried to minimize the amount of sniffing.
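
To make the two BOM points above concrete, here is a rough sketch of
what such a check could look like.  It assumes the step in question
tests for the UTF-16BE, UTF-16LE, and UTF-8 byte order marks; that is
an assumption drawn from this thread, not a quotation of the draft, and
the function name has_text_bom is hypothetical.

```python
def has_text_bom(prefix: bytes) -> bool:
    """Return True if the buffer starts with a UTF-16 or UTF-8 BOM.

    Assumption: only these three BOMs are checked; there is no separate
    UTF-32 check, mirroring the "minimize sniffing" stance in the thread.
    """
    # UTF-16BE (FE FF) and UTF-16LE (FF FE): two bytes are enough, so no
    # three-byte minimum is imposed here (the "slightly tighter" variant).
    if prefix.startswith((b"\xfe\xff", b"\xff\xfe")):
        return True
    # UTF-8 BOM (EF BB BF) needs three bytes; startswith() simply returns
    # False when the buffer is shorter than the pattern.
    return prefix.startswith(b"\xef\xbb\xbf")
```

Note that a UTF-32LE BOM (FF FE 00 00) still matches here because it
begins with the UTF-16LE bytes, whereas a UTF-32BE BOM (00 00 FE FF)
does not match at all.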

> why step four has those bytes and not others;

That's just how browsers work.  This is a point at which there is
broad interoperability already.  The costs of convincing
implementations to change this table outweigh the benefits.

> in section 6 the special handling of image/svg+xml;

This is to avoid mistakenly changing the type of an image/svg+xml
resource that happens to begin with a magic number, e.g., BM.  Again,
this is a place where we are able to minimize the amount of sniffing.
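
As a rough illustration of the image/svg+xml exemption, consider the
sketch below.  The signature table is a small hypothetical subset of
well-known magic numbers, not the table from the draft, and sniff_image
is an illustrative name only.

```python
# Hypothetical subset of image magic numbers; NOT the draft's table.
IMAGE_SIGNATURES = [
    (b"BM", "image/bmp"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_image(declared_type: str, body: bytes) -> str:
    """Return the media type to use for a response declared as an image."""
    declared = declared_type.strip().lower()
    # SVG is text, so a leading "BM" (or similar) is a coincidence rather
    # than a real magic number: never re-type it.
    if declared == "image/svg+xml":
        return declared_type
    # Only responses declared as images are candidates for image sniffing.
    if not declared.startswith("image/"):
        return declared_type
    for magic, sniffed in IMAGE_SIGNATURES:
        if body.startswith(magic):
            return sniffed
    return declared_type
```

With this sketch, sniff_image("image/svg+xml", b"BM<svg/>") keeps the
declared type, while sniff_image("image/png", b"BM...") is re-typed to
image/bmp.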

> in section 7 why the UTF-16 BOM is ignored.

The algorithm doesn't work for UTF-16 anyway.  What would be the point
of skipping over the UTF-16 BOM?  Again, this is a place we've
minimized the amount of sniffing.

> I see no justification for having a special algorithm for the charset
> parameter;

I've added a TODO to investigate whether this algorithm is still needed.

Adam

Received on Friday, 5 June 2009 19:56:54 UTC