Re: Content Sniffing impact on HTTPbis - #155

On Sat, Jun 13, 2009 at 11:15 AM, Jamie Lokier<jamie@shareable.org> wrote:
> Adam Barth wrote:
>> Do you have evidence for this belief?  It should be fairly easy to
>> determine by looking at the source code.
>
> It's easy to determine by simply trying it.
>
> I've just created a small file with this content (not indented):
>
>    <html><head></head><body>
>    Hello, I am <b>HTML</b>
>    </body></html>
>
> If it's called test.html, it will display as HTML.
> If it's called test.txt, it will display as plain text.
> ==> If it's called test.foo, it will display as HTML.
> ==> If it's called just test (no extension), it will display as HTML.
>
> But if we change the file slightly, putting a single character x in
> front like this:
>
>    x<html><head></head><body>
>    Hello, I am <b>HTML</b>
>    </body></html>
>
> If it's called test.html, it will display as HTML.
> If it's called test.txt, it will display as plain text.
> ==> If it's called test.foo, it will display as plain text.
> ==> If it's called just test (no extension), it will display as plain text.
>
> Therefore Firefox (3.0.10) does sniff a local file to determine how to
> display it, and the sniffing algorithm (or whether to apply it) _does_
> depend on the file extension.

Thanks for running these tests.  This is very helpful.  I don't agree
with your conclusion, however.  Here's an alternate explanation:

1) The "file" protocol handler uses OS-specific mechanisms to
determine the media type of the file.  (In this case, the OS uses the
extension).

2) After receiving the response from the file protocol handler, the
sniffing algorithm decides whether or not to sniff based on the media
type.  (In this case, the sniffing algorithm runs only if the media
type is unknown.)

3) Your first example triggers the HTML heuristic in the sniffing
algorithm.  Your second example does not and falls back to
text-or-binary.

I believe this theory accounts for all the behavior you report above.
Under this explanation, we see that the sniffing algorithm does not
use the file extension.
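
The three steps above can be sketched in Python as follows.  This is
only an illustration of the proposed explanation, not Firefox's actual
code: the EXTENSION_TYPES table is a hypothetical stand-in for whatever
OS-specific lookup the file handler performs, and sniff() is a toy
heuristic (HTML tag at the start, otherwise plain text) rather than the
rules in draft-abarth-mime-sniff.  The binary branch of text-or-binary
is omitted.

```python
import os

# Hypothetical stand-in for the OS lookup; real mappings are platform-specific.
EXTENSION_TYPES = {".html": "text/html", ".txt": "text/plain"}


def file_handler_media_type(filename):
    """Step 1: the file handler reports a type from the extension, or None."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower())


def sniff(data):
    """Step 3: toy HTML heuristic with a plain-text fallback (assumed rules)."""
    if data.lstrip().lower().startswith(b"<html"):
        return "text/html"
    return "text/plain"


def effective_type(filename, data):
    """Step 2: sniff only when the handler reported no media type."""
    reported = file_handler_media_type(filename)
    return reported if reported is not None else sniff(data)
```

Note that sniff() never sees the filename.  Under this model the
extension only influences whether sniffing runs at all, which matches
the local-file results reported above for test.html, test.txt, test.foo
and test.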

> It's clear from trying it that Firefox applies a sniffing algorithm to
> local files, and either it is influenced by the file's extension, or
> decides whether to apply the algorithm at all depending on the
> extension.

I believe the second account is a clearer way to understand what's going on.

> I don't know what it does with FTP, but I wouldn't be surprised if
> it's the same as local files.

We should test to be sure.

> Now, let's get back to HTTP.  I've done the same test as above with
> HTTP in the same Firefox.
>
> If the Content-Type is text/plain or text/html, then Firefox honours
> the Content-Type, independent of whether the content has "x" at the
> start in these two test files.

This is consistent with draft-abarth-mime-sniff.

> If the Content-Type is application/octet-stream, then Firefox does
> different things depending on the URL's file extension.  If it ends
> with .html, Firefox shows an error dialog(!), otherwise it offers to open
> the file in an application of your choice.

I'm not able to reproduce this error dialog.  Can you test again?

> If the Content-Type is blank, because I couldn't persuade Apache to
> omit it completely, then Firefox behaviour depends on the URL's file
> extension.

I recommend against using Apache to conduct these tests.  You can
achieve much more accurate results using netcat because you can
control precisely which bytes you send to the client.
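
Where netcat is not available, the same byte-level control can be had
from a few lines of Python.  The sketch below is an assumption about
how such a harness might look, not the setup used for the results in
this message: build_response() and serve_once() are hypothetical
helper names, and the port is arbitrary.  The point is that the
Content-Type header is absent unless you explicitly ask for it.

```python
import socket


def build_response(body, content_type=None):
    """Return raw HTTP response bytes; no Content-Type unless one is given."""
    lines = [
        b"HTTP/1.1 200 OK",
        b"Content-Length: %d" % len(body),
        b"Connection: close",
    ]
    if content_type is not None:
        lines.append(b"Content-Type: " + content_type.encode("ascii"))
    return b"\r\n".join(lines) + b"\r\n\r\n" + body


def serve_once(payload, host="127.0.0.1", port=8080):
    """Accept one connection, ignore the request, and send payload verbatim."""
    with socket.create_server((host, port)) as server:
        conn, _ = server.accept()
        with conn:
            conn.recv(65536)  # read and discard the browser's request
            conn.sendall(payload)
```

For example, run serve_once(build_response(b"x<html>...")) and load
http://127.0.0.1:8080/test.foo in the browser; the path in the URL can
be varied freely because the server ignores it.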

>    <html><head></head><body>
>    Hello, I am <b>HTML</b>
>    </body></html>
>
> If it's called http://.../test.html, it will display as HTML.
> If it's called http://.../test.txt, it will display as plain text.
> If it's called http://.../test.foo, it will display as plain text.
> If it's called http://.../test, it will display as plain text.

I'm not able to replicate these results.  When I omit the Content-Type
header (using netcat), these bytes render as HTML regardless of file
extension.

>    x<html><head></head><body>
>    Hello, I am <b>HTML</b>
>    </body></html>
>
> If it's called http://.../test.html, it will display as plain text.
> If it's called http://.../test.txt, it will display as plain text.
> If it's called http://.../test.foo, it will display as plain text.
> If it's called http://.../test, it will display as plain text.

I can replicate these results.

> As you see, Firefox applies a similar sniffing test in these examples
> to decide whether to treat the resource as HTML or plain text, and it
> does use the URL's file extension in making its decision.
>
> However, it doesn't use quite the same algorithm as for local files,
> as you can see from the .html and .foo extension differences.

I suspect there's something strange going on with your Apache testing
harness.  I'd recommend testing again with netcat.

> In the bigger picture, my point is that sniffing is used in practice,
> in a major browser, for local files as well as HTTP (and FTP but not
> shown here), and the decision about _whether_ to use it (at least)
> does depend on the file extension for HTTP as well as for local files.

I disagree.  The decision about whether to use it depends on the media
type reported by the protocol handler.  How the protocol handler
determines the media type is up to it.  For HTTP, it looks at the
Content-Type header.  For files, Firefox apparently consults the OS.

> It would be good to document and standardise when the sniffing
> algorithm is applied, dependent on file/URL extensions, for the same
> reason that it is good to document and standardise what the sniffing
> algorithm is.

If you'd like browsers to interoperate on local HTML files, then you
have much bigger problems than precisely when to apply the sniffing
algorithm.  That's a task for another day.  In any case, no changes to
the sniffing algorithm spec will be required.  We'll only have to spec
how the file protocol handler determines the media type.

> I don't know from these tests if the sniffing is simply switched on/off
> depending on file extensions or if it is influenced in a more
> fine-grained way.

We'd likely gain additional insight from reading the code.

Adam

Received on Saturday, 13 June 2009 18:37:10 UTC