Re: review of content type rules by IETF/HTTP community from Boris Zbarsky on 2007-08-24 (public-html@w3.org from August 2007)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Fri, 24 Aug 2007 15:50:26 -0500
To: Robert Burns <rob@robburns.com>
CC: public-html@w3.org
Message-ID: <46CF4492.6000303@mit.edu>
Robert Burns wrote:

I agree that investigating this might be worthwhile.  Some comments:

> 1) what is the source of author error in setting file type information

The main sources I've seen are ignorance and inability to affect the 
server configuration, with the latter being very common.

This combines with web servers that default unknown files to types that 
the majority of UAs ignore and sniff instead (e.g. text/plain), producing 
a situation where there is little incentive for authors either to learn 
or to seek out hosting providers that allow types to be set on the 
server side.

For example, nearly all OS X disk image files I've run into on the web 
have been served as text/plain.  And it seems that the people serving 
them are OK with this.

The ignorance aspect shows up in two common ways: allowing the web 
server to send its default type for the data, and generating all 
documents with a server-side solution that sends out a default MIME 
type unless overridden (e.g. PHP often defaults to text/html).  This 
way you get stylesheets, scripts, etc. served as text/html.
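
To make the second case concrete, here is a minimal sketch of a 
server-side script that sets the type explicitly for each resource 
instead of relying on a framework default.  It is written as a Python 
WSGI application purely for illustration; the RESOURCES table, the 
paths, and the bodies are hypothetical and not taken from any real site.

```python
# Hypothetical table mapping each generated resource to the MIME type
# it should be served with, plus an example body.
RESOURCES = {
    "/index.html": ("text/html", b"<!DOCTYPE html><title>Hi</title>"),
    "/style.css": ("text/css", b"body { margin: 0 }"),
    "/script.js": ("text/javascript", b"console.log('hi');"),
}


def app(environ, start_response):
    """WSGI callable that always sends an explicit Content-Type."""
    path = environ.get("PATH_INFO", "/")
    if path not in RESOURCES:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    mime_type, body = RESOURCES[path]
    headers = [
        ("Content-Type", mime_type),  # overrides any framework default
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The point of the sketch is only that the type is chosen per resource by 
the author; without that step, everything the script emits goes out with 
whatever default the environment picks.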

>  • For local files, filename extensions have become the nearly universal 
> practice for setting file type handling in local file systems.

Which is unfortunate, since extensions do not uniquely determine type 
(the "rpm" and "xml" extensions are good examples).  In practice, 
various operating systems and applications use data other than the 
extension to decide what to do with the data (content sniffing, 
generally, but OS X 10.4 and later has a way of tagging files with type 
information independent of the filename or raw data bytes).
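
A small sketch of why the extension alone is not enough, using "rpm" as 
the example.  The two candidate types and the helper name resolve_type 
are my own illustrative choices: "rpm" has been used both for RPM 
package files, which start with the bytes ED AB EE DB, and for a 
RealAudio plugin format.  This is not how any particular OS or 
application actually does it.

```python
# Illustrative only: one extension, two plausible types.
RPM_PACKAGE = "application/x-rpm"
REALAUDIO_PLUGIN = "audio/x-pn-realaudio-plugin"

# Leading bytes of an RPM package file.
RPM_MAGIC = b"\xed\xab\xee\xdb"


def resolve_type(filename, head):
    """Pick a type for `filename` given the first bytes of its content.

    Returns None when the extension is not one this sketch knows about.
    """
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext != "rpm":
        return None
    # The extension is ambiguous, so fall back to looking at the data.
    if head.startswith(RPM_MAGIC):
        return RPM_PACKAGE
    return REALAUDIO_PLUGIN
```

Here the extension only narrows the choice to two candidates; the 
content has to be consulted to pick between them.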

> It would be useful to determine if authors make any significant number of errors 
> in setting filename extensions.

Insofar as extensions are usable for type identification given the 
issues above, generally no, in my experience.  Of course that doesn't 
help dynamically-generated content.

> a missing filename extension (needed once the file goes to the 
> server or is accessed using HTTP).

For what it's worth, Apache does have a mode where it will use 
content-sniffing instead of filename extensions to determine the types 
it will send.  It's not enabled by default (which is too bad), and I 
don't know how commonly it's used.

> I suspect the server mapping issue is a big source of the problem.

Absolutely.  It's particularly a problem when a "new" extension that the 
server is not aware of appears or becomes popular.

> Typically the problem may 
> occur when new file formats become common where the server has been 
> installed and configured long before those formats (and their associated 
> filename extensions) came onto the scene.

As a UA developer, I can say that this is in fact the situation that 
forced Gecko to add text/plain sniffing.
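
For readers unfamiliar with what text/plain sniffing means, here is a 
rough sketch of the general idea.  It is not Gecko's actual algorithm; 
the set of allowed control bytes, the function names, and the fallback 
type are all assumptions made for illustration.

```python
# Control bytes that commonly appear in genuine text:
# tab, line feed, form feed, carriage return, escape.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}


def looks_binary(head):
    """True if the leading bytes contain control characters unlikely in text."""
    return any(b < 0x20 and b not in ALLOWED_CONTROLS for b in head)


def effective_type(declared, head):
    """Return the type to act on, given the declared type and leading bytes.

    Only a declared text/plain is second-guessed; every other declared
    type is passed through unchanged.
    """
    base = declared.split(";", 1)[0].strip().lower()
    if base == "text/plain" and looks_binary(head):
        # Treat as unknown binary data rather than rendering it as text.
        return "application/octet-stream"
    return declared
```

For example, effective_type("text/plain", b"hello\nworld") leaves the 
declared type alone, while a text/plain response whose first bytes 
include NUL bytes is treated as unknown binary data.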

> For example, if we explore these issues, and determine that filename 
> extensions nearly universally reflect the authors' intentions, then 
> perhaps content sniffing is not the way to go

Filename extensions don't cover dynamically generated content, for which 
the relevant extensions are typically "php", "asp", "pl", "exe", "cgi".

>  A. identifying how current browsers handle content based on
...
> I think having this information might help focus the conversation better.

Absolutely.

-Boris
Received on Friday, 24 August 2007 21:04:14 UTC