Re: review of content type rules by IETF/HTTP community

Hi Boris,

Again, as a reminder I did create a wiki page on this issue[1].  
Responses below.

On Aug 24, 2007, at 3:50 PM, Boris Zbarsky wrote:

> I agree that investigating this might be worthwhile.  Some comments:
>
>> 1) what is the source of author error in setting file type  
>> information
>
> The main sources I've seen are ignorance and inability to affect  
> the server configuration, with the latter being very common.
>
> This combines with web servers that default unknown files to types  
> that majority UAs ignore and sniff (e.g. text/plain), to produce a  
> situation where there is little incentive for authors to either  
> learn or to seek out hosting providers that allow types to be set  
> on the server side.

One solution here might be to make use of WebDAV extended attributes
and have servers treat those with higher precedence than filename
extensions or MIMEMagic. This would be a much more sensible approach,
since authors would be setting the MIME type of files directly rather
than relying on wiring up mapping files (however those WebDAV extended
attributes are implemented on the back end).
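
A rough sketch of that precedence order, in Python. This is only an
illustration under stated assumptions: the attribute name
"content-type", the function names, and the two-entry sniffer are all
made up here, and no particular WebDAV server works this way as far as
I know.

```python
import mimetypes

# Hypothetical attribute name; a real WebDAV server would define its own.
CONTENT_TYPE_ATTR = "content-type"


def sniff(data):
    """Stand-in for a MIMEMagic-style sniffer (only two signatures)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return None


def resolve_content_type(filename, attrs, data,
                         default="application/octet-stream"):
    # 1. A type the author set explicitly on the file wins.
    explicit = attrs.get(CONTENT_TYPE_ATTR)
    if explicit:
        return explicit
    # 2. Otherwise fall back to the filename extension mapping.
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed:
        return guessed
    # 3. Finally, try sniffing the content, then give up with a default.
    return sniff(data) or default
```

With this ordering an author who cannot touch the server's mapping
files could still override the type per file, and sniffing only
happens when nothing else is known.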

> For example, nearly all OS X disk image files I've run into on the  
> web have been served as text/plain.  And it seems that the people  
> serving them are OK with this.
>
> The ignorance aspect shows up in two common ways: allowing the web  
> server to send its default type for the data, and using a server- 
> side solution that sends out a default MIME type unless overridden  
> (e.g. PHP often defaults to text/html) for generating all their  
> documents.  This way you get stylesheets, scripts, etc served as  
> text/html.

Yeah, that doesn't sound good. I guess IE-sniffing made this all  
become common practice in the early days of PHP server apps.

>>  • For local files, filename extensions have become the nearly  
>> universal practice for setting file type handling in local file  
>> systems.
>
> Which is unfortunate, since extensions do not uniquely determine  
> type (the "rpm" and "xml" extensions are good examples).

I assume by 'rpm' you're referring to it meaning RealPlayer media and
Red Hat Package Manager in different places. Part of the problem there
is the obsession with three-letter extensions. Those could be replaced
with .realplayer and .redhatpm and the problem would be fixed. I know
that's not necessarily a good solution after the fact, but it would  
be a better practice for those inventing filename extensions in the  
first place.

For .xml, I'm not sure what you're referring to. Do you mean  
that .xml can map either to text/xml or application/xml? The  
distinction in the MIME types is itself ambiguous so I don't see the  
problem with not clearly mapping to one or the other. My  
understanding is that there's an RFC that deprecates 'text/xml'[2].

Or are you referring to the fact that application/xml covers many
flavors of XML? This to me is not that much of a problem. As I've
said before in this thread, I think UAs should be capable of handling
XML sub-types as XML as well. For example, on receiving an Atom feed, a
browser should be able to switch between the feed presentation, a raw
XML tree representation, or even a raw source text representation (as
many already do).

> In practice, various operating systems and applications use data  
> other than the extension to decide what to do with the data  
> (content sniffing, generally, but OS X 10.4 and later has a way of  
> tagging files with type information independent of the filename or  
> raw data bytes).

I assume here you're talking about Mac OS X's UTIs. However, UTIs do
not tag files themselves. UTIs still rely on filename extensions (or
the generally discouraged practice of using the type code and creator
code filesystem attributes). The system uses property list
declarations within executable bundles to map the filename
extensions, MIME types, creator codes, and type codes to the proper
UTI. Tagging files with a UTI filesystem attribute might be the way
to go in the future, but I don't think a move like that would be
good unless the whole industry wants to move away from filename
extensions.

However, UTIs do raise another aspect of this similar to the XML  
flavors issue. That is that UTIs define an elaborate hierarchy of  
types where each type inherits aspects from different lines of  
inheritance[3]. So for example an rtfd (com.apple.rtfd) inherits from
both com.apple.package and public.composite-content. Those in turn
inherit from other parent types on up the inheritance chain. So a
document of type com.apple.rtfd can be treated as any parent type by
an application. This means applications that do not know how to treat  
an rtfd specifically can still be used to view or edit the rtfd file  
(in its own particular way). This is similar to the atom feed where  
the atom feed type would inherit from the public.xml UTI.

This suggests to me that these conformance hierarchies basically  
represent all of the ways an HTML author might want to treat a  
particular sub-resource. There may be some other exceptions where  
authors want to treat a file as something outside this conformance  
hierarchy, but I would think it's pretty rare.
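
The conformance lookup described above can be sketched roughly as
follows in Python. The com.apple.rtfd entry uses the two parents named
above; "org.example.atom-feed" is a made-up identifier for the Atom
analogy, and the table is deliberately incomplete (the real
declarations live in the system, not in a dictionary like this).

```python
# Direct parents for each type. Only the relationships mentioned in this
# message are listed; "org.example.atom-feed" is a hypothetical identifier.
CONFORMS_TO = {
    "com.apple.rtfd": ["com.apple.package", "public.composite-content"],
    "org.example.atom-feed": ["public.xml"],
}


def ancestors(uti):
    """Return every type that `uti` conforms to, directly or indirectly."""
    seen = set()
    stack = list(CONFORMS_TO.get(uti, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(CONFORMS_TO.get(parent, []))
    return seen


def conforms_to(uti, target):
    """True if `uti` is `target` or inherits from it along any line."""
    return uti == target or target in ancestors(uti)
```

An application that only understands public.xml could then ask
conforms_to("org.example.atom-feed", "public.xml") and handle the feed
as generic XML.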

A problem with using UTIs instead of MIME types is that MIME types
reveal their hierarchies in their names, whereas UTIs do not. UTIs
would require a method of dynamic discovery of UTI conformance
hierarchies. So while UTIs provide an excellent mechanism for
decentralized UTI creation (using one's internet domain name in
reverse order as a prefix for any UTI name), they still require the
environment to be configured and updated with proper UTI declarations.

>> It would be useful to determine if authors make any significant  
>> number of errors in setting filename extensions.
>
> Insofar as extensions are usable for type identification given the  
> issues above, generally no.  In my experience.  Of course that  
> doesn't help dynamically-generated content.

So are webapp authors not setting MIME types for their HTTP  
responses? I guess that sounds like an evangelism issue.
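
For what it's worth, setting the type from a webapp is usually a
one-line affair. Here is a minimal sketch as a Python WSGI application;
the function name and the CSS body are invented for illustration, and
other server-side environments have their own equivalents.

```python
def stylesheet_app(environ, start_response):
    """Minimal WSGI app that serves generated CSS with an explicit type."""
    body = b"body { margin: 0; }\n"
    headers = [
        # Set the type explicitly rather than relying on whatever default
        # the server or framework would otherwise send.
        ("Content-Type", "text/css; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```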

>> a missing filename extension (needed once the file goes to the  
>> server or is accessed using HTTP).
>
> For what it's worth, Apache does have a mode where it will use  
> content-sniffing instead of filename extensions to determine the  
> types it will send.  It's not enabled by default (which is too  
> bad), and I don't know how commonly it's used.

It seems to me that changing the default configuration file for
Apache could have a more profound impact than creating a new RFC
or W3C recommendation. Having it use MIMEMagic would be a
problem, however, since filename extensions would be the best mechanism
for authors to deliberately change how a file is handled. So I think
most administrators would still want the filename extension
determination to have higher priority. But if the server doesn't
recognize the filename extension, it may not recognize the file's
content either. Someone would probably need to write a comprehensive
MIMEMagic configuration file so that server admins could rely on it.
For Apache to fix those bugs would be a better approach IMO.

>> I suspect the server mapping issue is a big source of the problem.
>
> Absolutely.  It's particularly a problem when a "new" extension  
> that the server is not aware of appears or becomes popular.
>
>> Typically the problem may occur when new file formats become  
>> common where the server has been installed and configured long  
>> before those formats (and their associated filename extensions)  
>> came onto the scene.
>
> As a UA developer, I can say that this is in fact the situation  
> that forced Gecko to add text/plain sniffing.
>
>> For example, if we explore these issues, and determine that  
>> filename extensions nearly universally reflect the authors  
>> intentions, then perhaps content sniffing is not the way to go
>
> Filename extensions don't cover dynamically generated content, for  
> which the relevant extensions are typically "php", "asp", "pl",  
> "exe", "cgi".

If the dynamically generated content is a direct in-memory response
from the server without creating a separate file, then the filename
extensions would not be relevant, would they? However, if dynamically
generated content is first saved to a static file before the response
is sent, then this again looks like the problem of too short a
filename extension. These could be .phphtml and .phpcss, or
simply .html and .css.

Take care,
Rob


[1]: <http://esw.w3.org/topic/HTML/ContentTypeIssues>

[2]: <http://www.ietf.org/rfc/rfc3470.txt>

[3]:

Received on Monday, 27 August 2007 02:49:46 UTC