Re: review of content type rules by IETF/HTTP community

I think the discussion over this has been interesting and fruitful. I  
wonder if we might draw some productive issues from it.

We have a problem: file types are not labeled properly. Karl and
Leif identified one part of this issue as a disconnect between the
local practice of filename extensions and the server practice of
content headers. Sam suggested we might change those headers by
adding an "!important" indicator or something like that. Browsers
have tried to solve this problem by sniffing content (which actually
contributes to the problem, since sniffing hides metadata errors
from the authors who made them). In addition to other approaches, I
like Roy's suggestion of treating mismatches as an error (even if
we require that the error be handled gracefully).

So I think we all have a good understanding of the general problem.  
However, the specifics are the essential missing pieces from much of  
the discussion. I propose we try to focus more on those. In  
particular, 1) what is the source of author error in setting file  
type information; and 2) how do browsers currently handle type  
determination?

So here are some avenues of investigation:

Building on Karl's point about the mismatch between local and server
approaches:

  • On local file systems, filename extensions have become the nearly
universal practice for setting file type handling. It would be
useful to determine whether authors make any significant number of
errors in setting filename extensions. If they do, where does this
happen, and can we learn anything about why it happens? Is there
anything we as the HTML WG (or in cooperation with other groups) can
do to address this problem? Mac OS certainly creates opportunities
for author error here, in that its other type-setting mechanisms can
keep authors from noticing a missing filename extension (needed once
the file goes to the server or is accessed using HTTP). However,
modern Mac OS applications make it difficult to create new content
that doesn't have a proper filename extension.

  • Servers typically try to map filename extensions to content type  
headers. Servers may also be configured to provide content headers  
that are not based solely on the filename extension (or not at all).  
Is this a large source of the problem?

I suspect the server mapping issue is a big source of the problem.
That is, authors set their filename extensions correctly because they
receive immediate local OS feedback when the filename extension is
wrong (except that Mac OS provides other file type mapping
techniques, but they are seldom used for web-family resource files).

It would be useful to know if this is a big source of the problem.  
For example, it will do no good to specify a new 'important' content  
header syntax if that too will be misconfigured. Typically the
problem occurs when new file formats become common after the
server was installed and configured, long before those formats
(and their associated filename extensions) came onto the scene.
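
To make the stale-mapping scenario concrete, here is a minimal Python
sketch. It is not how any particular server works: the table
STALE_TYPE_MAP, the text/plain fallback, and the function name
content_type_for are all assumptions made up for illustration.

```python
import os

# Hypothetical extension-to-type table, standing in for a server that was
# configured before newer formats (here, Atom feeds) became common.
STALE_TYPE_MAP = {
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png",
}

# Assumed fallback for extensions the table does not know about.
DEFAULT_TYPE = "text/plain"


def content_type_for(path, type_map=STALE_TYPE_MAP, default=DEFAULT_TYPE):
    """Return the Content-Type this sketch would send for a file path."""
    ext = os.path.splitext(path)[1].lower()
    return type_map.get(ext, default)


# The author named the file correctly, but the header is still wrong
# because the server's table predates the format.
print(content_type_for("index.html"))  # known extension
print(content_type_for("feed.atom"))   # unknown extension, gets the fallback
```

In this sketch the author did nothing wrong: the extension is right, and
the mislabeling comes entirely from the server-side table.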

Investigating these two issues (filename extensions and extension to
header mapping) might require no more than discussion among the WG
members.
What do we think about mistaken filename extensions? What do we think  
about mistaken filename extension to content header mappings? Is  
there any library research or research from W3C members that might  
shed some light on the issue?

For example, if we explore these issues and determine that filename
extensions nearly universally reflect the authors' intentions, then
perhaps content sniffing is not the way to go (this is just a
hypothetical; it may not be the case). In that case, browsers that
think they're providing greater value to their users by sniffing
content are not doing so. It is the browser that treats filename
extensions, or filename extensions in combination with content
headers, as authoritative that will provide the better experience.
Sure, sniffing an image may be easy to do, but if the only time an
image has a different filename extension is when an author wants it
treated as a download (just as a semi-flawed example), then the
browser that doesn't sniff provides a better user experience. At
other times, a filename extension may be missing or unknown. There
it might make sense to turn to sniffing as another (probably rare)
fallback mechanism.
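
The hypothetical policy above (headers and extensions authoritative,
sniffing only as a last resort) could be sketched roughly as follows.
This is an illustration in Python, not a description of any browser:
the priority order, the two tables, and the names sniff and
determine_type are all assumptions.

```python
import os

# Hypothetical extension table a client might consult.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".png": "image/png",
    ".jpg": "image/jpeg",
}

# Leading byte signatures for two image formats (PNG and JPEG).
MAGIC_PREFIXES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff(body):
    """Guess a type from the first bytes of the body, or return None."""
    for prefix, mime in MAGIC_PREFIXES:
        if body.startswith(prefix):
            return mime
    return None


def determine_type(url_path, header_type, body):
    """Return (type, source) using one assumed priority order:
    Content-Type header, then filename extension, then sniffing."""
    if header_type:
        # Drop parameters such as charset and normalize case.
        return header_type.split(";")[0].strip().lower(), "header"
    ext = os.path.splitext(url_path)[1].lower()
    if ext in EXTENSION_TYPES:
        return EXTENSION_TYPES[ext], "extension"
    sniffed = sniff(body)
    if sniffed:
        return sniffed, "sniffed"
    return "application/octet-stream", "default"
```

Under this ordering, a PNG deliberately served as
application/octet-stream stays a download, and sniffing is reached only
when both the header and a known extension are missing.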

Determining the source of author error is one avenue of exploration.
The other specific that would help the conversation would be to
understand what browsers are doing now. It would be helpful for the
WG to conduct some research to find out how the latest browsers treat
content whose type indicators disagree, for all sorts of resources
and sub-resources. So I suggest we might investigate:

  A. identify how current browsers handle content based on
      -  content-type header
      -  filename extension
      -  @type attribute
      -  sniffed content
  B. determine the browser priorities for the content indicators  
listed in 'A' for:
      -  the main resource (including HTML, XML, RSS, ATOM, MPEG,  
JPEG, PNG, etc.)
      -  LINK@href  for style sheet data
      -  LINK@href  for other data
      -  SCRIPT@src
      -  OBJECT@data
      -  IMG@src
      -  IMG@longdesc
      -  @cite
      -  A@href
      -  AREA@href
      -  etc.
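
As a rough idea of the size of such a survey, the following Python
sketch enumerates test cases in which the indicators from 'A' disagree,
for each context from 'B'. The pair of types in CANDIDATE_TYPES and the
function name conflicting_cases are assumptions chosen for illustration;
a real survey would need many more types and a harness to observe what
each browser actually does.

```python
from itertools import product

# Indicators from list 'A'.
INDICATORS = [
    "content-type header",
    "filename extension",
    "@type attribute",
    "sniffed content",
]

# Contexts from list 'B'.
CONTEXTS = [
    "main resource",
    "LINK@href (style sheet)",
    "LINK@href (other)",
    "SCRIPT@src",
    "OBJECT@data",
    "IMG@src",
    "IMG@longdesc",
    "@cite",
    "A@href",
    "AREA@href",
]

# Hypothetical pair of types each indicator may claim.
CANDIDATE_TYPES = ["text/html", "image/png"]


def conflicting_cases(contexts=CONTEXTS, indicators=INDICATORS,
                      types=CANDIDATE_TYPES):
    """Yield (context, {indicator: claimed type}) for every assignment
    in which at least two indicators disagree."""
    for context in contexts:
        for assignment in product(types, repeat=len(indicators)):
            if len(set(assignment)) > 1:
                yield context, dict(zip(indicators, assignment))


cases = list(conflicting_cases())
print(len(cases))  # 14 conflicting assignments per context, 10 contexts
```

Even with only two candidate types this yields 140 cases, which
suggests the work would have to be split up or prioritized by context.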

I think having this information might help focus the conversation  
better.

Take care,
Rob

Received on Friday, 24 August 2007 12:40:39 UTC