[contentTypeOverride-24] MSDN documentation on MIME sniffing in IE from Chris Lilley on 2003-04-06 (www-tag@w3.org from April 2003)

From: Chris Lilley <chris@w3.org>
Date: Sun, 6 Apr 2003 17:46:30 +0200
To: www-tag@w3.org
Message-ID: <78696978078.20030406174630@w3.org>
Hello,

  This MSDN documentation on the FindMimeFromData method (used by MS
  Internet Explorer for Windows, version 4.0 and later) seemed
  directly relevant to issue contentTypeOverride-24.

  http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
  http://msdn.microsoft.com/library/default.asp?url=/workshop/networking/moniker/overview/appendix_a.asp

  Because the MSDN site seems to move content around frequently, I
  have quoted the text of that page in this message for archival
  purposes (for personal and non-commercial use, naturally).

  I note in passing that the MIME types for PNG (image/png) and
  Progressive JPEG (image/jpeg) are incorrectly given as image/x-png
  and image/pjpeg respectively and that sniffing on the
  application/x-gzip-compressed type breaks HTTP Content-Encoding
  headers.

  I also note reports that for ASP and JSP pages that return non-HTML
  results, faking a 'filename extension' is often required, implying
  that the algorithm below is not complete (an 'unknown' mimetype
  being processed by stage 4 of the algorithm).

  http://www.svg.org/wiki/ow.asp?MimeType
  
  "The main reason this is propagated is that IE (prior to v6) does
  not behave properly in respect to SVG and MIME types. IE (prior to
  v6) treats SVG as SVG when, and only when, the url ends in .svg, it
  doesn't care about MIME.

  Netscape and (I assume) other browsers require that the server send
  the MIME type.

  So, in order to have all browsers accept your content properly: make
  sure that all URLS end in .svg, including generated content (for
  Internet Explorer -- for example, using a dummy parameter as in
  "http://mysvg.jsp?ielikes=.svg"); and set the MIME type on the
  server."

  
----8<--------------


Appendix A: MIME Type Detection in Internet Explorer
--------------------------------------------------------

The purpose of MIME type detection, or datasniffing, is to determine
the MIME type (also known as content type or media type) of downloaded
content using information from the following four sources:

- The server-supplied MIME type, if available

- An examination of the actual contents associated with a downloaded URL

- The file name associated with the downloaded content (assumed to be
derived from the associated URL)

- Registry settings (file extension/MIME type associations or
registered applications) in effect during the download

In Microsoft® Internet Explorer 4.0 and later, MIME type determination
occurs in URL monikers through the FindMimeFromData method.
Determining the MIME type allows URL monikers and other components to
find and launch the correct object server or application to handle the
associated content. This section provides a brief summary of the logic
used in determining the MIME type from these sources, and also
discusses some of the issues involved.

FindMimeFromData contains hard-coded tests for (currently 26) separate
MIME types (see Known MIME Types). This means that if a given buffer
contains data in the format of one of these MIME types, a test exists
in FindMimeFromData that is designed (by scanning through the buffer
contents) to recognize the corresponding MIME type. A MIME type is
known if it is one of these 26 MIME types. A MIME type is ambiguous if
it is 'text/plain', 'application/octet-stream', an empty string, or
null (that is, the server failed to provide it). A MIME type that is
neither known nor ambiguous is termed unknown. The MIME types
'text/plain' and 'application/octet-stream' are termed ambiguous
because they generally do not provide clear indications of which
application or CLSID should be associated as the content handler. A
MIME type inferred from any one of the four possible sources can be
categorized into one of these three classifications.
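
The three-way classification above can be sketched in a few lines of
Python. This is only an illustration, not the actual URL moniker code:
the function name and the normalization (trimming whitespace and
lower-casing) are assumptions, while the set of known types is copied
from the "Known MIME Types" list in this appendix.

```python
# The 26 types for which the appendix says hard-coded tests exist.
KNOWN_MIME_TYPES = frozenset({
    "text/richtext", "text/html", "audio/x-aiff", "audio/basic",
    "audio/wav", "image/gif", "image/jpeg", "image/pjpeg", "image/tiff",
    "image/x-png", "image/x-xbitmap", "image/bmp", "image/x-jg",
    "image/x-emf", "image/x-wmf", "video/avi", "video/mpeg",
    "application/postscript", "application/base64",
    "application/macbinhex40", "application/pdf",
    "application/x-compressed", "application/x-zip-compressed",
    "application/x-gzip-compressed", "application/java",
    "application/x-msdownload",
})

# Types that do not identify a content handler; None is handled below.
AMBIGUOUS_MIME_TYPES = frozenset({"text/plain", "application/octet-stream", ""})


def classify_mime_type(mime_type):
    """Return 'known', 'ambiguous', or 'unknown' (hypothetical helper)."""
    if mime_type is None:
        return "ambiguous"
    # Assumption: comparison ignores case and surrounding whitespace.
    normalized = mime_type.strip().lower()
    if normalized in AMBIGUOUS_MIME_TYPES:
        return "ambiguous"
    if normalized in KNOWN_MIME_TYPES:
        return "known"
    return "unknown"
```

Under this sketch, a type such as image/svg+xml is classified as
unknown, while a missing Content-Type header is classified as ambiguous.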

FindMimeFromData typically receives three parameters when invoked: the
cache file name (assumed to be derived from the associated URL), a
pointer to a buffer containing up to the first 256 bytes of the
content, and a "suggested" MIME type that typically corresponds to the
server-provided MIME type (through the Content-Type header).
Determining the MIME type proceeds as follows:


If the "suggested" (server-provided) MIME type is unknown (not known
and not ambiguous), FindMimeFromData immediately returns this MIME
type as the final determination. The reason for this is that new MIME
types are continually emerging, and these MIME types might have
formats that are difficult to distinguish from the set of hard-coded
MIME types for which tests exist. A good example of this is SGML,
which can easily be classified incorrectly as HTML because it contains
many of the same tags. Rather than weakening the hard-coded tests or
risk incorrectly classifying new and as-yet-unknown MIME types for
hard-coded known ones, priority is given to the server-supplied MIME
type if it is unknown, since these MIME types are both specific and
likely uncommon, and there are no hard-coded tests that can positively
identify them.

If the server-provided MIME type is either known or ambiguous, the
buffer is scanned in an attempt to verify or obtain a MIME type from
the actual content. If a positive match is found (one of the
hard-coded tests succeeded), this MIME type is immediately returned as
the final determination, overriding the server-provided MIME type
(this type of behavior is necessary to identify a .gif file being sent
as text/html). During scanning, it is determined if the buffer is
predominantly text or binary.
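
A rough Python sketch of such a data scan follows. The appendix does
not publish the hard-coded tests, so everything here is an assumption:
the function names, the three sample signatures (the well-known leading
bytes of GIF, PNG, and PDF files), and the text/binary heuristic (no
NUL bytes and at least 90% printable ASCII or whitespace).

```python
# Illustrative leading-byte signatures, paired with the type names
# used in this appendix. The real tests are not documented.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/x-png"),
    (b"%PDF-", "application/pdf"),
]


def looks_like_text(head):
    """Guess whether a buffer is predominantly text (assumed heuristic)."""
    if not head:
        return True
    if b"\x00" in head:
        return False
    printable = sum(1 for b in head if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(head) >= 0.9


def scan_buffer(buffer):
    """Return (matched MIME type or None, is_text) for the first 256 bytes."""
    head = buffer[:256]
    is_text = looks_like_text(head)
    for prefix, mime_type in SIGNATURES:
        if head.startswith(prefix):
            return mime_type, is_text
    return None, is_text
```

With this sketch, a buffer that begins with "GIF89a" yields image/gif
regardless of the server-provided type, which is the override behavior
described above.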

If no positive match is obtained during the data scan, and if the
server-provided MIME type is known, an attempt is made to determine if
the format (text or binary) of the known MIME type conflicts with the
format (text or binary) that was determined from scanning the buffer.
If no conflict exists (the data scan indicates primarily text and the
server-provided MIME type has a text format, or the data scan
indicates binary and the server-provided MIME type is a binary
format), the server-provided MIME type is returned. The reasoning
behind this is that new formats of MIME types might be added over time
(image/tif is one example) and the hard-coded tests might not
recognize these new formats (a different pattern match might be
required). With the assumption that the basic format of MIME types
will not change over time from primarily text to binary or vice-versa,
it will suffice that the formats of the server-provided MIME type and
the format found from scanning the data do not disagree. If this is
the case, the server-provided MIME type is returned. The format types
for known MIME types are stored in a media information structure in
URL monikers.
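
The format agreement test can be sketched as below. The contents of the
media information structure are not given in this appendix, so the
table here is a hypothetical stand-in covering a handful of types whose
text or binary nature is uncontroversial; the function name is also an
assumption.

```python
# Hypothetical stand-in for the media information structure:
# True means the type has a text format, False means binary.
KNOWN_TYPE_IS_TEXT = {
    "text/html": True,
    "text/richtext": True,
    "image/gif": False,
    "image/tiff": False,
    "application/pdf": False,
}


def accept_server_type(server_type, scan_is_text):
    """Return the server-provided type if its format agrees with the
    scan result, otherwise None (meaning: continue with later steps)."""
    expected_is_text = KNOWN_TYPE_IS_TEXT.get(server_type)
    if expected_is_text is None:
        return None  # not a known type in this sketch
    if expected_is_text == scan_is_text:
        return server_type
    return None
```

For example, image/tiff served with data that scans as binary is
accepted even though no hard-coded test matched, whereas image/tiff
served with data that scans as text is not.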

If no positive match is obtained during the data scan, and either the
server-provided MIME type is ambiguous, or it is known but the data
format agreement test in the previous step failed, an attempt is made
to parse a file extension from the file name passed in. If this is
successful, an attempt is made to find the
MIME type associated with the file extension in the registry. This
will be returned as the final determination if the MIME type
associated with the file extension is unknown. The reason for this
added requirement is as follows: If the file extension yields an
ambiguous MIME type, this adds no information to what was already
obtained through scanning the data. If the file extension yields a
known MIME type, this MIME type should have been found during
scanning. Since it was not, it is suspect, and is rejected. An example
of this is an arbitrary plain-text file being returned through an
ISAPI dynamic-link library (DLL), with the server returning
'text/plain' as the MIME type. Since the server-provided MIME type is
ambiguous, a scan of the data is conducted that only confirms that the
data is plain text. Subsequently, the file name is parsed for an
extension. In this case, because the contents were downloaded using an
ISAPI DLL, the URL and hence the cache file name will have a .dll file
extension that has the MIME type 'application/x-msdownload' associated
in the registry. This MIME type was already scanned for
(application/x-msdownload is a known MIME type), was not found, and is
therefore the wrong determination (this results in a file download as
opposed to the desired behavior, which is to display the text
in-pane).
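
The extension rule can be sketched in Python with a plain dictionary
standing in for the registry. The function name, the dictionary, and
the .svg entry are assumptions made for illustration; the .dll entry
comes from the example above. Only a subset of the known types is
listed, so a full version would use the complete "Known MIME Types"
list.

```python
import os.path

# Subset of the known types, enough for this illustration.
KNOWN_SUBSET = frozenset({
    "text/html", "image/gif", "image/jpeg", "application/pdf",
    "application/x-msdownload",
})
AMBIGUOUS = frozenset({"text/plain", "application/octet-stream", ""})


def mime_from_extension(file_name, extension_map):
    """Return the MIME type registered for the file extension, but only
    if that type is unknown (neither known nor ambiguous)."""
    _, ext = os.path.splitext(file_name)
    if not ext:
        return None
    candidate = extension_map.get(ext.lower())
    if candidate is None or candidate in AMBIGUOUS or candidate in KNOWN_SUBSET:
        return None
    return candidate


# Hypothetical extension-to-type associations.
SAMPLE_MAP = {
    ".dll": "application/x-msdownload",
    ".txt": "text/plain",
    ".svg": "image/svg+xml",
}
```

Here mime_from_extension("query.dll", SAMPLE_MAP) returns None, which
matches the ISAPI example: the known type found through the extension
is rejected because the scan did not confirm it.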

If all of the preceding steps have failed to yield an unambiguous MIME
type, a last check is made to see if any application is associated in
the registry with the file extension parsed from the file name, if one
exists. If an associated application is found, the final determination
is automatically set to 'application/octet-stream'. This default value
ensures that the registered application will be launched by the shell
with the downloaded data, rather than displaying the data in-pane. As
an example, this is necessary when downloading, among others, .bat and
.cmd files, which are plain text files, are frequently identified by
the server as 'text/plain', and have no associated MIME type in the
registry. Without the final check for an associated application, these
would be displayed in-pane, whereas the desired behavior is to launch
the command interpreter. This is ensured by checking for an associated
application, and defaulting to the final determined MIME type of
'application/octet-stream'. Other types of files, such as .reg files,
behave similarly.

Finally, if no file extension is found, or one is found with no
associated MIME type or registered application, the MIME type
'text/plain' is returned if the data scan indicated predominantly
text, or 'application/octet-stream' if the data scan indicated binary,
since this is the furthest correct determination that could be made.
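
The last two steps can be sketched as follows. The function name and
the set of extensions with a registered application are hypothetical;
the sketch also assumes that an extension with no registered
application always falls through to the text or binary default.

```python
import os.path


def fallback_mime_type(file_name, scan_is_text, extensions_with_app):
    """Final determination when no earlier step produced a result."""
    _, ext = os.path.splitext(file_name)
    if ext and ext.lower() in extensions_with_app:
        # A registered application exists: hand the data to the shell
        # instead of displaying it in-pane.
        return "application/octet-stream"
    return "text/plain" if scan_is_text else "application/octet-stream"


# Hypothetical set of extensions that have a registered application.
EXTENSIONS_WITH_APP = {".bat", ".cmd", ".reg"}
```

Under these assumptions, a .bat file served as text/plain ends up as
application/octet-stream, while an extensionless text download ends up
as text/plain.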

Known MIME Types

Hard-coded tests exist for the following MIME types that currently
exist in URL Moniker.

text/richtext 
text/html 
audio/x-aiff 
audio/basic 
audio/wav 
image/gif 
image/jpeg 
image/pjpeg 
image/tiff 
image/x-png 
image/x-xbitmap 
image/bmp 
image/x-jg 
image/x-emf 
image/x-wmf 
video/avi 
video/mpeg 
application/postscript 
application/base64 
application/macbinhex40 
application/pdf 
application/x-compressed 
application/x-zip-compressed 
application/x-gzip-compressed 
application/java 
application/x-msdownload 

Registry Locations

Location used by FindMimeFromData to find MIME type and progID from
file extension:

HKEY_CLASSES_ROOT\.***

Location used by FindMimeFromData to find application from progID:

HKEY_CLASSES_ROOT\<ProgId>\shell\open\command

Location used by URL monikers to find CLSIDs from MIME types:

HKEY_CLASSES_ROOT\MIME\Database\Content Type
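
For readers who want to inspect these locations, the following Python
sketch reads the first two of them with the standard winreg module. It
is not how FindMimeFromData itself is implemented. The helper names are
hypothetical, and the sketch assumes the usual layout in which the
extension key holds the progID as its default value and the MIME type
in a value named "Content Type". On non-Windows systems, where winreg
is unavailable, every lookup simply returns None.

```python
try:
    import winreg  # standard library, Windows only
except ImportError:
    winreg = None


def read_classes_root_value(subkey, value_name=""):
    """Read one value under HKEY_CLASSES_ROOT; "" means the default value."""
    if winreg is None:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, subkey) as key:
            value, _value_type = winreg.QueryValueEx(key, value_name)
            return value
    except OSError:
        return None  # key or value does not exist


def lookup_extension(ext):
    """Return (mime_type, prog_id, open_command) for an extension like '.svg'."""
    mime_type = read_classes_root_value(ext, "Content Type")
    prog_id = read_classes_root_value(ext)
    command = None
    if prog_id:
        command = read_classes_root_value(prog_id + r"\shell\open\command")
    return mime_type, prog_id, command
```

Calling lookup_extension(".bat") on a Windows machine would be expected
to show a progID and an open command; the exact values depend on the
machine's configuration.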
 


-- 
 Chris                          mailto:chris@w3.org
Received on Sunday, 6 April 2003 11:46:34 UTC