MIME and the Web

This is a draft (for TAG ACTION-425) for discussion at an upcoming
W3C TAG meeting.  In its current form, I think it may be more suitable
as a 'note', but I'd like to get TAG (and community, including W3C and
IETF) agreement on its contents, before going on to update various
Findings, BCPs and specifications.





Origins of MIME


MIME was originally invented for email, based on general principles of
'messaging' as a foundational architecture.  The role of MIME was to
extend Internet messaging beyond ASCII-only plain text (to other
character sets, images, rich documents, etc.).  The basic architecture
of complex content messaging is:


*	Message sent from A to B.
*	Message includes some data.  Sender A includes standard
'headers' that give recipient B enough information to know how sender
A intends the message to be interpreted.
*	Recipient B gets the message, reads the 'headers' for the
data, and uses them as information on how to interpret the data.


MIME is a "tagging and bagging" specification:

*	 tagging: how to label content so the intent of how the
content should be interpreted is known
*	 bagging: how to wrap the content so the label is clear, or,
if there are multiple parts to a single message, how to combine them.
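
As a rough illustration of tagging and bagging, here is a minimal
sketch using the Python standard library 'email' package. The
addresses, subject, file name and payload bytes are made up for the
example; only the general shape matters.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Sender A: build a message and label ("tag") each part.
msg = EmailMessage()
msg["From"] = "a@example.org"        # hypothetical addresses
msg["To"] = "b@example.org"
msg["Subject"] = "tagging and bagging"
msg.set_content("Hello, B.")         # labeled text/plain

# Adding a second part wraps both parts in multipart/mixed ("bagging").
msg.add_attachment(b"GIF89a", maintype="image", subtype="gif",
                   filename="dot.gif")

wire = msg.as_bytes()                # what actually travels from A to B

# Recipient B: read the labels to learn how A intends each part
# to be interpreted.
received = message_from_bytes(wire, policy=policy.default)
part_types = [part.get_content_type() for part in received.walk()]
print(part_types)
```

The recipient never inspects the payload bytes here; it relies only on
the labels the sender supplied.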


"MIME types" (renamed "Internet Media Types") were part of the
labeling: the name space of kinds of things.

The MIME type registry ("Internet Media Type registry") is where
someone can tell the world what a particular label means, as far as
the sender's intent is concerned.


Introducing MIME into the web


The original "World Wide Web" didn't have MIME tagging and bagging:

*	Everything was HTML (more or less)
*	HTTP 0.9 assumed that what was transferred could be


Around then, Gopher (a hyperlinked menu system) was quite popular, and
it knew about a couple of 'link types'. I'd been working at Xerox PARC
on a system for document storage and access that used file types and
allowed the client to ask for the types of storage it wanted.


Working on Gopher, and then on WWW, the proposal (around 1991) was
that Gopher and WWW should use MIME types as the vocabulary for
talking about file types.


The result was that HTTP 1.0 included a type label, "content-type",
following (kind of, with a couple of exceptions) MIME. Later, for
content negotiation, additional uses of this technology (in 'Accept'
headers) were also added.
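
To make the two uses concrete, here is a small sketch in Python. The
function names (parse_content_type, parse_accept, choose) are
hypothetical, and the matching is deliberately simplified: it orders
media ranges by q-value only and ignores the more detailed precedence
rules that the HTTP specification defines.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type value into (media type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    params = dict(msg.get_params()[1:])   # first entry is the type itself
    return msg.get_content_type(), params


def parse_accept(value):
    """Return (media range, q) pairs from an Accept value, best first."""
    ranges = []
    for item in value.split(","):
        fields = [f.strip() for f in item.split(";")]
        q = 1.0                           # q defaults to 1 when absent
        for field in fields[1:]:
            name, _, val = field.partition("=")
            if name.strip().lower() == "q":
                q = float(val)
        ranges.append((fields[0].lower(), q))
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def choose(available, accept):
    """Pick the first available type matching the best-ranked range."""
    for media_range, q in parse_accept(accept):
        if q == 0:
            continue                      # q=0 means "not acceptable"
        for media_type in available:
            if media_range in ("*/*", media_type):
                return media_type
            if media_range.endswith("/*") and \
                    media_type.startswith(media_range[:-1]):
                return media_type
    return None


print(parse_content_type("text/html; charset=ISO-8859-1"))
print(choose(["image/gif", "image/png"], "image/gif;q=0.5, image/png"))
```

The sender's label travels in content-type; the recipient's
preferences travel in Accept, using the same vocabulary of type names.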


The differences with MIME were minor (default charset, requirement for
CRLF in plain text). These minor differences have caused a lot of
trouble anyway, but that's another story.


Not quite a good match


The use of MIME for the web was a good start, but unfortunately the
web isn't quite messaging:

(a)    messages are generally sent specifically in response to a
request; this means you know more about the data before you receive
it. In particular, the data really does have a 'name' (mainly, the URL
used to access the data), while in messaging, the messages were
anonymous.

(b)   some content isn't really delivered over the net (files on local
file system), or there is no opportunity for tagging (data delivered
over FTP) and in those cases, the additional information is crucial.


At the same time, operating systems were using, and continued to
evolve to use, different systems to determine the 'type' of something,
different from the MIME tagging and bagging:


a)      using 'magic numbers': in many contexts, file types could be
guessed pretty reliably by looking at the first few bytes of the file

b)      Originally Mac OS had a 4 character 'file type' and another 4
character 'creator code' for each file

c)       Windows evolved to use the "file extension" - 3 letters (and
then more) at the end of the file name
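
A minimal sketch of two of these approaches in Python follows. The
table of magic numbers is a small hand-picked sample, not a registry,
and the function names are hypothetical. The extension lookup uses the
standard library 'mimetypes' module.

```python
import mimetypes

# A few well-known leading byte sequences ("magic numbers").
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def type_from_magic(data):
    """Guess a type from the first bytes of the content."""
    for prefix, media_type in MAGIC:
        if data.startswith(prefix):
            return media_type
    return None


def type_from_extension(name):
    """Guess a type from the file name alone."""
    media_type, _encoding = mimetypes.guess_type(name)
    return media_type


# A GIF image saved under a .png name: the two methods disagree.
name, data = "picture.png", b"GIF89a" + b"\x00" * 10
print(type_from_extension(name), type_from_magic(data))
```

Neither guess involves a label supplied by the sender, which is why
these methods can disagree with each other and with a MIME label.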


This wasn't entirely unanticipated in MIME, e.g., the MIME type
registry encouraged those registering MIME types to also describe
'magic numbers', Mac file types, and common file extensions.


The Rules Weren't Quite Followed


a)      Lots of file types aren't registered (no entry in IANA for
file types)

b)      For those that are registered, the registration is incomplete
or incorrect (people doing registration didn't understand 'magic
number')


Bad things happened:


a)      Browser implementors would be liberal in what they accepted,
and use file extension and/or magic number or other 'sniffing'
techniques to decide file type, without assuming content-label was
authoritative. This was necessary anyway for files that weren't
delivered by HTTP.

b)      HTTP server implementors and administrators didn't supply ways
of easily associating the 'intended' file type label with the file,
resulting in files frequently being delivered with a label other than
the one they would have chosen if they'd thought about it, and if
browsers *had* assumed content-type was authoritative.


Which of these happened first doesn't quite matter (most likely a,
then b), but it's a vicious cycle, anyway.


Result is not good:


The result, though, is that the web is unreliable. Servers sending
responses to browsers don't have a good guarantee that the browser
won't "sniff" the content and decide to do something other than treat
it as it is labeled. Browsers receiving content don't have a good
guarantee that the content isn't mis-labeled. And intermediaries like
gateways, proxies, caches, and other pieces of the web infrastructure
don't have a good way of telling what the conversation means.
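
The disagreement can be sketched in a few lines of Python. Both
recipient functions and the one-rule sniffer are hypothetical
simplifications; real browsers apply far more elaborate rules.

```python
def sniff(body):
    """Very rough guess: does the content look like HTML?"""
    start = body.lstrip().lower()
    if start.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return None


def liberal_recipient(declared, body):
    """Trust the content over the label whenever sniffing finds something."""
    return sniff(body) or declared


def strict_recipient(declared, body):
    """Treat the label as the sender's intent; sniff only if it is missing."""
    return declared or sniff(body)


# The server means "show this source as text", but the bytes look like HTML.
declared = "text/plain"
body = b"<html><body>example</body></html>"

print(liberal_recipient(declared, body))
print(strict_recipient(declared, body))
```

The same response is handled in two different ways, and the sender
cannot tell in advance which kind of recipient it is talking to.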


This ambiguity and 'sniffing' also applies to packaged content in
webapps ('bagging' but using ZIP rather than MIME multipart).


Extensibility, content negotiation


Adding MIME to the web opened an enormous path for extensibility
of the web. The fact that HTTP could reliably transport images allowed
NCSA to add img to HTML and reliably deliver multiple image types. The
addition of MIME allowed other document formats (Word, PDF,
PostScript) and other kinds of hypermedia, as well as applications.
MIME was also an important engine for extensibility in messaging.

Of course, extensibility has its own problems. When senders use
extensions that recipients aren't aware of, or that recipients
implement incorrectly or incompletely, communication often fails.
With messaging, this is a serious problem, although most 'rich text'
documents are still delivered in multiple forms (using
multipart/alternative). With the web, the idea was to provide
'content negotiation', but basing content negotiation solely on
Internet Media Types has some serious (fatal)


The MIME story covers charsets as well


While the above tale was written about Internet Media Types, the same
kind of vicious cycle also happened with character set labels:
mislabeled content that liberal browsers happily processed correctly
encouraged more and more sites to publish text with mislabeled
character sets, to the point where browsers feel they *have* to guess
rather than trust the label.
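
A small Python example shows what is at stake when a character set
label is wrong. The sample word is arbitrary.

```python
text = "café"
sent = text.encode("utf-8")              # bytes the server actually sends

# The server labels the bytes iso-8859-1 although they are UTF-8.
as_labeled = sent.decode("iso-8859-1")   # recipient trusts the label
as_guessed = sent.decode("utf-8")        # recipient ignores the label

print(as_labeled)   # cafÃ©
print(as_guessed)   # café
```

A recipient that trusts the label shows garbage, and one that guesses
shows the intended text, which is exactly the incentive that keeps the
cycle going.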


Some additional requirements


The specifications for MIME and Internet Media Types, and their
design, may have some additional requirements that haven't been
explored well. There are two particularly interesting use cases:


*	 "Polyglot" documents:  A 'polyglot' document is data that can
be treated as two different Internet Media Types, in the case where
the meaning of the data is the same under both. This is part of a
transition strategy to allow content providers (senders) to manage,
produce, store, and deliver the same data, but with two different
labels, and have it work equivalently with two different kinds of
receivers (one of which knows one Internet Media Type, and another
which knows a second one). This use case was part of the transition
strategy from HTML to an XML-based XHTML, and also a way for a single
service to offer both HTML-based and XML-based processing (e.g., the
same content useful for news articles and web pages).


*	"Alternate views": This use case seems similar but it's quite
different. This is the use case where the same data has a very
different meaning when served as two different content-types, but that
difference is intentional; for example, the same data served as
text/html is a document, and served as an RDFa type is some specific
data. (not sure what to call these).
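
The polyglot case can be illustrated with a short Python sketch that
feeds the same string to an HTML parser and to an XML parser from the
standard library. The sample document and helper names are made up for
the example, and the comparison only looks at text content, which is a
much weaker test than "same meaning".

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# One piece of data, written to be acceptable to both kinds of parser.
DOC = ('<html xmlns="http://www.w3.org/1999/xhtml">'
       '<head><title>Hi</title></head>'
       '<body><p>Same bytes</p></body></html>')


class TextCollector(HTMLParser):
    """Collect character data seen by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def text_as_html(doc):
    parser = TextCollector()
    parser.feed(doc)
    parser.close()
    return "".join(parser.text)


def text_as_xml(doc):
    return "".join(ET.fromstring(doc).itertext())


print(text_as_html(DOC) == text_as_xml(DOC))
```

In the "alternate views" case the two readings would be expected to
differ, and that difference would be the point.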


Some additional things people would like to do that are harder


(want to expand these later, park desiderata here):


*	distinguish different versions with different headers or
*	content negotiation
*	knowing the type of something isn't something you can handle
before you ask for it


Relationship of Internet Media Type and internal version indicators


(need to expand this)


The notion of an "Internet Media Type" is very coarse-grained.
Languages and formats evolve over time, and in many cases the
evolution might involve having different kinds of processors, or
needing to know not only the general "Media Type" but also the
specific version. The general approach to this has been that the
definition of the Media Type includes provisions for version
indicator(s) embedded in the content itself, to determine more
precisely how the data is to be interpreted.  That is, the message
itself contains further information.
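
PDF is one familiar example of this approach: every version is
labeled application/pdf, and the version is given in the first line of
the content. A minimal Python sketch (the function name and the sample
byte strings are made up for the example):

```python
import re

MEDIA_TYPE = "application/pdf"   # the same label for every version


def pdf_version(data):
    """Read the version indicator embedded at the start of a PDF."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


older = b"%PDF-1.3\n..."
newer = b"%PDF-1.7\n..."

print(MEDIA_TYPE, pdf_version(older))
print(MEDIA_TYPE, pdf_version(newer))
```

The label alone cannot tell the two apart; a processor has to look
inside the content.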


Unfortunately, lots has gone wrong in this scenario as well:
processors that ignore version indicators encourage content creators
to supply incorrect version indicators.



Fragment identifiers


The web added the notion of being able to address part of some content
rather than the whole, by adding a 'fragment identifier' to the URL
that addressed the data. Of course, this made sense for the original
web with just HTML, but how would it apply to other content? The URL
spec glibly noted that "the definition of the fragment identifier
meaning depends on the MIME type", but unfortunately, few of the MIME
type definitions included this information, and practices diverged
greatly.
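
A short Python sketch of the dependency: the URL syntax separates the
fragment, but its meaning has to be looked up per media type. The
lookup table is a hypothetical illustration with informal
descriptions, not a summary of any registration.

```python
from urllib.parse import urldefrag


def fragment_meaning(media_type, fragment):
    """Describe what a fragment refers to, if the type defines it."""
    meanings = {
        "text/html": "element whose id or name is " + repr(fragment),
        "application/pdf": "open parameters " + repr(fragment),
    }
    return meanings.get(media_type)   # None: the type does not say


url, fragment = urldefrag("http://example.org/doc#intro")
print(url, fragment)
print(fragment_meaning("text/html", fragment))
print(fragment_meaning("image/png", fragment))
```

The recipient only learns the media type after retrieving the data, so
the same URL can end up meaning different things.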


Where we need to go


In the above story, about MIME and the web, there is nothing about
"authoritative" and priorities. Stuff happens. There is no "license" -
a content-type header doesn't give "permission" for the recipient to
do anything.


We need a clear direction on how to make the web more reliable, not
less. We need a realistic transition plan from the unreliable web to
the more reliable one. Part of this is to encourage senders (web
servers) to mean what they say, and encourage recipients (browsers) to
give preference to what the senders are sending.


We should try to create specifications for protocols and best
practices that will lead the web to more reliable and secure
communication. To this end, we give an overall architectural approach
to the use of MIME, and then specific specifications for HTTP clients
and servers, Web browsers in general, and proxies and intermediaries.
These should encourage behavior which, on the one hand, continues to
work with the already deployed infrastructure (of servers, browsers,
and intermediaries), and which, if the advice is followed, also
improves the operability, reliability and security of the web.


Specific recommendations


(I think I want to see if we can get agreement on the background,
problem statement and requirements before sending out any more about
possible solutions. However, the following is a partial list of
documents that should be reviewed & updated, or new documents
written.)


update MIME / Internet Media Type registration process (IETF BCP)

possibly URI/IRI scheme registration process (?? fragment identifier

update TAG finding on authoritative metadata

new:  MIME and Internet Media Type section to WebArch

New: Add a W3C web architecture material on MIME in HTML to W3C web

update HTML spec on sniffing, versioning, MIME types, charset sniffing

update WEBAPPS specs (which ones?)

update sniffing spec


Received on Sunday, 30 May 2010 03:57:28 UTC