- From: Larry Masinter <LMM@acm.org>
- Date: Sat, 29 May 2010 20:56:48 -0700
- To: <www-tag@w3.org>
- Message-ID: <000e01caffac$2225f090$6671d1b0$@org>
This is a draft (for TAG ACTION-425) for discussion at an upcoming W3C TAG meeting. In its current form, I think it may be more suitable as a 'note', but I'd like to get TAG (and community, including W3C and IETF) agreement on its contents before going on to update various Findings, BCPs and specifications.

MIME and the Web

Origins of MIME

MIME was invented originally for email, based on general principles of 'messaging' as foundational architecture. The role of MIME was to extend Internet messaging beyond ASCII-only plain text (to other character sets, images, rich documents, etc.).

The basic architecture of complex content messaging is:

* Message sent from A to B.
* Message includes some data. Sender A includes standard 'headers' giving recipient B enough information to know how sender A intends the message to be interpreted.
* Recipient B gets the message, interprets the 'headers' for the data and uses them as information on how to interpret the data.

MIME is a "tagging and bagging" specification:

* tagging: how to label content so the intent of how the content should be interpreted is known
* bagging: how to wrap the content so the label is clear, or, if there are multiple parts to a single message, how to combine them.

"MIME types" (renamed "Internet Media Types") were part of the labeling, the name space of kinds of things. The MIME type registry ("Internet Media Type registry") is where someone can tell the world what a particular label means, as far as the sender's intent.

Introducing MIME into the web

The original "World Wide Web" didn't have MIME tagging and bagging:

* Everything was HTML (more or less)
* HTTP 0.9 assumed that what was transferred could be

Around then, Gopher (a hyperlink menu system) was quite popular, and knew about a couple of 'link types'. I'd been working at Xerox PARC on a system for document storage and access that used file types and allowed the client to ask for the types of storage it wanted.
Working on Gopher, and then on WWW, the proposal (around 1991) was that Gopher and WWW should use MIME types as the vocabulary for talking about file types. The result was that HTTP 1.0 included a type label, "content-type", following (kind of, with a couple of exceptions) MIME. Later, for content negotiation, additional uses of this technology (in 'Accept' headers) were also added.

The differences with MIME were minor (default charset, requirement for CRLF in plain text). These minor differences have caused a lot of trouble anyway, but that's another story.

Not quite a good match

Unfortunately, while the use of MIME for the web was a good start, the web isn't quite messaging:

(a) messages are generally specifically in response to a request; this means you know more about the data before you receive it. In particular, the data really does have a 'name' (mainly, the URL used to access the data), while in messaging, the messages were anonymous.
(b) some content isn't really delivered over the net (files on local file system), or there is no opportunity for tagging (data delivered over FTP) and in those cases, the additional information is crucial.

At the same time, operating systems were using, and continued to evolve to use, different systems to determine the 'type' of something, different from the MIME tagging and bagging:

a) using 'magic numbers': in many contexts, file types could be guessed pretty reliably by looking for headers
b) Originally Mac OS had a 4 character 'file type' and another 4 character 'creator code' for file types
c) Windows evolved to use the "file extension" - 3 letters (and then more) at the end of the file name

This wasn't entirely unanticipated in MIME, e.g., the MIME type registry encouraged those registering MIME types to also describe 'magic numbers', Mac file type, common file extensions.
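To make the competing type signals concrete, here is a minimal Python sketch that gathers three of them for one piece of content: the sender's label, the file extension, and a magic number. The helper names (`sniff_magic`, `type_signals`), the tiny magic-number table, and the sample file name are assumptions made up for illustration; `mimetypes.guess_type` is the Python standard-library extension lookup.

```python
import mimetypes

# Hypothetical, deliberately tiny magic-number table (illustration only).
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_magic(data):
    """Guess a media type from the leading bytes, or None if nothing matches."""
    for prefix, media_type in MAGIC_NUMBERS:
        if data.startswith(prefix):
            return media_type
    return None


def type_signals(label, name, data):
    """Collect three independent opinions about the type of one piece of content."""
    by_extension, _encoding = mimetypes.guess_type(name)
    return {
        "label": label,              # what the sender said (content-type)
        "extension": by_extension,   # what the file name suggests
        "magic": sniff_magic(data),  # what the leading bytes suggest
    }


# A file labeled as plain text whose name and leading bytes both suggest PDF.
signals = type_signals("text/plain", "report.pdf", b"%PDF-1.4\n...")
print(signals)
```

The sketch only reports the signals; it deliberately does not pick a winner, since which one a recipient should prefer when they disagree is exactly the open question.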
The Rules Weren't Quite Followed

a) Lots of file types aren't registered (no entry in IANA for file types)
b) For those that are, the registration is incomplete or incorrect (people doing registration didn't understand 'magic number')

Bad things happened:

a) Browser implementors would be liberal in what they accepted, and use file extension and/or magic number or other 'sniffing' techniques to decide file type, without assuming the content-type label was authoritative. This was necessary anyway for files that weren't delivered by HTTP.
b) HTTP server implementors and administrators didn't supply ways of easily associating the 'intended' file type label with the file, resulting in files frequently being delivered with a label other than the one they would have chosen if they'd thought about it, and if browsers *had* assumed content-type was authoritative.

Which of these happened first doesn't quite matter (most likely a, then b), but it's a vicious cycle, anyway.

Result is not good:

The result, though, is that the web is unreliable: servers sending responses to browsers don't have a good guarantee that the browser won't "sniff" the content and decide to do something other than treat it as it is labeled; browsers receiving content don't have a good guarantee that the content isn't mis-labeled; and intermediaries like gateways, proxies, caches, and other pieces of the web infrastructure don't have a good way of telling what the conversation means.

This ambiguity and 'sniffing' also applies to packaged content in webapps ('bagging' but using ZIP rather than MIME multipart).

Extensibility, content negotiation

Adding MIME to the web introduced an enormous path for extensibility of the web. The fact that HTTP could reliably transport images allowed NCSA to add img to HTML and reliably deliver multiple image types. The addition of MIME allowed other document formats (Word, PDF, Postscript) and other kinds of hypermedia, as well as applications.
MIME was an important engine for extensibility in messaging.

Of course, extensibility has its own problems. When senders use extensions that recipients aren't aware of, or implement incorrectly or incompletely, communication often fails. With messaging, this is a serious problem, although most 'rich text' documents are still delivered in multiple forms (using multipart/alternative). With the web, the idea was to provide 'content negotiation', but basing content negotiation solely on Internet Media Types has some serious (fatal) drawbacks.

The MIME story covers charsets as well

While the above tale was written about Internet Media Types, the same kind of vicious cycle also happened with character set labels: mislabeled content happily processed correctly by liberal browsers encouraged more and more sites to proliferate text with mis-labeled character sets, to the point where browsers feel they *have* to guess rather than trust the label.

Some additional requirements

The specifications for MIME and Internet Media Types, and their design, may have some additional requirements that haven't been explored well. There are two particularly interesting use cases:

* "Polyglot" documents: A 'polyglot' document is data which can be treated as two different Internet Media Types, in the case where the meaning of the data is the same. This is part of a transition strategy to allow content providers (senders) to manage, produce, store, and deliver the same data, but with two different labels, and have it work equivalently with two different kinds of receivers (one of which knows one Internet Media Type, and another which knows a second one). This use case was part of the transition strategy from HTML to an XML-based XHTML, and also a way for a single service to offer both HTML-based and XML-based processing (e.g., same content useful for news articles and web pages).
* "Alternate views": This use case seems similar but it's quite different.
This is the use case where the same data has very different meaning when served as two different content-types, but that difference is intentional; for example, the same data served as text/html is a document, and served as an RDFa type is some specific data. (not sure what to call these).

Some additional things people would like to do that are harder (want to expand these later, park desiderata here):

* distinguish different versions with different headers or parameters
* content negotiation
* knowing the type of something isn't something you can handle before you ask for it

Relationship of Internet Media Type and internal version indicators

(need to expand this)

The notion of an "Internet Media Type" is very coarse-grained. In general, for example, languages and formats evolve over time, and in many cases, the evolution might involve having different kinds of processors, or needing to know not only the general "Media Type" but the specific version. The general approach to this has been that the actual Media Type includes provisions for version indicator(s) embedded in the content itself to determine more precisely the nature of how the data is to be interpreted. That is, the message itself contains further information. Unfortunately, lots has gone wrong in this scenario as well: processors that ignore version indicators encourage content creators to supply incorrect version indicators.

Fragment identifiers

The web added the notion of being able to address part of some content, rather than the whole, by adding a 'fragment identifier' to the URL that addressed the data. Of course, this made sense for the original web with just HTML, but how would it apply to other content? The URL spec glibly noted that "the definition of the fragment identifier meaning depends on the MIME type", but unfortunately, few of the MIME type definitions included this information, and practices diverged greatly.
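As a small illustration of why the fragment's meaning hangs on the media type, the following Python sketch splits a URL with the standard-library `urllib.parse.urldefrag` and then looks up how the fragment would be read. The example URL, the `describe_fragment` helper, and the one-entry rules table are hypothetical and only meant to show the shape of the dependency.

```python
from urllib.parse import urldefrag

# Hypothetical example URL, used only for illustration.
url = "http://example.org/report#section-2"

# The fragment is split off on the client side; only the part before '#'
# identifies the data that is actually retrieved.
resource_url, fragment = urldefrag(url)

# Hypothetical table: how a recipient might interpret a fragment, per media type.
# Types missing from the table stand for registrations that say nothing about it.
FRAGMENT_RULES = {
    "text/html": "element whose id (or anchor name) matches the fragment",
}


def describe_fragment(fragment, media_type):
    """Say how the fragment would be read for the given media type label."""
    rule = FRAGMENT_RULES.get(media_type)
    if rule is None:
        return f"'{fragment}': meaning not defined for {media_type} in this table"
    return f"'{fragment}': {rule}"


print(resource_url)
print(describe_fragment(fragment, "text/html"))
print(describe_fragment(fragment, "application/x-example"))
```

The same fragment string yields a usable interpretation only when the media type's definition supplies one, which is the gap described above.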
Where we need to go

In the above story about MIME and the web, there is nothing about "authoritative" and priorities. Stuff happens. There is no "license" - a content-type header doesn't give "permission" for the recipient to do anything.

We need a clear direction on how to make the web more reliable, not less. We need a realistic transition plan from the unreliable web to the more reliable one. Part of this is to encourage senders (web servers) to mean what they say, and encourage recipients (browsers) to give preference to what the senders are sending.

We should try to create specifications for protocols and best practices that will lead the web to more reliable and secure communication. To this end, we give an overall architectural approach to use of MIME, and then specific specifications for HTTP clients and servers, Web Browsers in general, proxies and intermediaries. These encourage behavior which, on the one hand, continues to work with the already deployed infrastructure (of servers, browsers, and intermediaries), and which, if the advice is followed, also improves the operability, reliability and security of the web.

Specific recommendations

(I think I want to see if we can get agreement on the background, problem statement and requirements before sending out any more about possible solutions; however, the following is a partial list of documents that should be reviewed & updated, or new documents written.)

* update MIME / Internet Media Type registration process (IETF BCP)
* possibly URI/IRI scheme registration process (?? fragment identifier use??)
* update TAG finding on authoritative metadata
* new: MIME and Internet Media Type section to WebArch
* new: add W3C web architecture material on MIME in HTML to W3C web site
* update HTML spec on sniffing, versioning, MIME types, charset sniffing
* update WEBAPPS specs (which ones?)
* update sniffing spec
Received on Sunday, 30 May 2010 03:57:28 UTC