- From: Noah Mendelsohn <nrm@arcanedomain.com>
- Date: Sun, 30 May 2010 22:24:24 -0400
- To: Larry Masinter <LMM@acm.org>
- Cc: "www-tag@w3.org" <www-tag@w3.org>
You wrote:

> HTTP 0.9 assumed that what was transferred could be

Is there some text missing at the end of that line?

Thanks.

Noah

Larry Masinter wrote:
> This is a draft (for TAG ACTION-425) for discussion at an upcoming W3C
> TAG meeting. In its current form, I think it may be more suitable as a
> ‘note’, but I’d like to get TAG (and community, including W3C and IETF)
> agreement on its contents, before going on to update various Findings,
> BCPs and specifications.
>
> *MIME and the Web*
>
> *Origins of MIME*
>
> MIME was invented originally for email, based on general principles of
> ‘messaging’, a foundational architecture. The role of MIME was to extend
> Internet messaging beyond ASCII-only plain text (to other character sets,
> images, rich documents, etc.). The basic architecture of complex content
> messaging is:
>
> * Message sent from A to B.
> * Message includes some data. Sender A includes standard ‘headers’
>   telling recipient B enough information that recipient B knows how
>   sender A intends the message to be interpreted.
> * Recipient B gets the message, interprets the ‘headers’ for the
>   data and uses them as information on how to interpret the data.
>
> MIME is a “tagging and bagging” specification:
>
> * tagging: how to label content so the intent of how the content
>   should be interpreted is known
> * bagging: how to wrap the content so the label is clear, or, if
>   there are multiple parts to a single message, how to combine them.
>
> “MIME types” (renamed “Internet Media Types”) were part of the labeling,
> the name space of kinds of things.
>
> The MIME type registry (“Internet Media Type registry”) is where someone
> can tell the world what a particular label means, as far as the sender’s
> intent.
>
> *Introducing MIME into the web*
>
> The original “World Wide Web” didn’t have MIME tagging and bagging:
>
> * Everything was HTML (more or less)
> * HTTP 0.9 assumed that what was transferred could be
>
> Around then, Gopher (a hyperlink menu system) was quite popular, and knew
> about a couple of ‘link types’. I’d been working at Xerox PARC on a
> system for document storage and access that used file types and allowed
> the client to ask for the types of storage it wanted.
>
> Working on Gopher, and then on WWW, the proposal (around 1991) was that
> Gopher and WWW should use MIME types as the vocabulary for talking about
> file types.
>
> The result was that HTTP 1.0 included a type label, “content-type”,
> following (kind of, with a couple of exceptions) MIME. Later, for
> content negotiation, additional uses of this technology (in ‘Accept’
> headers) were also added.
>
> The differences with MIME were minor (default charset, requirement for
> CRLF in plain text). These minor differences have caused a lot of
> trouble anyway, but that’s another story.
>
> *Not quite a good match*
>
> The use of MIME for the web was a good start but, unfortunately, the
> web isn’t quite messaging:
>
> (a) messages are generally specifically in response to a request;
>     this means you know more about the data before you receive it. In
>     particular, the data really does have a ‘name’ (mainly, the URL used
>     to access the data), while in messaging, the messages were anonymous.
>
> (b) some content isn’t really delivered over the net (files on local
>     file system), or there is no opportunity for tagging (data delivered
>     over FTP) and in those cases, the additional information is crucial.
>
> At the same time, operating systems were using, and continued to evolve
> to use, different systems to determine the ‘type’ of something,
> different from the MIME tagging and bagging:
>
> a) using ‘magic numbers’: in many contexts, file types could be
>    guessed pretty reliably by looking for headers
>
> b) Originally Mac OS had a 4 character ‘file type’ and another 4
>    character ‘creator code’ for file types
>
> c) Windows evolved to use the “file extension”: 3 letters (and
>    then more) at the end of the file name
>
> This wasn’t entirely unanticipated in MIME, e.g., the MIME type registry
> encouraged those registering MIME types to also describe ‘magic
> numbers’, Mac file type, common file extensions.
>
> *The Rules Weren’t Quite Followed*
>
> a) Lots of file types aren’t registered (no entry in IANA for file
>    types)
>
> b) For those that are registered, the registration is incomplete or
>    incorrect (people doing registration didn’t understand ‘magic number’)
>
> *Bad things happened:*
>
> a) Browser implementors would be liberal in what they accepted, and
>    use file extension and/or magic number or other ‘sniffing’ techniques
>    to decide file type, without assuming content-label was authoritative.
>    This was necessary anyway for files that weren’t delivered by HTTP.
>
> b) HTTP server implementors and administrators didn’t supply ways
>    of easily associating the ‘intended’ file type label with the file,
>    resulting in files frequently being delivered with a label other than
>    the one they would have chosen if they’d thought about it, and if
>    browsers **had** assumed content-type was authoritative.
>
> Which of these happened first doesn’t quite matter (most likely a, then
> b), but it’s a vicious cycle, anyway.
>
> *Result is not good:*
>
> The result, though, is that the web is unreliable, in that servers sending
> responses to browsers don’t have a good guarantee that the browser won’t
> “sniff” the content and decide to do something other than treat it as it
> is labeled, and browsers receiving content don’t have a good guarantee
> that the content isn’t mis-labeled, and intermediaries like gateways,
> proxies, caches, and other pieces of the web infrastructure don’t have a
> good way of telling what the conversation means.
>
> This ambiguity and ‘sniffing’ also applies to packaged content in
> webapps (‘bagging’ but using ZIP rather than MIME multipart).
>
> *Extensibility, content negotiation*
>
> Adding MIME to the web introduced an enormous path for extensibility of
> the web. The fact that HTTP could reliably transport images allowed NCSA
> to add img to HTML and reliably deliver multiple image types. The
> addition of MIME allowed other document formats (Word, PDF, PostScript)
> and other kinds of hypermedia, as well as applications. MIME was an
> important engine for extensibility in messaging. Of course,
> extensibility has its own problems. When senders use extensions that
> recipients aren’t aware of, or implement incorrectly or incompletely,
> communication often fails. With messaging, this is a serious problem,
> although most ‘rich text’ documents are still delivered in multiple
> forms (using multipart/alternative). With the web, the idea was to
> provide ‘content negotiation’, but basing content negotiation solely on
> Internet Media Types has some serious (fatal) drawbacks.
>
> *The MIME story covers charsets as well*
>
> While the above tale was written about Internet Media Types, the same
> kind of vicious cycle also happened with character set labels:
> mislabeled content happily processed correctly by liberal browsers
> encouraged more and more sites to proliferate text with mis-labeled
> character sets, to the point where browsers feel they **have** to guess
> the wrong label.
>
> *Some additional requirements*
>
> The specifications for MIME and Internet Media Types, and their design,
> may have some additional requirements that haven’t been explored well.
> There are two particularly interesting use cases:
>
> * *“Polyglot” documents:* A ‘polyglot’ document is one which is
>   some data which can be treated as two different Internet Media
>   Types, in the case where the meaning of the data is the same. This
>   is part of a transition strategy to allow content providers
>   (senders) to manage, produce, store, deliver the same data, but
>   with two different labels, and have it work equivalently with two
>   different kinds of receivers (one of which knows one Internet
>   Media Type, and another which knows a second one). This use case
>   was part of the transition strategy from HTML to an XML-based
>   XHTML, and also as a way of a single service offering both
>   HTML-based and XML-based processing (e.g., same content useful for
>   news articles and web pages).
>
> * *“Alternate views”:* This use case seems similar but it’s quite
>   different. This is the use case where the same data has very
>   different meaning when served as two different content-types, but
>   that difference is intentional; for example, the same data served
>   as text/html is a document, and served as an RDFa type is some
>   specific data. (not sure what to call these).
>
> *Some additional things people would like to do that are harder*
>
> /(want to expand these later, park desiderata here):/
>
> * distinguish different versions with different headers or parameters
> * content negotiation
> * knowing the type of something isn’t something you can handle
>   before you ask for it
>
> *Relationship of Internet Media Type and internal version indicators*
>
> /(need to expand this)/
>
> The notion of an “Internet Media Type” is very coarse-grained. In
> general, for example, languages and formats evolve over time, and in
> many cases, the evolution might involve having different kinds of
> processors, or needing to know not only the general “Media Type” but the
> specific version. The general approach to this has been that the actual
> Media Type includes provisions for version indicator(s) embedded in the
> content itself to determine more precisely the nature of how the data is
> to be interpreted. That is, the message itself contains further
> information.
>
> Unfortunately, a lot has gone wrong in this scenario as well: processors
> ignoring version indicators encourages content creators to supply
> incorrect version indicators.
>
> *Fragment identifiers*
>
> The web added this notion of being able to address part of the content and
> not the whole content by adding a ‘fragment identifier’ to the URL that
> addressed the data. Of course, this originally made sense for the
> original web with just HTML, but how would it apply to other content?
> The URL spec glibly noted that “the definition of the fragment
> identifier meaning depends on the MIME type”, but unfortunately, few of
> the MIME type definitions included this information, and practices
> diverged greatly.
>
> *Where we need to go*
>
> In the above story, about MIME and the web, there is nothing about
> “authoritative” and priorities. Stuff happens.
> There is no “license” – a content-type header doesn’t give “permission”
> for the recipient to do anything.
>
> We need a clear direction on how to make the web more reliable, not
> less. We need a realistic transition plan from the unreliable web to the
> more reliable one. Part of this is to encourage senders (web servers) to
> mean what they say, and encourage recipients (browsers) to give
> preference to what the senders are sending.
>
> We should try to create specifications for protocols and best practices
> that will lead the web to more reliable and secure communication. To
> this end, we give an overall architectural approach to use of MIME, and
> then specific specifications, for HTTP clients and servers, Web Browsers
> in general, proxies and intermediaries, which encourage behavior which,
> on the one hand, continues to work with the already deployed
> infrastructure (of servers, browsers, and intermediaries), but whose
> advice, if followed, also improves the operability, reliability and
> security of the web.
>
> *Specific recommendations*
>
> /(I think I want to see if we can get agreement on the background,
> problem statement and requirements, before sending out any more about
> possible solutions, however the following is a partial list of documents
> that should be reviewed & updated, or new documents written)/
>
> * update MIME / Internet Media Type registration process (IETF BCP)
> * possibly URI/IRI scheme registration process (?? fragment identifier use??)
> * update TAG finding on authoritative metadata
> * new: MIME and Internet Media Type section to WebArch
> * new: add W3C web architecture material on MIME in HTML to W3C web site
> * update HTML spec on sniffing, versioning, MIME types, charset sniffing
> * update WEBAPPS specs (which ones?)
> * update sniffing spec

Received on Monday, 31 May 2010 02:24:58 UTC