Re: MIME and the Web from Noah Mendelsohn on 2010-05-31 (www-tag@w3.org from May 2010)

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Sun, 30 May 2010 22:24:24 -0400
To: Larry Masinter <LMM@acm.org>
Cc: "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <4C031DD8.1090303@arcanedomain.com>
You wrote:

 > HTTP 0.9 assumed that what was transferred could be

Is there some text missing at the end of that line?  Thanks.

Noah

Larry Masinter wrote:
> This is a draft  (for TAG ACTION-425) for discussion at an upcoming W3C 
> TAG meeting.  In its current form, I think it may be more suitable as a 
> ‘note’, but I’d like to get TAG (and community, including W3C and IETF) 
> agreement on its contents, before going on to update various Findings, 
> BCPs and specifications.
> 
>  
> 
> 
> *MIME and the Web*
> 
>  
> 
> *Origins of MIME*
> 
>  
> 
> MIME was invented originally for email, based on the general principles 
> of ‘messaging’ as a foundational architecture. The role of MIME was to 
> extend Internet messaging beyond ASCII-only plain text (to other 
> character sets, images, rich documents, etc.). The basic architecture of 
> complex content messaging is:
> 
>  
> 
>     * Message sent from A to B.
>     * Message includes some data.   Sender A includes standard ‘headers’
>       telling recipient B enough information that recipient B knows how
>       sender  A intends the message to be interpreted.
>     * Recipient B gets the message, interprets the ‘headers’ for the
>       data and uses it as information on how to interpret the data.
> 
>  
> 
> MIME is a “tagging and bagging” specification:
> 
>     *  tagging: how to label content so the intent of how the content
>       should be interpreted is known
>     *  bagging: how to wrap the content so the label is clear, or, if
>       there are multiple parts to a single message, how to combine them.
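The tagging and bagging split described above can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions: the addresses, subject, filename and placeholder PNG bytes are hypothetical and do not come from the draft.

```python
from email.message import EmailMessage


def build_tagged_message():
    """Build a two-part message: a text note plus a tiny binary attachment."""
    msg = EmailMessage()
    msg["From"] = "a@example.org"   # hypothetical sender A
    msg["To"] = "b@example.org"     # hypothetical recipient B
    msg["Subject"] = "Tagging and bagging"

    # Tagging: set_content labels this body as text/plain.
    msg.set_content("Here is the image we discussed.")

    # Bagging: adding an attachment converts the message to multipart/mixed,
    # and each part carries its own Content-Type label.
    png_signature = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a full image
    msg.add_attachment(png_signature, maintype="image", subtype="png",
                       filename="picture.png")
    return msg


def part_labels(msg):
    """Return the Content-Type label of each part, as recipient B would read it."""
    return [part.get_content_type() for part in msg.iter_parts()]


if __name__ == "__main__":
    message = build_tagged_message()
    print(message.get_content_type())
    print(part_labels(message))
```

The recipient never inspects the bytes here; it relies only on the labels the sender supplied, which is the messaging model the draft describes.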
> 
>  
> 
> “MIME types” (renamed “Internet Media Types”) were part of the labeling, 
> the name space of kinds of things.
> 
> The MIME type registry (“Internet Media Type registry”) is where someone 
> can tell the world what a particular label means, as far as the sender’s 
> intent.
> 
>  
> 
> *Introducing MIME into the web*
> 
>  
> 
>     The original “World Wide Web”  didn’t have MIME tagging and bagging:
> 
>     * Everything was HTML (more or less)
>     * HTTP 0.9 assumed that what was transferred could be
> 
>  
> 
> Around then, Gopher (a hyperlink menu system) was quite popular, and 
> knew about a couple of ‘link types’. I’d been working at Xerox PARC on a 
> system for document storage and access that used file types and allowed 
> the client to ask for the types of storage it wanted.
> 
>  
> 
> Working on Gopher, and then on WWW, the proposal (around 1991) was that 
> Gopher and WWW should use MIME types as the vocabulary for talking about 
> file types.
> 
>  
> 
> The result was that HTTP 1.0 included a type label, “content-type”, 
> following (kind of, with a couple of exceptions) MIME. Later, for 
> content negotiation, additional uses of this technology (in ‘Accept’ 
> headers) were also added.
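As a rough sketch of what these two headers carry, the following parses a Content-Type value into its media type and charset, and orders the entries of an Accept value by their `q` weights. The function names are hypothetical, and the Accept parsing is deliberately simplified (it ignores parameters other than `q` and does not rank by specificity).

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset or None)."""
    holder = Message()
    holder["Content-Type"] = value
    return holder.get_content_type(), holder.get_content_charset()


def parse_accept(value):
    """Return (media range, q) pairs from an Accept value, highest q first."""
    preferences = []
    for item in value.split(","):
        pieces = [piece.strip() for piece in item.split(";")]
        media_range = pieces[0].lower()
        if not media_range:
            continue
        quality = 1.0  # q defaults to 1 when absent
        for parameter in pieces[1:]:
            name, _, raw = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(raw)
                except ValueError:
                    quality = 0.0  # treat an unparsable weight as "not wanted"
        preferences.append((media_range, quality))
    # sorted() is stable, so equal weights keep the order the client sent.
    return sorted(preferences, key=lambda pair: -pair[1])


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=ISO-8859-1"))
    print(parse_accept("text/html, application/xhtml+xml;q=0.9, */*;q=0.1"))
```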
> 
>  
> 
> The differences with MIME were minor (default charset, requirement for 
> CRLF in plain text). These minor differences have caused a lot of 
> trouble anyway, but that’s another story.
> 
>  
> 
> *Not quite a good match*
> 
>  
> 
> The use of MIME for the web was a good start, but unfortunately the 
> web isn’t quite messaging:
> 
> (a)    messages are generally sent in response to a specific request; 
> this means you know more about the data before you receive it. In 
> particular, the data really does have a ‘name’ (mainly, the URL used to 
> access the data), while in messaging, the messages were anonymous.
> 
> (b)   some content isn’t really delivered over the net (files on local 
> file system), or there is no opportunity for tagging (data delivered 
> over FTP) and in those cases, the additional information is crucial.
> 
>  
> 
> At the same time, operating systems were using, and continued to evolve 
> to use, different systems to determine the ‘type’ of something, 
> different from the MIME tagging and bagging:
> 
>  
> 
> a)      using ‘magic numbers’: in many contexts, file types could be 
> guessed pretty reliably by looking for headers
> 
> b)      Originally Mac OS had a 4 character ‘file type’ and another 4 
> character ‘creator code’ for file types
> 
> c)       Windows evolved to use the “file extension” – 3 letters (and 
> then more) at the end of the file name
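The magic-number and file-extension mechanisms listed above might look roughly like this in Python. The signature table is a small hand-picked sample rather than a registry, and the function names are hypothetical; the extension lookup relies on the standard `mimetypes` module.

```python
import mimetypes

# A few well-known leading byte signatures ("magic numbers").
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def guess_by_magic(data):
    """Guess a media type from the first bytes of the content, or None."""
    for prefix, media_type in MAGIC_NUMBERS:
        if data.startswith(prefix):
            return media_type
    return None


def guess_by_extension(filename):
    """Guess a media type from the file name alone, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


if __name__ == "__main__":
    print(guess_by_magic(b"%PDF-1.4\n"))
    print(guess_by_extension("page.html"))
```

Neither function consults any label supplied by a sender, which is why these mechanisms can disagree with a content-type header.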
> 
>  
> 
> This wasn’t entirely unanticipated in MIME, e.g., the MIME type registry 
> encouraged those registering MIME types to also describe ‘magic 
> numbers’, Mac file types, and common file extensions.
> 
>  
> 
> *The Rules Weren’t Quite Followed*
> 
>  
> 
> a)      Lots of file types aren’t registered (no entry in IANA for file 
> types)
> 
> b)      For those that are, the registration is incomplete or incorrect 
> (people doing the registration didn’t understand ‘magic number’)
> 
>  
> 
> *Bad things happened:*
> 
>  
> 
> a)      Browser implementors would be liberal in what they accepted, and 
> use file extension and/or magic number or other ‘sniffing’ techniques to 
> decide file type, without assuming content-label was authoritative. This 
> was necessary anyway for files that weren’t delivered by HTTP.
> 
> b)      HTTP server implementors and administrators didn’t supply ways 
> of easily associating the ‘intended’ file type label with the file, 
> resulting in files frequently being delivered with a label other than 
> the one they would have chosen if they’d thought about it, and if 
> browsers **had** assumed content-type was authoritative.
> 
>  
> 
> Which of these happened first doesn’t quite matter (most likely a, then 
> b), but it’s a vicious cycle, anyway.
> 
>  
> 
> *Result is not good:*
> 
>  
> 
> The result, though, is that the web is unreliable, in that servers sending 
> responses to browsers don’t have a good guarantee that the browser won’t 
> “sniff” the content and decide to do something other than treat it as it 
> is labeled, and browsers receiving content don’t have a good guarantee 
> that the content isn’t mis-labeled, and intermediaries like gateways, 
> proxies, caches, and other pieces of the web infrastructure don’t have a 
> good way of telling what the conversation means. 
> 
>  
> 
> This ambiguity and ‘sniffing’ also applies to packaged content in 
> webapps (‘bagging’ but using ZIP rather than MIME multipart).
> 
>  
> 
> *Extensibility, content negotiation*
> 
>  
> 
> Adding MIME to the web introduced an enormous path for extensibility of 
> the web. The fact that HTTP could reliably transport images allowed NCSA 
> to add img to HTML and reliably deliver multiple image types. The 
> addition of MIME allowed other document formats (Word, PDF, Postscript) 
> and other kinds of hypermedia, as well as applications. MIME was an 
> important engine for extensibility in messaging.  Of course, 
> extensibility has its own problems. When senders use extensions that 
> recipients aren’t aware of, or implement incorrectly or incompletely, 
> communication often fails.  With messaging, this is a serious problem, 
> although most ‘rich text’ documents are still delivered in multiple 
> forms (using multipart/alternative). With the web, the idea was to 
> provide ‘content negotiation’, but basing content negotiation solely on 
> Internet Media Types has some serious (fatal) drawbacks.
> 
>  
> 
> *The MIME story covers charsets as well*
> 
>  
> 
> While the above tale was written about Internet Media Types, the same 
> kind of vicious cycle also happened with character set labels: 
> mislabeled content happily processed correctly by liberal browsers 
> encouraged more and more sites to proliferate text with  mis-labeled 
> character sets, to the point where browsers feel they **have** to guess 
> when the label looks wrong.
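A small sketch of the charset problem: the same bytes read correctly or as garbage depending on whether the recipient trusts the label. The "liberal" strategy shown (try UTF-8 first, then fall back to the label) is a hypothetical heuristic chosen for illustration, not what any particular browser does.

```python
def decode_as_labeled(data, labeled_charset):
    """Strict recipient: decode exactly as the sender's label says."""
    return data.decode(labeled_charset, errors="replace")


def decode_with_guess(data, labeled_charset):
    """Hypothetical liberal recipient: prefer UTF-8 if the bytes are valid
    UTF-8, and only otherwise fall back to the sender's label."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(labeled_charset, errors="replace")


if __name__ == "__main__":
    payload = "caf\u00e9".encode("utf-8")   # UTF-8 bytes for "café"
    wrong_label = "iso-8859-1"              # but the sender labels it Latin-1
    print(decode_as_labeled(payload, wrong_label))
    print(decode_with_guess(payload, wrong_label))
```

Once recipients guess like this, a sender whose label is wrong never sees a failure, which is the cycle the paragraph above describes.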
> 
>  
> 
> *Some additional requirements*
> 
> 
> The specifications for MIME and Internet Media Types, and their design, 
> may have some additional requirements that haven’t been explored well. 
> There are two particularly interesting use cases:
> 
>  
> 
>     * *“Polyglot” documents:*  A ‘polyglot’ document is data which can
>       be treated as two different Internet Media Types, in the case
>       where the meaning of the data is the same. This is part of a
>       transition strategy to allow content providers (senders) to
>       manage, produce, store, and deliver the same data, but with two
>       different labels, and have it work equivalently with two
>       different kinds of receivers (one of which knows one Internet
>       Media Type, and another which knows a second one). This use case
>       was part of the transition strategy from HTML to an XML-based
>       XHTML, and also a way for a single service to offer both
>       HTML-based and XML-based processing (e.g., same content useful for
>       news articles and web pages).
> 
>  
> 
>     * *“Alternate views”:* This use case seems similar but it’s quite
>       different. This is the use case where the same data has very
>       different meaning when served as two different content-types, but
>       that difference is intentional; for example, the same data served
>       as text/html is a document, and served as an RDFa type is some
>       specific data. (not sure what to call these).
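To make the polyglot case concrete, here is a minimal sketch in which one byte string is handed to both an XML parser and an HTML parser and yields the same paragraph text. The sample document and helper names are hypothetical, and real polyglot markup has many more constraints than this small example exercises.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# One byte string, intended to mean the same thing under either label.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>News</title></head>"
    "<body><p>Same bytes, two labels.</p></body></html>"
).encode("utf-8")

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def text_via_xml(data):
    """Treat the data as XML (as an XHTML-aware receiver would)."""
    root = ET.fromstring(data)
    paragraph = root.find(f"{XHTML_NS}body/{XHTML_NS}p")
    return paragraph.text


class _ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def text_via_html(data):
    """Treat the same data as HTML (as a text/html receiver would)."""
    parser = _ParagraphText()
    parser.feed(data.decode("utf-8"))
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(text_via_xml(DOC))
    print(text_via_html(DOC))
```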
> 
>  
> 
> *Some additional things people would like to do that are harder*
> 
> 
> /(want to expand these later, park desiderata here):/
> 
> 
>     * distinguish different versions with different headers or parameters
>     * content negotiation
>     * knowing the type of something isn’t something you can handle
>       before you ask for it
> 
>  
> 
> *Relationship of Internet Media Type and internal version indicators*
> 
> 
> /(need to expand this)/
> 
>  
> 
> The notion of an “Internet Media Type” is very coarse-grained. In 
> general, for example, languages and formats evolve over time, and in 
> many cases, the evolution might involve having different kinds of 
> processors, or needing to know not only the general “Media Type” but the 
> specific version. The general approach to this has been that the actual 
> Media Type includes provisions for version indicator(s)  embedded in the 
> content itself to determine more precisely the nature of how the data is 
> to be interpreted.  That is, the message itself contains further 
> information.  
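One familiar instance of an embedded version indicator is the PDF header line, which begins with `%PDF-` followed by a version number. The sketch below reads it; the function name is hypothetical, and the example says nothing about how a processor should act on the version it finds.

```python
import re

# A PDF file starts with a header such as "%PDF-1.4".
_PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data):
    """Return (major, minor) from the embedded header, or None if absent."""
    match = _PDF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(pdf_version(b"%PDF-1.4\n%more content follows"))
    print(pdf_version(b"<html></html>"))
```

The media type label (application/pdf) stays the same across versions; only the content itself says which version it is.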
> 
>  
> 
> Unfortunately, lots has gone wrong in this scenario as well: processors 
> that ignore version indicators encourage content creators to supply 
> incorrect version indicators.
> 
>  
> 
>  
> 
> *Fragment identifiers*
> 
> 
> The web added this notion of being able to address part of some content 
> rather than the whole by adding a ‘fragment identifier’ to the URL that 
> addressed the data. Of course, this originally made sense for the 
> original web with just HTML, but how would it apply to other content? 
> The URL spec glibly noted that “the definition of the fragment 
> identifier meaning depends on the MIME type”, but unfortunately, few of 
> the MIME type definitions included this information, and practices 
> diverged greatly.
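A sketch of the dependency described above: the fragment is split off the URL, and what it means is looked up by media type. The lookup table and function names are hypothetical and purely illustrative; the point is only that a type with no entry leaves the fragment's meaning undefined.

```python
from urllib.parse import urldefrag

# Hypothetical, illustrative table: what a fragment means per media type.
FRAGMENT_MEANING = {
    "text/html": "names an element within the document",
}


def split_fragment(url):
    """Return (URL without fragment, fragment identifier)."""
    base, fragment = urldefrag(url)
    return base, fragment


def describe_fragment(url, media_type):
    """Say how the fragment would be interpreted for the given media type."""
    _base, fragment = split_fragment(url)
    if not fragment:
        return "no fragment"
    meaning = FRAGMENT_MEANING.get(media_type, "meaning not defined for this type")
    return f"'{fragment}': {meaning}"


if __name__ == "__main__":
    example = "http://example.org/doc.html#section2"
    print(split_fragment(example))
    print(describe_fragment(example, "text/html"))
    print(describe_fragment(example, "application/x-example"))
```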
> 
>  
> 
> *Where we need to go*
> 
> 
> In the above story, about MIME and the web, there is nothing about 
> “authoritative” and priorities. Stuff happens. There is no “license” – a 
> content-type header doesn’t give “permission” for the recipient to do 
> anything.
> 
> 
> We need a clear direction on how to make the web more reliable, not 
> less. We need a realistic transition plan from the unreliable web to the 
> more reliable one. Part of this is to encourage senders (web servers) to 
> mean what they say, and encourage recipients (browsers) to give 
> preference to what the senders are sending.
> 
>  
> 
> We should try to create specifications for protocols and best practices 
> that will lead the web to more reliable and secure communication. To 
> this end, we give an overall architectural approach to use of MIME, and 
> then specific specifications, for HTTP clients and servers, Web Browsers 
> in general, proxies and intermediaries. These should encourage behavior 
> which, on the one hand, continues to work with the already deployed 
> infrastructure (of servers, browsers, and intermediaries), but which, 
> if the advice is followed, also improves the operability, reliability 
> and security of the web.
> 
>  
> 
> *Specific recommendations*
> 
>  
> 
> /(I think I want to see if we can get agreement on the background, 
> problem statement and requirements, before sending out any more about 
> possible solutions; however, the following is a partial list of documents 
> that should be reviewed & updated, or new documents written)/
> 
>  
> 
> update MIME / Internet Media Type registration process (IETF BCP)
> 
> possibly URI/IRI scheme registration process (?? fragment identifier use??)
> 
> update TAG finding on authoritative metadata
> 
> new:  MIME and Internet Media Type section to WebArch
> 
> new:  add W3C web architecture material on MIME in HTML to the W3C web site
> 
> update HTML spec on sniffing, versioning, MIME types, charset sniffing
> 
> update WEBAPPS specs (which ones?)
> 
> update sniffing spec
> 
>
Received on Monday, 31 May 2010 02:24:58 UTC