Re: MIME and the Web from Graham Klyne on 2010-05-31 (www-tag@w3.org from May 2010)

From: Graham Klyne <GK-lists@ninebynine.org>
Date: Mon, 31 May 2010 08:40:38 +0100
To: Larry Masinter <LMM@acm.org>
CC: www-tag@w3.org
Message-ID: <4C0367F6.9010809@ninebynine.org>
Larry Masinter wrote:
 >
 > /(I think I want to see if we can get agreement on the background,
 > problem statement and requirements, before sending out any more about
 > possible solutions, ...

In all respects that matter, I think this is a good articulation of a problem 
area.  In particular, I agree with the framing of the goal as...

 > advice, if followed, also improves the operability, reliability and
 > security of the web.

...

Reading Larry's piece, it's easy to get the impression that using MIME in the 
Web has somehow been a failure.  But this comment of Dan Connolly's from 1992 
(!), picked out by danbri, emphasizes to me what a phenomenal success it has been:

[[
If they all adopt the MIME typing system (and as many other features
from MIME as are appropriate), we can step from global hypertext to
global hypermedia that much easier.
]]

Quite so!

So I'd parenthetically add to Larry's framing of the goal: "builds on the 
success of what has been achieved so far".   I.e. to learn from what worked as 
well as what did not.  I'd say the default position should be that MIME is right 
until proven wrong.

#g
--


Larry Masinter wrote:
> 
> 
> This is a draft  (for TAG ACTION-425) for discussion at an upcoming W3C 
> TAG meeting.  In its current form, I think it may be more suitable as a 
> ‘note’, but I’d like to get TAG (and community, including W3C and IETF) 
> agreement on its contents, before going on to update various Findings, 
> BCPs and specifications.
> 
>  
> 
> 
> *MIME and the Web *
> 
>  
> 
> *Origins of MIME *
> 
>  
> 
> MIME was invented originally for email, based on the general principles 
> and foundational architecture of ‘messaging’.   The role of MIME was to 
> extend Internet messaging beyond ASCII-only plain text (to other character 
> sets, images, rich documents, etc.). The basic architecture of complex 
> content messaging is:
> 
>  
> 
>     * Message sent from A to B.
>     * Message includes some data.   Sender A includes standard ‘headers’
>       telling recipient B enough information that recipient B knows how
>       sender  A intends the message to be interpreted.
>     * Recipient B gets the message, interprets the ‘headers’ for the
>       data and uses it as information on how to interpret the data.
> 
>  
> 
> MIME is a “tagging and bagging” specification:
> 
>     *  tagging: how to label content so the intent of how the content
>       should be interpreted is known
>     *  bagging: how to wrap the content so the label is clear, or, if
>       there are multiple parts to a single message, how to combine them.
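
A minimal sketch of "tagging and bagging", using Python's standard `email` package. The addresses, subject, and attachment bytes are made up for illustration; only the labelling and wrapping behaviour is the point.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a small message with one text part and one image part."""
    msg = EmailMessage()
    # Illustrative header values, not taken from the original note.
    msg["From"] = "a@example.org"
    msg["To"] = "b@example.org"
    msg["Subject"] = "Tagging and bagging"

    # Tagging: the body is labelled text/plain via its Content-Type header.
    msg.set_content("Plain text body")

    # Bagging: adding a second part wraps both parts in multipart/mixed.
    # The bytes are just the PNG signature, standing in for a real image.
    msg.add_attachment(
        b"\x89PNG\r\n\x1a\n",
        maintype="image",
        subtype="png",
        filename="dot.png",
    )
    return msg


msg = build_message()
# Walk the container first, then each part inside it.
labels = [part.get_content_type() for part in msg.walk()]
print(labels)
```

The recipient never has to inspect the bytes to learn the sender's intent: each part carries its own label, and the multipart container says how the parts are combined.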
> 
>  
> 
> “MIME types” (renamed “Internet Media Types”) were part of the labeling, 
> the name space of kinds of things.
> 
> The MIME type registry (“Internet Media Type registry”) is where someone 
> can tell the world what a particular label means, as far as the sender’s 
> intent.
> 
>  
> 
> *Introducing MIME into the web *
> 
>  
> 
>     The original “World Wide Web”  didn’t have MIME tagging and bagging:
> 
>     * Everything was HTML (more or less)
>     * HTTP 0.9 assumed that what was transferred could be
> 
>  
> 
> Around then, Gopher (a hyperlink menu system) was quite popular, and knew 
> about a couple of ‘link types’. I’d been working at Xerox PARC on a 
> system for document storage and access that used file types and allowed 
> the client to ask for the types of storage it wanted.
> 
>  
> 
> Working on Gopher, and then on WWW, the proposal (around 1991) was that 
> Gopher and WWW should use MIME types as the vocabulary for talking about 
> file types.
> 
>  
> 
> The result was that HTTP 1.0 included a type label, “content-type”, 
> following (kind of, with a couple of exceptions) MIME. Later, for 
> content negotiation, additional uses of this technology (in ‘Accept’ 
> headers) were also added.
> 
>  
> 
> The differences with MIME were minor (default charset, requirement for 
> CRLF in plain text). These minor differences have caused a lot of 
> trouble anyway, but that’s another story.
> 
>  
> 
> *Not quite a good match *
> 
>  
> 
> The use of MIME for the web was a good start but, unfortunately, the 
> web isn’t quite messaging:
> 
> (a)    messages are generally sent in response to a specific request; 
> this means you know more about the data before you receive it. In 
> particular, the data really does have a ‘name’ (mainly, the URL used to 
> access the data), while in messaging, the messages were anonymous.
> 
> (b)   some content isn’t really delivered over the net (files on local 
> file system), or there is no opportunity for tagging (data delivered 
> over FTP) and in those cases, the additional information is crucial.
> 
>  
> 
> At the same time, operating systems were using, and continued to evolve 
> to use, different systems to determine the ‘type’ of something, 
> different from the MIME tagging and bagging:
> 
>  
> 
> a)      using ‘magic numbers’: in many contexts, file types could be 
> guessed pretty reliably by looking for headers
> 
> b)      Originally Mac OS had a 4 character ‘file type’ and another 4 
> character ‘creator code’ for file types
> 
> c)       Windows evolved to use the “file extension” – 3 letters (and 
> then more) at the end of the file name
> 
>  
> 
> This wasn’t entirely unanticipated in MIME, e.g., the MIME type registry 
> encouraged those registering MIME types to also describe ‘magic 
> numbers’, Mac file type, common file extensions.
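
A rough sketch of two of the non-MIME typing systems described above, a magic-number check and a file-extension lookup, assuming a tiny hand-picked signature table (the `MAGIC` list and the function names are illustrative, not from any registry or specification). It shows how the two can disagree about the same file.

```python
import mimetypes

# A few well-known leading byte signatures; deliberately incomplete.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def type_from_magic(data: bytes):
    """Guess a media type from the leading bytes, or return None."""
    for prefix, media_type in MAGIC:
        if data.startswith(prefix):
            return media_type
    return None


def type_from_extension(filename: str):
    """Guess a media type from the file name alone, or return None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


def guesses(filename: str, data: bytes):
    """Return (extension guess, magic-number guess) for one file."""
    return type_from_extension(filename), type_from_magic(data)


# A GIF saved under a misleading name: the two systems disagree.
by_extension, by_magic = guesses("picture.png", b"GIF89a" + b"\x00" * 10)
print(by_extension, by_magic)
```

Neither guess is a statement of the sender's intent, which is the gap the Content-Type label was meant to fill.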
> 
>  
> 
> *The Rules Weren’t Quite Followed *
> 
>  
> 
> a)      Lots of file types aren’t registered (no entry in IANA for file 
> types)
> 
> b)      For those that are, the registration is incomplete or incorrect 
> (people doing registration didn’t understand ‘magic number’)
> 
>  
> 
> *Bad things happened: *
> 
>  
> 
> a)      Browser implementors would be liberal in what they accepted, and 
> use file extension and/or magic number or other ‘sniffing’ techniques to 
> decide file type, without assuming content-label was authoritative. This 
> was necessary anyway for files that weren’t delivered by HTTP.
> 
> b)      HTTP server implementors and administrators didn’t supply ways 
> of easily associating the ‘intended’ file type label with the file, 
> resulting in files frequently being delivered with a label other than 
> the one they would have chosen if they’d thought about it, and if 
> browsers **had** assumed content-type was authoritative.
> 
>  
> 
> Which of these happened first doesn’t quite matter (most likely a, then 
> b), but it’s a vicious cycle, anyway.
> 
>  
> 
> *Result is not good: *
> 
>  
> 
> The result, though, is that the web is unreliable, in that servers sending 
> responses to browsers don’t have a good guarantee that the browser won’t 
> “sniff” the content and decide to do something other than treat it as it 
> is labeled, and browsers receiving content don’t have a good guarantee 
> that the content isn’t mis-labeled, and intermediaries like gateways, 
> proxies, caches, and other pieces of the web infrastructure don’t have a 
> good way of telling what the conversation means. 
> 
>  
> 
> This ambiguity and ‘sniffing’ also applies to packaged content in 
> webapps (‘bagging’ but using ZIP rather than MIME multipart).
> 
>  
> 
> *Extensibility, content negotiation *
> 
>  
> 
> Adding MIME to the web introduced an enormous path for extensibility of 
> the web. The fact that HTTP could reliably transport images allowed NCSA 
> to add img to HTML and reliably deliver multiple image types. The 
> addition of MIME allowed other document formats (Word, PDF, Postscript) 
> and other kinds of hypermedia, as well as applications. MIME was an 
> important engine for extensibility in messaging.  Of course, 
> extensibility has its own problems. When senders use extensions 
> recipients aren’t aware of, implement incorrectly or incompletely, then 
> communication often fails.  With messaging, this is a serious problem, 
> although most ‘rich text’ documents are still delivered in multiple 
> forms (using multipart/alternative). With the web, the idea was to 
> provide ‘content negotiation’, but basing content negotiation solely on 
> Internet Media Types has some serious (fatal) drawbacks.
> 
>  
> 
> *The MIME story covers charsets as well *
> 
>  
> 
> While the above tale was written about Internet Media Types, the same 
> kind of vicious cycle also happened with character set labels: 
> mislabeled content happily processed correctly by liberal browsers 
> encouraged more and more sites to proliferate text with  mis-labeled 
> character sets, to the point where browsers feel they **have** to guess 
> the charset rather than trust a possibly wrong label.
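
A small illustration of the charset version of the cycle. The `decode_with_guess` helper is a hypothetical heuristic written for this sketch (try strict UTF-8 first, fall back to the declared label); it is not a description of what any particular browser does.

```python
def decode_with_guess(data: bytes, declared: str) -> str:
    """Decode bytes, overriding the declared charset if UTF-8 fits.

    Assumption for this sketch: non-ASCII bytes that form valid UTF-8
    are unlikely to be anything else, so UTF-8 wins over the label.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(declared, errors="replace")


# UTF-8 bytes on the wire, but labelled as ISO-8859-1 by the server.
wire = "caf\u00e9".encode("utf-8")

# Trusting the wrong label turns one character into two wrong ones.
trusted = wire.decode("iso-8859-1")

# Guessing recovers the intended text, which hides the labelling error
# from the site that made it.
guessed = decode_with_guess(wire, "iso-8859-1")

print(repr(trusted), repr(guessed))
```

The second result is what users want to see, which is why the guessing persists, and also why the mislabelled site never learns that its label is wrong.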
> 
>  
> 
> *Some additional requirements *
> 
> 
> The specifications for MIME and Internet Media Types, and their design, 
> may have some additional requirements that haven’t been explored well. 
> There are two particularly interesting use cases:
> 
>  
> 
>     * * "Polyglot" documents:*  A ‘polyglot’ document is data which
>       can be treated as two different Internet Media Types, in the
>       case where the meaning of the data is the same. This
>       is part of a transition strategy to allow content providers
>       (senders) to manage, produce, store, deliver the same data, but
>       with two different labels, and have it work equivalently with two
>       different kinds of receivers (one of which knows one Internet
>       Media Type, and another which knows a second one.) This use case
>       was part of the transition strategy from HTML to an XML-based
>       XHTML, and also as a way of a single service offering both
>       HTML-based and XML-based processing (e.g., same content useful for
>       news articles and web pages).
> 
>  
> 
>     * *"Alternate views":* This use case seems similar but it’s quite
>       different. This is the use case where the same data has a very
>       different meaning when served as two different content-types, but
>       that difference is intentional; for example, the same data served
>       as text/html is a document, and served as an RDFa type is some
>       specific data. (not sure what to call these).
> 
>  
> 
> *Some additional things people would like to do that are harder *
> 
> 
> /(want to expand these later, park desiderata here): /
> 
> 
>     * distinguish different versions with different headers or parameters
>     * content negotiation
>     * knowing the type of something isn’t something you can handle
>       before you ask for it
> 
>  
> 
> *Relationship of Internet Media Type and internal version indicators *
> 
> 
> /(need to expand this) /
> 
>  
> 
> The notion of an “Internet Media Type” is very coarse-grained. In 
> general, for example, languages and formats evolve over time, and in 
> many cases, the evolution might involve having different kinds of 
> processors, or needing to know not only the general “Media Type” but the 
> specific version. The general approach to this has been that the actual 
> Media Type includes provisions for version indicator(s)  embedded in the 
> content itself to determine more precisely the nature of how the data is 
> to be interpreted.  That is, the message itself contains further 
> information.  
> 
>  
> 
> Unfortunately, lots has gone wrong in this scenario as well – processors 
> ignore version indicators, which encourages content creators to supply 
> incorrect version indicators.
> 
>  
> 
>  
> 
> *Fragment identifiers *
> 
> 
> The web added this notion of being able to address part of some content, 
> rather than the whole, by adding a ‘fragment identifier’ to the URL that 
> addressed the data. Of course, this originally made sense for the 
> original web with just HTML, but how would it apply to other content? 
> The URL spec glibly noted that “the definition of the fragment 
> identifier meaning depends on the MIME type”, but unfortunately, few of 
> the MIME type definitions included this information, and practices 
> diverged greatly.
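
A sketch of why the fragment's meaning depends on the media type, using Python's `urllib.parse.urldefrag`. The `FRAGMENT_RULES` table and `describe_fragment` are hypothetical names for this example, and the table has a single entry on purpose: most types give a client nothing to look up.

```python
from urllib.parse import urldefrag

# Hypothetical per-type rules; only HTML is filled in for this sketch.
FRAGMENT_RULES = {
    "text/html": "names an element (id or anchor) in the document",
}


def describe_fragment(url: str, media_type: str):
    """Split off the fragment and look up how this type interprets it.

    Returns (url without fragment, fragment, rule or None).
    """
    base, fragment = urldefrag(url)
    rule = FRAGMENT_RULES.get(media_type) if fragment else None
    return base, fragment, rule


# The same fragment string against two different representations.
html_case = describe_fragment("http://example.org/doc#intro", "text/html")
other_case = describe_fragment(
    "http://example.org/doc#intro", "application/octet-stream"
)
print(html_case)
print(other_case)
```

The URL splits the same way in both cases; what differs is whether the type's definition tells the client what to do with `intro`.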
> 
>  
> 
> *Where we need to go *
> 
> 
> In the above story, about MIME and the web, there is nothing about 
> “authoritative” and priorities. Stuff happens. There is no “license” – a 
> content-type header doesn’t give “permission” for the recipient to do 
> anything.
> 
> 
> We need a clear direction on how to make the web more reliable, not 
> less. We need a realistic transition plan from the unreliable web to the 
> more reliable one. Part of this is to encourage senders (web servers) to 
> mean what they say, and encourage recipients (browsers) to give 
> preference to what the senders are sending.
> 
>  
> 
> We should try to create specifications for protocols and best practices 
> that will lead the web to more reliable and secure communication. To 
> this end, we give an overall architectural approach to use of MIME, and 
> then specific specifications, for HTTP clients and servers, Web Browsers 
> in general, proxies and intermediaries, which encourage behavior which, 
> on the one hand, continues to work with the already deployed 
> infrastructure (of servers, browsers, and intermediaries), but which 
> advice, if followed, also improves the operability, reliability and 
> security of the web.
> 
>  
> 
> *Specific recommendations *
> 
>  
> 
> /(I think I want to see if we can get agreement on the background, 
> problem statement and requirements, before sending out any more about 
> possible solutions, however the following is a partial list of documents 
> that should be reviewed & updated, or new documents written) /
> 
>  
> 
> update MIME / Internet Media Type registration process (IETF BCP)
> 
> possibly URI/IRI scheme registration process (?? fragment identifier use??)
> 
> update TAG finding on authoritative metadata
> 
> new:  MIME and Internet Media Type section to WebArch
> 
> new:  add W3C web architecture material on MIME in HTML to the W3C web site
> 
> update HTML spec on sniffing, versioning, MIME types, charset sniffing
> 
> update WEBAPPS specs (which ones?)
> 
> update sniffing spec
> 
>
Received on Monday, 31 May 2010 07:41:54 UTC