Re: MIME and the Web from Yves Lafon on 2010-06-03 (www-tag@w3.org from June 2010)

From: Yves Lafon <ylafon@w3.org>
Date: Thu, 3 Jun 2010 05:58:23 -0400 (EDT)
To: Larry Masinter <LMM@acm.org>
cc: www-tag@w3.org
Message-ID: <alpine.DEB.1.10.1006030549130.10630@wnl.j3.bet>
On Sat, 29 May 2010, Larry Masinter wrote:

> This is a draft  (for TAG ACTION-425) for discussion at an upcoming
> W3C TAG meeting.  In its current form, I think it may be more suitable
> as a 'note', but I'd like to get TAG (and community, including W3C and
> IETF) agreement on its contents, before going on to update various
> Findings, BCPs and specifications.

Relative to fragment identifiers, there is also an httpbis issue (#43) 
about fragment combination/precedence in redirects [1], which indicates 
that not only do MIME type definitions forget to define how to handle 
fragments, but there are also other unspecified behaviours in 
fragment handling. (See also the thread that was cross-posted to www-tag 
at some point [2] [3].)

Apart from adding this to the current "MIME and the Web" text, is [1] an 
issue the TAG wants to address?
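
To make the question in [1] concrete, here is a minimal Python sketch of one
possible precedence rule. The rule itself is an assumption for illustration,
not something the specs define: a fragment on the Location value wins, and
otherwise the fragment of the originally requested URI is carried over. The
function name and example URLs are hypothetical.

```python
from urllib.parse import urljoin, urldefrag


def redirect_target(request_url, location):
    """Resolve a Location value against the request URL.

    Assumed rule (illustrative only): if Location carries its own fragment,
    keep it; otherwise inherit the fragment of the original request URL.
    """
    target = urljoin(request_url, location)
    target_base, new_fragment = urldefrag(target)
    _, old_fragment = urldefrag(request_url)
    if not new_fragment and old_fragment:
        return target_base + "#" + old_fragment
    return target


if __name__ == "__main__":
    print(redirect_target("http://example.org/a#sec2", "/b"))
    print(redirect_target("http://example.org/a#sec2", "/b#top"))
```

Other rules (dropping the original fragment, or combining both) are equally
easy to write down, which is rather the point of the issue.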

[1] http://trac.tools.ietf.org/wg/httpbis/trac/ticket/43
[2] http://lists.w3.org/Archives/Public/ietf-http-wg/2010JanMar/0256.html
[3] http://lists.w3.org/Archives/Public/www-tag/2010Mar/0040.html

> MIME and the Web
>
>
>
> Origins of MIME
>
>
>
> MIME was invented originally for email, based on general principles of
> 'messaging' as a foundational architecture.   The role of MIME was to
> extend Internet messaging beyond ASCII-only plain text (to other character
> sets, images, rich documents, etc.). The basic architecture of complex
> content messaging is:
>
>
>
> *	Message sent from A to B.
> *	Message includes some data.   Sender A includes standard
> 'headers' telling recipient B enough information that recipient B
> knows how sender  A intends the message to be interpreted.
> *	Recipient B gets the message, interprets the 'headers' for the
> data and uses it as information on how to interpret the data.
>
>
>
> MIME is a "tagging and bagging" specification:
>
> *	 tagging: how to label content so the intent of how the
> content should be interpreted is known
> *	 bagging: how to wrap the content so the label is clear, or,
> if there are multiple parts to a single message, how to combine them.
>
>
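As an illustration of the tagging/bagging split described above, a minimal
sketch using Python's standard email package. The addresses, the file name
and the eight-byte payload are made up for the example; only the library
calls are real.

```python
from email.message import EmailMessage


def build_message():
    """Build a two-part message: each part is tagged, the whole is bagged."""
    msg = EmailMessage()
    msg["From"] = "a@example.org"   # hypothetical sender A
    msg["To"] = "b@example.org"     # hypothetical recipient B
    # Tagging: the text part gets a Content-Type of text/plain.
    msg.set_content("Hello, this is the plain text part.")
    # Bagging: adding a second part turns the message into multipart/mixed,
    # with each part carrying its own Content-Type label.
    msg.add_attachment(b"\x89PNG\r\n\x1a\n", maintype="image",
                       subtype="png", filename="dot.png")
    return msg


if __name__ == "__main__":
    message = build_message()
    print(message.get_content_type())
    for part in message.iter_parts():
        print(" ", part.get_content_type())
```

The recipient learns the sender's intent for each part from the labels alone,
without looking inside the data.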
>
> "MIME types" (renamed "Internet Media Types") were part of the
> labeling, the name space of kinds of things.
>
> The MIME type registry ("Internet Media Type registry") is where
> someone can tell the world what a particular label means, as far as
> the sender's intent.
>
>
>
> Introducing MIME into the web
>
>
>
>    The original "World Wide Web"  didn't have MIME tagging and
> bagging:
>
> *	Everything was HTML (more or less)
> *	HTTP 0.9 assumed that what was transferred could be
>
>
>
> Around then, Gopher (a hyperlink menu system) was quite popular, and knew
> about a couple of 'link types'. I'd been working at Xerox PARC on a
> system for document storage and access that used file types and
> allowed the client to ask for the types of storage it wanted.
>
>
>
> Working on Gopher, and then on WWW, the proposal (around 1991) was
> that Gopher and WWW should use MIME types as the vocabulary for
> talking about file types.
>
>
>
> The result was that HTTP 1.0 included a type label, "content-type",
> following (kind of, with a couple of exceptions) MIME. Later, for
> content negotiation, additional uses of this technology (in 'Accept'
> headers) were also added.
>
>
>
> The differences with MIME were minor (default charset, requirement for
> CRLF in plain text). These minor differences have caused a lot of
> trouble anyway, but that's another story.
>
>
>
> Not quite a good match
>
>
>
> Unfortunately, the use of MIME for the web was a good start, but  the
> web isn't quite messaging:
>
> (a)    messages are generally specifically in response to a request;
> this means you know more about the data before you receive it. In
> particular, the data really does have a 'name' (mainly, the URL used
> to access the data), while in messaging, the messages were anonymous.
>
> (b)   some content isn't really delivered over the net (files on local
> file system), or there is no opportunity for tagging (data delivered
> over FTP) and in those cases, the additional information is crucial.
>
>
>
> At the same time, operating systems were using, and continued to
> evolve to use, different systems to determine the 'type' of something,
> different from the MIME tagging and bagging:
>
>
>
> a)      using 'magic numbers': in many contexts, file types could be
> guessed pretty reliably by looking for headers
>
> b)      Originally Mac OS had a 4 character 'file type' and another 4
> character 'creator code' for file types
>
> c)       Windows evolved to use the "file extension" - 3 letters (and
> then more) at the end of the file name
>
>
>
> This wasn't entirely unanticipated in MIME, e.g., the MIME type
> registry encouraged those registering MIME types to also describe
> 'magic numbers', Mac file type, common file extensions.
>
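A minimal sketch of the 'magic number' and file-extension approaches listed
above, and of comparing their result with a declared label. The two tables
are small illustrative samples, not a registry, and the function names are
hypothetical.

```python
import os

# A few well-known leading byte sequences ('magic numbers').
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

# A few common file extensions (sample only).
EXTENSIONS = {
    ".png": "image/png",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".pdf": "application/pdf",
    ".html": "text/html",
}


def sniff_magic(data):
    """Guess a media type from the first bytes, or return None."""
    for prefix, media_type in MAGIC_NUMBERS:
        if data.startswith(prefix):
            return media_type
    return None


def guess_from_name(filename):
    """Guess a media type from the file extension, or return None."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


def check_label(declared, data, filename=""):
    """Return (declared, guessed, agree) for a labeled piece of content."""
    guessed = sniff_magic(data) or guess_from_name(filename)
    agree = guessed is None or guessed == declared
    return declared, guessed, agree


if __name__ == "__main__":
    print(check_label("text/plain", b"GIF89a...", "picture.txt"))
```

When the label and the guess disagree, a recipient has to pick one, and that
choice is where the reliability problem starts.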
>
>
> The Rules Weren't Quite Followed
>
>
>
> a)      Lots of file types aren't registered (no entry in IANA for
> file types)
>
> b)      For those that are, the registration is incomplete or incorrect
> (people doing registration didn't understand 'magic number')
>
>
>
> Bad things happened:
>
>
>
> a)      Browser implementors would be liberal in what they accepted,
> and use file extension and/or magic number or other 'sniffing'
> techniques to decide file type, without assuming content-label was
> authoritative. This was necessary anyway for files that weren't
> delivered by HTTP.
>
> b)      HTTP server implementors and administrators didn't supply ways
> of easily associating the 'intended' file type label with the file,
> resulting in files frequently being delivered with a label other than
> the one they would have chosen if they'd thought about it, and if
> browsers *had* assumed content-type was authoritative.
>
>
>
> Which of these happened first doesn't quite matter (most likely a,
> then b), but it's a vicious cycle, anyway.
>
>
>
> Result is not good:
>
>
>
> Result, though, is that the web is unreliable, in that servers sending
> responses to browsers don't have a good guarantee that the browser
> won't "sniff" the content and decide to do something other than treat
> it as it is labeled, and browsers receiving content don't have a good
> guarantee that the content isn't mis-labeled, and intermediaries like
> gateways, proxies, caches, and other pieces of the web infrastructure
> don't have a good way of telling what the conversation means.
>
>
>
> This ambiguity and 'sniffing' also applies to packaged content in
> webapps ('bagging' but using ZIP rather than MIME multipart).
>
>
>
> Extensibility, content negotiation
>
>
>
> Adding MIME to the web introduced an enormous path for extensibility
> of the web. The fact that HTTP could reliably transport images allowed
> NCSA to add img to HTML and reliably deliver multiple image types. The
> addition of MIME allowed other document formats (Word, PDF,
> Postscript) and other kinds of hypermedia, as well as applications.
> MIME was an important engine for extensibility in messaging.  Of
> course, extensibility has its own problems. When senders use
> extensions that recipients aren't aware of, or implement incorrectly
> or incompletely, communication often fails.  With messaging, this is
> a serious problem, although most 'rich text' documents are still
> delivered in multiple forms (using multipart/alternative). With the
> web, the idea was to provide 'content negotiation', but basing content
> negotiation solely on Internet Media Types has some serious (fatal)
> drawbacks.
>
>
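A minimal sketch of content negotiation based solely on Internet Media
Types, as discussed above: the server picks among the types it has available
using the q-values of an 'Accept' header. The parsing is deliberately
simplified (it ignores media type parameters other than q), and the function
names are hypothetical.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def specificity(media_range, media_type):
    """3 for an exact match, 2 for type/*, 1 for */*, 0 for no match."""
    if media_range == media_type:
        return 3
    range_type, _, range_subtype = media_range.partition("/")
    if range_subtype == "*" and range_type == media_type.partition("/")[0]:
        return 2
    if media_range == "*/*":
        return 1
    return 0


def choose(accept_header, available):
    """Return the available type with the highest q, or None."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        matches = [(specificity(r, media_type), q) for r, q in ranges]
        matches = [m for m in matches if m[0] > 0]
        if not matches:
            continue
        _, q = max(matches)  # the most specific matching range decides q
        if q > best_q:
            best, best_q = media_type, q
    return best


if __name__ == "__main__":
    print(choose("text/html;q=0.8, application/xhtml+xml, */*;q=0.1",
                 ["text/html", "application/xhtml+xml"]))
```

Note that nothing here says anything about versions, features or quality of
the individual representations, which is the kind of limitation the draft
alludes to.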
>
> The MIME story covers charsets as well
>
>
>
> While the above tale was written about Internet Media Types, the same
> kind of vicious cycle also happened with character set labels:
> mislabeled content happily processed correctly by liberal browsers
> encouraged more and more sites to proliferate text with  mis-labeled
> character sets, to the point where browsers feel they *have* to guess
> the wrong label.
>
>
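A minimal sketch of what guessing around a character set label looks like in
practice. The fallback list is an assumption chosen for the example, not any
browser's actual algorithm, and the function name is hypothetical.

```python
def decode_text(data, declared, fallbacks=("utf-8", "windows-1252")):
    """Try the declared charset first, then each fallback in order.

    Returns (text, charset_used). As a last resort, decodes as UTF-8 with
    replacement characters so that the function always returns something.
    """
    for charset in (declared, *fallbacks):
        try:
            return data.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue
    return data.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    # Labeled utf-8, but the bytes are really windows-1252.
    print(decode_text(b"caf\xe9", "utf-8"))
    # Labeled iso-8859-1, but the bytes are really UTF-8: no error is
    # raised, the text is just silently wrong.
    print(decode_text(b"caf\xc3\xa9", "iso-8859-1"))
```

The second case is the nasty one: a wrong label that still decodes cleanly
gives the recipient no signal at all unless it inspects the content.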
>
> Some additional requirements
>
>
>
> The specifications for MIME and Internet Media Types, and their design,
> may have some additional requirements that haven't been explored well.
> There are two particularly interesting use cases:
>
>
>
> *	 "Polyglot" documents:  A 'polyglot' document is one which is
> some data which can be treated as two different Internet Media Types,
> in the case where the meaning of the data is the same. This is part of
> a transition strategy to allow content providers (senders) to manage,
> produce, store, deliver the same data, but with two different labels,
> and have it work equivalently with two different kinds of receivers
> (one of which knows one Internet Media Type, and another which knows a
> second one.) This use case was part of the transition strategy from
> HTML to an XML-based XHTML, and also as a way of a single service
> offering both HTML-based and XML-based processing (e.g., same content
> useful for news articles and web pages).
>
>
>
> *	"Alternate views": This use case seems similar but it's quite
> different. This is the use case where the same data has very
> different meaning when served as two different content-types, but that
> difference is intentional; for example, the same data served as
> text/html is a document, and served as an RDFa type is some specific
> data. (not sure what to call these).
>
>
>
> Some additional things people would like to do that are harder
>
>
>
> (want to expand these later, park desiderata here):
>
>
>
> *	distinguish different versions with different headers or
> parameters
> *	content negotiation
> *	knowing the type of something isn't something you can handle
> before you ask for it
>
>
>
> Relationship of Internet Media Type and internal version indicators
>
>
>
> (need to expand this)
>
>
>
> The notion of an "Internet Media Type" is very coarse-grained. In
> general, for example, languages and formats evolve over time, and in
> many cases, the evolution might involve having different kinds of
> processors, or needing to know not only the general "Media Type" but
> the specific version. The general approach to this has been that the
> actual Media Type includes provisions for version indicator(s)
> embedded in the content itself to determine more precisely the nature
> of how the data is to be interpreted.  That is, the message itself
> contains further information.
>
>
>
> Unfortunately, lots has gone wrong in this scenario as well:
> processors ignore version indicators, which encourages content
> creators to supply incorrect version indicators.
>
>
>
>
>
> Fragment identifiers
>
>
>
> The web added this notion of being able to address part of a content
> and not the whole content by adding a 'fragment identifier' to the URL
> that addressed the data. Of course, this originally made sense for the
> original web with just HTML, but how would it apply to other content?
> The URL spec glibly noted that "the definition of the fragment
> identifier meaning depends on the MIME type", but unfortunately, few
> of the MIME type definitions included this information, and practices
> diverged greatly.
>
>
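A minimal sketch of fragment interpretation depending on the media type of
the retrieved content. The dispatch below is illustrative only: the HTML
branch treats the fragment as an element identifier, the PDF branch assumes
name=value fragments such as "page=3", and everything else is reported as
undefined. The function name and URLs are hypothetical.

```python
from urllib.parse import urldefrag, parse_qsl


def interpret_fragment(url, media_type):
    """Return (url_without_fragment, (kind, value)) for a given media type."""
    base, fragment = urldefrag(url)
    if not fragment:
        return base, ("none", None)
    if media_type == "text/html":
        # Fragment names an element in the document.
        return base, ("element-id", fragment)
    if media_type == "application/pdf":
        # Assumed name=value syntax, e.g. "page=3".
        return base, ("parameters", dict(parse_qsl(fragment)))
    # No definition known for this media type.
    return base, ("undefined", fragment)


if __name__ == "__main__":
    print(interpret_fragment("http://example.org/doc.html#intro", "text/html"))
    print(interpret_fragment("http://example.org/doc.pdf#page=3", "application/pdf"))
    print(interpret_fragment("http://example.org/pic.png#xyz", "image/png"))
```

The last branch is the common case for types whose definitions say nothing
about fragments.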
>
> Where we need to go
>
>
>
> In the above story, about MIME and the web, there is nothing about
> "authoritative" and priorities. Stuff happens. There is no "license" -
> a content-type header doesn't give "permission" for the recipient to
> do anything.
>
>
>
> We need a clear direction on how to make the web more reliable, not
> less. We need a realistic transition plan from the unreliable web to
> the more reliable one. Part of this is to encourage senders (web
> servers) to mean what they say, and encourage recipients (browsers) to
> give preference to what the senders are sending.
>
>
>
> We should try to create specifications for protocols and best
> practices that will lead the web to more reliable and secure
> communication. To this end, we give an overall architectural approach
> to use of MIME, and then specific specifications, for HTTP clients and
> servers, Web Browsers in general, proxies and intermediaries, which
> encourage behavior which, on the one hand, continues to work with the
> already deployed infrastructure (of servers, browsers, and
> intermediaries), but which advice, if followed, also improves the
> operability, reliability and security of the web.
>
>
>
> Specific recommendations
>
>
>
> (I think I want to see if we can get agreement on the background,
> problem statement and requirements, before sending out any more about
> possible solutions, however the following is a partial list of
> documents that should be reviewed & updated, or new documents written):
>
>
>
> update MIME / Internet Media Type registration process (IETF BCP)
>
> possibly URI/IRI scheme registration process (?? fragment identifier
> use??)
>
> update TAG finding on authoritative metadata
>
> new:  MIME and Internet Media Type section to WebArch
>
> New: Add W3C web architecture material on MIME in HTML to W3C web
> site
>
> update HTML spec on sniffing, versioning, MIME types, charset sniffing
>
> update WEBAPPS specs (which ones?)
>
> update sniffing spec
>
>
>
>

-- 
Baroula que barouleras, au tiéu toujou t'entourneras.

         ~~Yves
Received on Thursday, 3 June 2010 09:58:26 UTC