
Re: Feedback on Internet Media Types and the Web

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 9 Nov 2010 17:11:16 +0200
Cc: "julian.reschke@gmx.de" <julian.reschke@gmx.de>, "www-tag@w3.org WG" <www-tag@w3.org>, Alexey Melnikov <alexey.melnikov@isode.com>
Message-Id: <004783D1-BDA6-4287-A06A-C08E787782E1@iki.fi>
To: Larry Masinter <masinter@adobe.com>
On Nov 8, 2010, at 22:46, Larry Masinter wrote:

> " The document doesn't recount how dysfunctional the MIME type registry has been. "
> 
> 
> I was hoping to be more explicit about the nature of the problems, as well as
> about what the desired state is, rather than just saying it is "dysfunctional".
> MIME has functioned very well for many applications and types; a few cases where
> things have gone awry doesn't merit the term.

Are there any recent success stories with types that aren't vendor types?

> What were the problems with image/svg+xml, image/jp2 and/or video/mp4?

The problem with image/svg+xml is that after a decade of deployment and W3C REC status, the type still isn't in the registry. Even if the IETF experts found something wrong with the type, it would be way too late to stop its deployment, so there's really no point in subjecting it to expert review now.

The problem with image/jp2 and video/mp4 was that they were unregistered at the point where the specs had been published and implementations that assumed these types were being deployed. Again, when implementations are being deployed, it's really too late to review the types, so failure to have the types in the registry at that point was a potential source of confusion, but there was no benefit from deferring the appearance of the types in the registry.

Another case I recall of deployment happening ahead of registration is RELAX NG Compact Syntax: by the time the type registration went through, a different type had already been seen in the wild:
http://code.google.com/p/jing-trang/issues/detail?id=55

Yet another failure of the registry is that text/xsl isn't registered for XSLT.

> Should these be registered even if the requirements for MIME type registration
> weren't met? Or did they meet the requirements but the process dropped the ball?

I don't know exactly what happened with the registration of each of these types. I'm just observing that the outcome was that the system didn't work, in the sense that the registry wasn't the place where a Web author, a Web server administrator or a Web client software developer could go to find the right MIME type for a given format.

> As for image/svg+xml not being used for 'XML' format. I think this is a 3023bis issue?

Do you mean sending gzipped data as image/svg+xml without Content-Encoding: gzip?
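To make the failure mode concrete for readers without RFC 3023bis at hand: when gzip-compressed SVG is served as image/svg+xml with no Content-Encoding: gzip header, the receiver gets bytes that start with the gzip magic number instead of XML. A minimal sketch of detecting that condition; the helper name and payloads are made up for illustration:

```python
def looks_gzipped(payload: bytes) -> bool:
    """gzip streams always start with the two magic bytes 0x1f 0x8b."""
    return payload[:2] == b"\x1f\x8b"

# Gzipped SVG mislabeled as plain image/svg+xml starts with the gzip
# magic number rather than XML text, so the mismatch is detectable in-band.
svgz_payload = b"\x1f\x8b\x08\x00"  # truncated gzip stream, illustrative
svg_payload = b'<?xml version="1.0"?><svg xmlns="http://www.w3.org/2000/svg"/>'

assert looks_gzipped(svgz_payload)
assert not looks_gzipped(svg_payload)
```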

> Is this different from mime-for-EMail vs mime-for-Web?

It seems to me that the failure to register image/svg+xml is a failure of the registration system in general, but it's more relevant for the Web, because SVG isn't being deployed as an email message body type.

> ===================
>> section "3.1.  Differences between email and web delivery" 
>> doesn't elaborate on the CRLF issue.
> 
> 
> As for restrictions on text/* types: for things like signatures for text types,
> it did seem useful to maintain that the "canonical form" of a text/* type used
> CRLF, but that HTTP allowed transport of non-canonical forms, even when email
> didn't.
> 
> At least that's how we decided to "cut the knot" when we looked at this problem
> in the HTTP working group many years ago. Is this not a feasible direction? Does
> the restriction on using CRLF need to be removed in other contexts too?

I think the CRLF restriction doesn't make sense for any context that can transfer arbitrary byte data and uses MIME for labeling, so it would make sense to have a special rule about CRLF only for email when not base64-encoded.
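As a sketch of what an email-only CRLF rule would mean in practice: a sender would canonicalize line endings to CRLF only when preparing a non-base64 text/* body for email transport, and would leave the bytes alone in contexts that can carry arbitrary byte data. A hypothetical helper for the canonicalization step:

```python
import re

def to_mime_canonical_form(text_body: str) -> str:
    """Normalize any line-ending convention (LF, CR, or CRLF) to CRLF,
    the canonical form expected of text/* bodies in email transport.
    CRLF must be matched first so existing pairs aren't doubled."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text_body)

assert to_mime_canonical_form("a\nb\rc\r\nd") == "a\r\nb\r\nc\r\nd"
```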

> ================
>> Section "4.1.  There are related problems with charsets" doesn't 
>> sufficiently rebuke the IETF for the supposed US-ASCII default
>> for text/* types.
> 
> It would be foolish to "rebuke" the IETF for a restriction that made perfect
> sense at the time it was made,

What went wrong was the failure to move ahead with the times.

> and for which there is no compelling case made
> for introducing a backward incompatibility.

If XML's built-in encoding rules don't seem compelling, I really don't know what to say.

> I don't see the reason *not* to use application/* for nearly everything.
> The MIME top level types have really pretty limited utility. Why is it important
> that people use text/* types instead? If the text/ types you're defining don't
> really match the requirements set out for text/ ... why not just use application/?

People think "if there's a text-based format named Foo, the MIME type is text/foo". Requiring application/foo fails the principle of least surprise.

Furthermore, in the cases of text/xml, text/javascript and text/xsl, the text/* types had already been deployed, so telling people to switch to application/* meant asking a lot of people to do rather pointless work that could have been avoided by changing policies so that the deployed types could have been declared correct.

And finally, since text/html and text/css aren't going to switch to application/*, no one is going to be able to perform generic text/* processing as originally envisioned for text/*. Thus, the insistence on application/* is useless in practice anyway.

> =============================
> 
>> The document doesn't sufficiently acknowledge that for most binary file formats 
>> (particularly image files), the "magic number" of the file format is a much more
>> reliable indicator of the format than an out-of-band MIME type,
> 
> First: I'm not sure this is true. I know there are circumstances where the
> content-type label is wrong and sniffing gives the right answer, but there
> are also circumstances where the label is right and sniffing gives the
> wrong answer. So which is more prevalent, really? Do we have more data
> than scattered anecdotes?

It seems rather implausible that there'd be more files that accidentally carry the magic number of an image format, a video format, zip, gzip or PDF than there are mislabeled files in those formats, but I don't have data based on Web crawls followed by manual inspection. It's well known, though, that browsers, in order to be Web-compatible, ignore the image subtype for binary formats and sniff the magic number instead.
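To make the "reliable magic number" point concrete, here is a minimal sniffer over the well-known signatures of the binary formats mentioned above. The function name and the exact signature table are illustrative, not any browser's actual algorithm:

```python
from typing import Optional

# Well-known file signatures ("magic numbers") for some binary formats.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]

def sniff_binary_type(payload: bytes) -> Optional[str]:
    """Return the MIME type implied by the payload's magic number, if any."""
    for magic, mime_type in MAGIC_NUMBERS:
        if payload.startswith(magic):
            return mime_type
    return None

# A PNG mislabeled as image/gif in HTTP headers is still unambiguously
# identifiable as a PNG from its in-band signature:
assert sniff_binary_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8) == "image/png"
```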

> Secondly, I'm not convinced that even if it is true now that the right thing 
> to do is to give up on trying to get explicit MIME type indicators to work. 

I agree that it's now too late to give up on MIME entirely, since we now have types that don't have reliable magic numbers (in particular HTML, XML, CSS and JavaScript). However, if the purpose of the document is to record what went wrong or what could have gone better, I think that specifying magic numbers as the step forward from HTTP 0.9, so that textual types would have been forced to have reliable magic numbers, could have led to a more robust outcome than the one we got.

>> " an architecture that insists on using out-of-band type data and on the
>> out-of-band type data being authoritative has largely been unproductive"
> 
> in what way has it been "unproductive"? 

All the time wasted on MIME labeling failures could have been avoided for formats that have reliable magic numbers.

> logically: those who are closer to the source of the data are more likely
> to know authoritatively about the nature of the data than those who are
> further down the pipeline. 

Indeed. A program writing an image, audio or video file is more likely to get the in-band magic number right than the HTTP server is likely to get the MIME type right.

> I know this topic has been discussed at length under the "authoritative metadata"
> TAG finding. Doesn't it depend at least a bit on whether the MIME labels
> are being applied by HTTP servers (e.g., Apache / IIS) vs. email clients
> sending attachments?

In both cases, the agent doing the labeling knows less than an email agent labeling a message body that it generated itself.

> Part of the problem is that sniffing is implemented inconsistently, and
> when different parties do it differently, there are security and reliability
> problems. But there's little hope of getting convergence on any position
> *other* than "content-type is authoritative", when looking at the broader
> range of Internet applications.

I think there's little hope of converging on "content-type is authoritative" when the broader range of Internet applications includes the Web. I think the approach Adam Barth has advocated makes sense: if you sniff, sniff exactly like this, where "like this" never allows a "safe" type to be escalated to text/html but does allow text/html to be downgraded to a safer type.
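A sketch of just the no-escalation property, with a made-up privilege ranking rather than Adam Barth's actual algorithm:

```python
# Illustrative privilege ranking: higher means more dangerous to sniff *into*,
# since text/html executes script. This is not the real sniffing spec.
PRIVILEGE = {
    "image/png": 0,
    "text/plain": 0,
    "text/html": 2,
}

def reconciled_type(declared: str, sniffed: str) -> str:
    """Accept the sniffed type only if it doesn't escalate privilege."""
    if PRIVILEGE.get(sniffed, 0) > PRIVILEGE.get(declared, 0):
        return declared  # refuse escalation, e.g. image/png -> text/html
    return sniffed       # downgrades, e.g. text/html -> image/png, are fine

assert reconciled_type("image/png", "text/html") == "image/png"  # blocked
assert reconciled_type("text/html", "image/png") == "image/png"  # allowed
```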

>> Section "4.5.  Content Negotiation" doesn't properly acknowledge that
>> content negotiation on axes other than lossless compression (gzip) is
>> mostly a failure on the Web.
> 
> But "user-agent" content negotiation is widespread, common, 
> and quite functional.

"Negotiating" based on the User-Agent header isn't part of the Accept* content negotiation design. As for it being functional, I think it's dangerous for the adoption of standards. To give an example that touches on what I've been working on lately, right now a practice of sites sniffing Firefox and Opera and assuming certain script execution behavior is threatening the convergence of all implementations on one standardized behavior.

> And certainly there are circumstances where
> "Accept:" headers are used and are useful.... otherwise people wouldn't
> send them, right?

Mozilla/Netscape added a non-*/* Accept header back when it was fashionable to do what the standards said was Right, before implementors became as (rightly) critical of standards as they are these days. Once varying text/html vs. application/xhtml+xml became just popular enough that the move couldn't be reverted, WebKit and Opera had to put application/xhtml+xml in their Accept headers, too.

The Accept header of Firefox was pruned a bit recently, but parts of it remain in order to cater to legacy content. It doesn't follow that the Accept header would be "useful" if redesigned with the knowledge now available.

Also note that the Accept header of IE8 doesn't really allow negotiation on any practical axis other than progressive vs. non-progressive JPEG, which no one cares about anymore.
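For readers who haven't looked at the mechanics lately: Accept-based negotiation works by the server comparing q-values across the types it can offer. A simplified parser sketch, ignoring wildcard specificity and all non-q parameters:

```python
def parse_accept(header: str):
    """Parse an Accept header into (media-type, q) pairs, best first.
    Simplified: ignores wildcard specificity and other media-type parameters."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type, q = pieces[0], 1.0  # q defaults to 1.0 when absent
        for param in pieces[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        entries.append((media_type, q))
    return sorted(entries, key=lambda e: e[1], reverse=True)

# text/html (implicit q=1.0) outranks */* at q=0.8:
assert parse_accept("text/html,application/xhtml+xml,*/*;q=0.8")[0][0] == "text/html"
```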

>> Negotiation of HTML vs. XHTML happens but is rare in the
>> big picture and rarely offers true value to users.
> 
> I'm not sure "rarely offers true value" is a motivation for
> making changes, 

I wasn't suggesting making changes here. This was more of a "lessons learned" kind of comment: What didn't really work out as envisioned and in hindsight wouldn't have been worth it.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 9 November 2010 15:11:58 GMT
