Feedback on Internet Media Types and the Web

At TPAC, I was assigned an action (http://www.w3.org/2010/11/04-html-wg2-minutes.html#action01) to give feedback on http://tools.ietf.org/id/draft-masinter-mime-web-info-00.html . Here it is:

- -

Section "3.1.  Differences between email and web delivery" doesn't elaborate on the CRLF issue. In email, SMTP line break conventions leak into the payload if the payload hasn't been base64-encoded. For this reason, the text/* subtree, per RFCs, requires CRLF line breaks. However, HTTP works fine with payloads that use other kinds of line breaks (in particular lone LF or lone CR), and Web-oriented text/* types including text/html, text/css and text/javascript are routinely used with non-CRLF line breaks--particularly LF-only line breaks.

Nothing bad comes out of LF-only line breaks in text/* types over HTTP. Yet, guardians of the MIME registry keep insisting that it's wrong to use text/* with non-CRLF line breaks. This leads to counter-intuitive naming for new types, such as application/relax-ng-compact-syntax, and to effort being wasted on trying to get people to use application/* types where a text/* type exists (consider text/xml and text/javascript, the latter of which is marked obsolete in the registry).
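
To illustrate the line break point, here is a minimal Python sketch. The function names are hypothetical and nothing here comes from the draft; it only shows that the CRLF requirement is a transport-side canonicalization that an email gateway could apply, while an HTTP payload can keep whatever line breaks it has.

```python
import re

def line_break_styles(payload):
    """Report which line break conventions occur in a bytes payload."""
    styles = set()
    crlf = payload.count(b"\r\n")
    if crlf:
        styles.add("CRLF")
    # Every CRLF contributes one CR and one LF; any surplus is a lone break.
    if payload.count(b"\n") > crlf:
        styles.add("LF")
    if payload.count(b"\r") > crlf:
        styles.add("CR")
    return styles

def canonicalize_for_email(text):
    """Rewrite lone LF and lone CR as CRLF, the form text/* needs in email."""
    # CRLF is listed first so an existing CRLF is matched as one unit.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

# A typical LF-only stylesheet as served over HTTP.
css = b"body {\n  margin: 0;\n}\n"
print(line_break_styles(css))
print(repr(canonicalize_for_email(css.decode("ascii"))))
```

The HTTP side needs no counterpart to canonicalize_for_email: the body is delivered as a byte sequence and the format's own parser deals with the line breaks.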

- -

Section "4.1.  There are related problems with charsets" doesn't sufficiently rebuke the IETF for the supposed US-ASCII default for text/* types. The de jure rules for text/xml encoding defaults are totally unhelpful (the text/xml US-ASCII default supposedly overrides XML's own, saner UTF-8 default). In general, in practice, the default depends on the format, and it's pointless to pretend that it depends on the text/* tree itself.
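
A small Python sketch of the difference between the two models. The table entries are simplified, illustrative assumptions (real HTML defaults, for instance, vary by locale), and effective_charset is a hypothetical helper, not an API from any library.

```python
# De jure: the text/* tree itself supplies one default for every subtype.
DE_JURE_TEXT_DEFAULT = "us-ascii"

# In practice: the default comes from the format. Illustrative values only.
FORMAT_DEFAULTS = {
    "text/xml": "utf-8",          # XML's own default without BOM or declaration
    "text/html": "windows-1252",  # assumption: a common legacy browser default
}

def effective_charset(media_type, charset_param=None, by_format=True):
    """Pick the encoding for a text/* payload under either model."""
    if charset_param:
        # An explicit charset parameter wins in both models.
        return charset_param.lower()
    if by_format:
        return FORMAT_DEFAULTS.get(media_type.lower(), DE_JURE_TEXT_DEFAULT)
    return DE_JURE_TEXT_DEFAULT

print(effective_charset("text/xml"))                   # format-driven default
print(effective_charset("text/xml", by_format=False))  # tree-driven default
```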

- -

The document doesn't sufficiently acknowledge that for most binary file formats (particularly image files), the "magic number" of the file format is a much more reliable indicator of the format than an out-of-band MIME type. An architecture that insists on using out-of-band type data, and on that data being authoritative, has therefore been largely unproductive: the less reliable indicator was supposed to be the authoritative one. In general, if HTTP didn't have MIME types and all Web formats, including the textual ones, had been designed to have mandatory magic numbers (e.g. if "<?xml" were mandatory for XML, HTML files always started with "<html" and CSS files always started with "@charset"), we might be better off relying exclusively on magic numbers than we are in the current universe with MIME types.
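
As a rough illustration of why the in-band signal is the more reliable one, here is a Python sketch of magic-number detection for a few well-known binary formats. The function name and the choice of formats are assumptions for the example; this is not the sniffing algorithm of any particular browser.

```python
# Leading byte signatures of a few common binary formats.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

def sniff(payload, declared_type):
    """Prefer the in-band signature; fall back to the out-of-band label."""
    for signature, media_type in MAGIC_NUMBERS:
        if payload.startswith(signature):
            return media_type
    return declared_type

# A PNG served with a wrong label is still recognizably a PNG.
mislabeled = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
print(sniff(mislabeled, "text/plain"))
```

Formats without a mandatory signature (most textual ones today) fall through to the declared type, which is exactly the gap the hypothetical "<?xml" / "<html" / "@charset" requirement would have closed.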

- -

The document doesn't recount how dysfunctional the MIME type registry has been. image/svg+xml, image/jp2 and video/mp4 would be appropriate to investigate as case studies. image/svg+xml *still* isn't in the registry even though deployment has been going on for a decade. image/jp2 and video/mp4 appeared in the registry only after Apple had shipped QuickTime 6 that assumed these types.

- -

Section "4.5.  Content Negotiation" doesn't properly acknowledge that content negotiation on axes other than lossless compression (gzip) is mostly a failure on the Web. It makes no sense to negotiate the character encoding, because UTF-8 covers everything. Some sites take Accept-Language as a hint of application UI language but don't really perform negotiation strictly per HTTP. Negotiated translations of the document (non-application-oriented) content of sites are relatively rare, and people are often better off picking a translation manually. Negotiating the file format, e.g. HTML vs. Word vs. PDF, doesn't really happen. People want to make an explicit choice of downloading an MS Office file or a PDF depending on the goals they have at that moment instead of letting software pick a format for them. Negotiation of HTML vs. XHTML happens but is rare in the big picture and rarely offers true value to users.
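
A Python sketch of the "Accept-Language as a hint" pattern, as opposed to strict negotiation. The function names, the supported-language set and the "en" fallback are all hypothetical; the point is only that the site always answers with some UI language and never refuses the request.

```python
def parse_accept_language(header):
    """Return language tags from the header, highest q-value first."""
    prefs = []
    for index, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a tag without an explicit q parameter has weight 1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        if q > 0:
            # index keeps header order among tags with equal weight
            prefs.append((-q, index, tag))
    return [tag for _, _, tag in sorted(prefs)]

def pick_ui_language(header, supported, default="en"):
    """Use the header as a hint; always fall back to a default."""
    for tag in parse_accept_language(header):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]
        if primary in supported:
            return primary
    return default

print(pick_ui_language("fi-FI,fi;q=0.9,en;q=0.8", {"en", "fi"}))
print(pick_ui_language("ja", {"en", "fi"}))
```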

- -

I believe this concludes my action from the meeting.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 8 November 2010 12:59:07 UTC