- From: Noah Mendelsohn <nrm@arcanedomain.com>
- Date: Tue, 30 Nov 2010 21:58:48 -0500
- To: Henri Sivonen <hsivonen@iki.fi>
- CC: Larry Masinter <LMM@acm.org>, "www-tag@w3.org WG" <www-tag@w3.org>
On 11/8/2010 7:58 AM, Henri Sivonen wrote:

> The document doesn't sufficiently acknowledge that for most binary file
> formats (particularly image files), the "magic number" of the file
> format is a much more reliable indicator of the format than an
> out-of-band MIME type, so an architecture that insists on using
> out-of-band type data and on the out-of-band type data being
> authoritative has largely been unproductive, since the less reliable
> indicator was supposed to be authoritative.

Maybe, but there's a downside of relying on magic numbers that you don't discuss, and I think it's an important one. Unless the designs of all binary formats are coordinated to ensure that no two documents written in different formats will ever have the same "magic bytes", the system is in principle not robust.

As an example from the world of text, consider the sniffing that's done by at least some versions of IE on text/plain. If you serve it the perfectly valid text file with the content:

    <?xml version="1.0"?>
    <animals>
    <dog>Rufus</fish>
    <cat>Kitty</elephant>
    </animals>

IE will not display it (note the intentionally mismatched tags on the 3rd and 4th lines). Why would something like this come up? Let's say you had a bug database, with links to the text of documents that caused problems. This buggy XML, served as text/plain, should render. It doesn't, at least in IE.

In a sense, that <?xml is a magic number, and the wrong thing is happening. IE thinks "surely it's XML"; the bug tracking app knows "on the contrary, I'm serving it as text precisely because I know it's not".

The point is not that this one slightly contrived example breaks; it's that anyone building an app serving text/plain must know the union of all content that would trigger sniffing, and must ensure that no values of its stored data can ever cause such content to be served. That limitation adds both architectural and practical complexity to the system. The system is fragile.
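
To make the fragility concrete, here is a minimal sketch in Python. The rule table and both function names are hypothetical illustrations, not IE's actual sniffing algorithm; the sketch only assumes that a sniffer lets a matching content prefix override the declared type.

```python
# Hypothetical sniffing table: (leading bytes, type the sniffer assumes).
# Real user agents use different and much larger rule sets.
SNIFF_RULES = [
    (b"<?xml", "application/xml"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF89a", "image/gif"),
]


def sniffed_type(body: bytes, declared: str) -> str:
    """Content-dependent: a matching prefix overrides the declared type."""
    for magic, mime in SNIFF_RULES:
        if body.startswith(magic):
            return mime
    return declared


def authoritative_type(body: bytes, declared: str) -> str:
    """Content-independent: the declared Content-Type always wins."""
    return declared


# The bug-database attachment: deliberately malformed XML served as text.
buggy_xml = (
    b'<?xml version="1.0"?>\n'
    b"<animals>\n"
    b"<dog>Rufus</fish>\n"
    b"<cat>Kitty</elephant>\n"
    b"</animals>\n"
)

print(sniffed_type(buggy_xml, "text/plain"))        # sniffer overrides the server
print(authoritative_type(buggy_xml, "text/plain"))  # server's declaration is kept
```

Under the sniffing policy, the serving application has to know every entry in SNIFF_RULES and keep its stored data clear of all of them; under the authoritative policy it needs to know nothing about the receiver's rules.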

Relying on magic numbers for binary formats, except in cases where all the formats are designed together to use a common magic number field, with values known to be disjoint, has similar characteristics to the text example. Keeping the numbers disjoint likely involves a registry, which puts us back nearly where we started. It also, by the way, makes it much harder to deploy legacy formats using HTTP. Using Content-Type avoids all of this content-dependent complexity and fragility.

Noah

Received on Wednesday, 1 December 2010 02:59:19 UTC