Re: Revisiting Authoritative Metadata (was: The failure of Appendix C as a transition technique) from Noah Mendelsohn on 2013-03-04 (www-tag@w3.org from March 2013)

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Mon, 04 Mar 2013 09:38:24 -0500
To: Anne van Kesteren <annevk@annevk.nl>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, Robin Berjon <robin@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <5134B1E0.2060906@arcanedomain.com>

On 3/4/2013 5:32 AM, Anne van Kesteren wrote:
> For new formats though such as WebVTT sniffing
> for a file identifier seems to become the norm as a) it's much easier
> to develop for and b) it's at least as robust as Content-Type.

Why is this an either/or? I think the right way to approach this is:

* For certain families of content such as the ones you discuss, agreement 
can be reached on disjoint in-band markers, typically at the start of the 
streams, that make the format self-identifying within the family.

* Even for these formats, it's appropriate to have an authoritative 
Content-type >identifying the family<. XML is a good example of this: 
application/xml identifies the family; you can determine the particular XML 
document type from the root node (application/blah+xml is possible but 
optional).

* For other sorts of data formats, such in-band marking is either 
impossible or a bad tradeoff. Few of us would wish to put at the start of 
each of our C source files "CPRG" or some such, and it would be incoherent 
to do it in comma-separated values (CSV), etc. This is not just a legacy 
issue. There are formats for which in-band markers are a bad tradeoff.

* For these formats Content-type is >necessary< to reliably convey the 
intended interpretation.

* Postel's Law is to be conservative in what you send, as well as liberal 
in what you consume. Sending a JPEG as text/plain is a bug. Period. 
Rendering it as an image in the browser may be the lesser of the evils 
given widespread buggy content, but doing so should be viewed as an 
accommodation. Given that sniffing is to be done, I have no problem with 
the efforts of the HTML5 community to standardize the rules.
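
To make the XML example above concrete, here is a minimal sketch in Python of 
how a consumer might combine the two signals: the Content-type names the 
family, and the in-band root element then identifies the particular document 
type. The function name xml_document_type and the choice to return a 
(namespace, local name) pair are my own illustrative assumptions, not 
anything taken from a specification.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def xml_document_type(content_type: str, body: bytes) -> Optional[Tuple[Optional[str], str]]:
    """Hypothetical helper: trust the label for the family, then look in-band.

    Returns (namespace, local_name) of the root element when the
    Content-type says the body is in the XML family, else None.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    # The authoritative label decides whether this is XML at all.
    is_xml_family = media_type in ("application/xml", "text/xml") or media_type.endswith("+xml")
    if not is_xml_family:
        return None

    # Within the family, the root node identifies the particular format.
    root = ET.fromstring(body)
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
        return namespace, local_name
    return None, root.tag
```

Note that the body is never inspected unless the label already says "XML": 
the in-band marker refines the authoritative type rather than overriding it.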

Given all of the above, I believe that:

I. Where practical, it may be desirable to coordinate disjoint in-band 
format labeling across a wide range of content. However, we should not 
assume this will always be practical, or that different "families" will 
never have conflicting uses of the same markers.

II. Content-type should remain authoritative, and should be used as 
described above to signal the correct interpretation of content. In cases 
where families of content share disjoint format markers in-band, the 
Content-type can identify the family or the particular format.

III. Serving content with an incorrect Content-type should be viewed as a 
significant violation of the specifications of the Web. Where the type is 
not known to the server, Content-type should not be specified.

IV. Interpretation of content in a manner contrary to the authoritative 
Content-type should be avoided where possible. When necessary to 
accommodate legacy content, as is the case with text/plain today, such 
"sniffing" should be viewed as an ugly work-around to meet practical needs. 
To the extent practical, we should move away from such usage.
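
Points II. through IV. can be read as a small decision procedure for a 
consumer. The following is only a sketch under stated assumptions: the 
marker table, the function names, the legacy_text_plain switch, and the 
application/octet-stream fallback are all hypothetical choices of mine for 
illustration, and this is not the sniffing algorithm being standardized by 
the HTML5 community.

```python
from typing import Optional

# Hypothetical, deliberately tiny table of in-band start-of-stream markers.
MARKERS = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff(body: bytes) -> Optional[str]:
    """Return a media type if the body starts with a known marker."""
    for marker, media_type in MARKERS.items():
        if body.startswith(marker):
            return media_type
    return None


def interpret(content_type: Optional[str], body: bytes,
              legacy_text_plain: bool = False) -> str:
    """Decide how to interpret a response body.

    II.  A supplied Content-type is authoritative.
    III. A server that does not know the type sends none, so only then
         is there no label to contradict.
    IV.  Overriding the label is an opt-in accommodation for legacy
         text/plain, not the default.
    """
    if content_type is None:
        # Assumed fallback when nothing is known; not from the original text.
        return sniff(body) or "application/octet-stream"

    media_type = content_type.split(";", 1)[0].strip().lower()

    if legacy_text_plain and media_type == "text/plain":
        # The ugly work-around: only here may sniffing contradict the label.
        return sniff(body) or media_type

    return media_type
```

With the switch off, a JPEG mislabeled as text/plain is treated as text, 
which exposes the server's bug; with it on, the consumer accommodates the 
bug instead.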

I therefore strongly disagree with Robin's proposal, which is to deprecate 
the notion of authoritative Content-types. I have no problem endorsing 
(I.), which I think is in the spirit of where he wants to go.

Noah

BTW: I wonder whether the time has come for a 
text/yes-its-really-plain-text media type?

Noah

Received on Monday, 4 March 2013 14:39:01 UTC