RE: Revisiting Authoritative Metadata from Larry Masinter on 2013-03-01 (www-tag@w3.org from March 2013)

From: Larry Masinter <masinter@adobe.com>
Date: Thu, 28 Feb 2013 18:01:48 -0800
To: Robin Berjon <robin@w3.org>
CC: John Kemp <john@jkemp.net>, "Eric J. Bowman" <eric@bisonsystems.net>, Henri Sivonen <hsivonen@iki.fi>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D1E880CA0AA@nambxv01a.corp.adobe.com>
> I think that this conflates two issues: authoritative metadata and
> extensible/distributed type identifiers. I'm not suggesting that we
> throw the baby out with the bathwater: it would well be valuable to
> reuse MIME (or something like it) for magic numbers (or similar
> process). That would keep the system's extensibility while doing away
> with the issues with authoritative metadata.

The two issues _are_ completely intertwined and cannot be
separated. For us to say what it means for metadata to be
"authoritative", we have to define what we think the metadata
means. I think the TAG finding is flawed and could be fixed,
but the fix is not to throw out content-type.

I don't understand how you're proposing to use MIME other than
for its original intent, which is to label content that would
otherwise be ambiguous.

We can categorize some use cases of ambiguity; 'polyglot' is a special 
case where the ambiguity is expected to not have a semantic
effect, but there are lots of situations where the ambiguity
is significant, where the same content can be "sniffed" in different
ways depending on the order of the sniffing tests and the priorities
given to each probable outcome.
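
To make the order dependence concrete, here is a minimal sketch in
Python. The two tests and the sample bytes are hypothetical and are
not taken from any real sniffing specification; they only show that
the same content can get a different label when the tests run in a
different order.

```python
from typing import Callable, Dict, Sequence


def looks_like_html(data: bytes) -> bool:
    # Hypothetical test: any HTML-ish tag anywhere in the body.
    lowered = data.lower()
    return b"<html" in lowered or b"<script" in lowered


def looks_like_plain_text(data: bytes) -> bool:
    # Hypothetical test: the body decodes cleanly as ASCII.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative rule table; real sniffers use many more tests.
RULES: Dict[str, Callable[[bytes], bool]] = {
    "text/html": looks_like_html,
    "text/plain": looks_like_plain_text,
}


def sniff(data: bytes, order: Sequence[str]) -> str:
    # The first matching test wins, so the order decides the result.
    for media_type in order:
        if RULES[media_type](data):
            return media_type
    return "application/octet-stream"


sample = b"Notes on markup: a page starts with <html> and may hold <script> tags.\n"
html_first = sniff(sample, ["text/html", "text/plain"])
text_first = sniff(sample, ["text/plain", "text/html"])
```

Here the sample is a plain-text note that merely mentions markup,
yet the HTML-first ordering labels it as HTML while the text-first
ordering labels it as plain text.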

And there are many well-recorded cases where sniffing has
led to serious security breaches. The "Sniffing" document is not
an attempt to describe sniffing in general, but rather to try
to stop-gap a "single" algorithm. I think that's hopeless, completely
unextensible, and really poor architecture.

Unfortunately, because of the DWIM behavior of HTML parsers,
HTML is ambiguous with almost any text format.


 
> > The fallacy is believing that a given piece of content "is" in a single
> > content-type, when often it is ambiguous.
> 
> Certainly, but that is a problem with identifying typing in general and
> isn't specific to the mechanism used to convey that information.

Could you explain what you mean here? We're talking about
how senders communicate content to receivers in file formats
and negotiate the format.

Now, to limit the ambiguity *does* require browser implementors
to agree to reject content that they might otherwise accept.
This requires a leap of faith on the part of the browser makers,
to agree to actually implement the standard. And for backward
compatibility, no one will agree to break content that currently
works.

So I think it's necessary to tie "reduced sniffing" to some new
feature. I had hoped that could be HTTP 2.0, but 

http://msdn.microsoft.com/en-us/library/ie/gg622941(v=vs.85).aspx

x-content-type-options: nosniff seems like a better idea,
since it can be introduced now and is already proven to be
implementable.

I've never seen an argument against standardizing "nosniff"
other than 'too much trouble'.
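
As a rough illustration of what "reduced sniffing" tied to that
header could look like on the receiving side, here is a small Python
sketch. It is a simplification under stated assumptions, not a
description of what any browser actually does: sniff_body() and
effective_type() are hypothetical names, and the heuristic is
deliberately naive.

```python
from typing import Mapping


def sniff_body(body: bytes) -> str:
    # Hypothetical stand-in for a receiver's sniffing heuristics.
    if b"<html" in body.lower():
        return "text/html"
    return "text/plain"


def effective_type(headers: Mapping[str, str], body: bytes) -> str:
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    # Drop parameters such as "; charset=utf-8" from the declared type.
    declared = normalized.get("content-type", "").split(";")[0].strip().lower()
    nosniff = normalized.get("x-content-type-options", "").lower() == "nosniff"
    if nosniff and declared:
        # Assumption: with nosniff, the sender's label is taken as given.
        return declared
    # Otherwise fall back to guessing from the bytes.
    return sniff_body(body)


body = b"<html><script>alert(1)</script></html>"
labeled = {"Content-Type": "text/plain; charset=utf-8"}
with_nosniff = effective_type({**labeled, "X-Content-Type-Options": "nosniff"}, body)
without_nosniff = effective_type(labeled, body)
```

With the header present, the body stays text/plain as the sender
declared; without it, this sketch's receiver guesses text/html from
the same bytes.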


> > In general, polyglot and generic/specific overlaps of content-type is cause
> > for asserting that sniffing alone is broken for most content-types, because
> > the inventors of the content-type have not allowed for any indication of
> > version/specialization (like not having any HTML version type).
> 
> Well, before looking at technical solutions for conveying additional
> information we have to agree on what information is useful. Versioning,
> for instance, I believe isn't:
> 
>      http://berjon.com/blog/2009/12/xmlbp-naive-versioning.html


"That is fine if the purpose is to die immediately when a given version is not supported (in which case simply changing the namespace would be less verbose and just as effective), but will not produce any useful effect if the intent is to allow processors to work across versions."

This is true only if you think the only workflow that matters is 
sender-to-receiver for the content. But there are many other
workflows and use cases. For example, if you're building an
editor or syntax checker or authoring tool, it is useful to
indicate the _intended_ version (or the specification you were
targeting when you provided the content).

A "version" is just another kind of evolutionary process, a way of
packaging together a set of extensions and changes.

Implementations and source code are versioned all the time, but the
version is not an interface version. Specifications have versions also
(or at least dates and publication series), and
there's value, when you are talking about a spec,
in making sure you're talking about the same version.

However, the version of the spec isn't the version of the
language/interface/API/protocol, because the implementations
don't always match the spec. So if you put a version indicator
in some content, you'll get mismatches.

This is an important deployment problem, and it does reduce
the utility of versions.

MIME avoids trying to identify versions out of band because
it's fragile. MIME types name a family of file formats.

Perhaps they're not totally authoritative because of difficulties
in deployment, but they *are* important to pay attention to!


Larry
--
http://larry.masinter.net
Received on Friday, 1 March 2013 02:02:26 UTC