Re: Revisiting Authoritative Metadata (was: The failure of Appendix C as a transition technique) from David Sheets on 2013-02-26 (www-tag@w3.org from February 2013)

From: David Sheets <kosmo.zb@gmail.com>
Date: Mon, 25 Feb 2013 20:14:58 -0800
To: Robin Berjon <robin@w3.org>
Cc: Larry Masinter <masinter@adobe.com>, Henri Sivonen <hsivonen@iki.fi>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CAAWM5TxQS8FxbbEH9Y4HanW0nGUevFNmLdq2ZW_Xd7qBBTKbJQ@mail.gmail.com>

On Mon, Feb 25, 2013 at 4:25 AM, Robin Berjon <robin@w3.org> wrote:
> On 22/02/2013 12:36 , David Sheets wrote:
>>
>> On Fri, Feb 22, 2013 at 1:22 AM, Robin Berjon <robin@w3.org> wrote:
>>>
>>> I would support the TAG revisiting the topic of Authoritative Metadata,
>>> but
>>> with a view on pointing out that it is an architectural antipattern.
>>> Information that is essential and authoritative about the processing of a
>>> payload should be part of the payload and not external to it. Anything
>>> else
>>> is brittle and leads to breakage.
>>
>>
>> HTTP is a text protocol. HTTP messages are of type application/http or
>> message/http. HTTP messages include metadata regarding the properties
>> of the representation of the served resource including the media type
>> of the envelope's contents, the version of the representation, and
>> information about when the message expires.
>
> Right. But that information gets lost when you save the payload. People
> don't go about saving HTTP messages to disk, they save the payload.

I believe you mean end user agents save the payload. It is quite
common for automated consumers to persist transactional metadata.

> What one cares about transmitting most of the time is the payload, not the message.

The message is the standard way to deliver the payload. If you just
want to send the payload, use netcat. Or use HTTP without a
Content-Type header.

> The message is just a by-product of the fact that you want to use this protocol.

The information doesn't *have* to be destroyed when you *save*.

>> How is telling the client everything it needs to know about processing
>> and storage external to the message? It's in the message.
>
> But the message isn't the interesting unit of analysis, the payload is.

The message is the fundamental unit of analysis of the protocol. The
payload contains only the requested content but not necessarily data
about the publication of the content.

> Introducing the ability for the message and payload to be out of sync is to
> introduce fragility and an attack vector.

Throwing away important information introduces fragility and an attack vector.

There are many, many cases where protocol and payload can disagree
(e.g. Total Length field of IP headers).
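To illustrate the IP example, here is a minimal Python sketch that
compares the Total Length field of an IPv4 header with the number of
bytes actually received. The helper names and the sample packet are
hypothetical, and the header checksum is left at zero and not
validated.

```python
import struct

def ipv4_total_length(packet: bytes) -> int:
    """Return the Total Length field (bytes 2-3, big-endian) of an IPv4 header."""
    if len(packet) < 20:
        raise ValueError("shorter than a minimal IPv4 header")
    (total_length,) = struct.unpack_from("!H", packet, 2)
    return total_length

def length_agrees(packet: bytes) -> bool:
    """True when the header's claim matches the bytes we actually hold."""
    return ipv4_total_length(packet) == len(packet)

# Hypothetical 20-byte header (version 4, IHL 5) claiming 24 bytes in total.
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 24, 0, 0, 64, 6, 0,
    bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
)
packet = header + b"data"   # 24 bytes: header and payload agree
truncated = packet[:22]     # 22 bytes: header still claims 24
```

Here length_agrees(packet) holds and length_agrees(truncated) does
not: the disagreement is detectable precisely because the header
states its claim.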

Furthermore, there are known mitigations to both "fragility" and
"attack vector" issues. Is deprecating useful components of HTTP
really the right solution?

>> How is this an antipattern? It's very standard and very unambiguous.
>
> Something can be standard and unambiguous and yet still a bad idea.

Between

"Here's how you describe your content if you want."

and

"Sometimes people lie so don't bother telling the truth. Telling the
truth is deprecated."

It seems, to me, that the second is a bad idea. It is strictly worse
than the present situation which at least gives publishers the
facility to unambiguously indicate their intent.

>>> The sniffing behaviour is a consequence of media types as an
>>> architectural
>>> construct, not an alternative to it.
>>
>>
>> Sniffing is brittle and leads to breakage as included metadata
>> regarding how to interpret the payload is ignored.
>>
>> The sniffing behaviour is a consequence of an attitude of Big Browser
>> Knows Best regarding media types.
>
> No it's not. Sniffing is a direct consequence of authoritative metadata.

No, it's not. Sniffing ability is a direct consequence of the optional
nature of Content-Type. General-purpose user agents have to have the
ability to sniff because they may not be presented with ANY metadata.
That this heuristic is also useful when publishers lie to you is not a
reason to silently disregard the sender's intent, "sniff", and present
the result to an unaware user. Fundamental interpretation errors and
subsequent heuristic correction should be surfaced. That this is NOT
the present behavior indicates, to me, an attitude of Browser Knows
Best.

> It's certainly not something that's limited to browsers. I have written
> plenty of tools over the years that ignore the media type just because you
> have to: it's wrong. RSS shipped as a bewildering array of wrong media types
> or JSON shipped (typically) as text/html are just the more prominent
> examples.

That's fine. Sometimes people lie. When we develop applications that
speak HTTP, we must be aware that the other side may lie. As
developers, we can specifically handle these cases for our application
as we are in full control.

The problem arises with the combination of "I, Browser, know better
than BOTH my user and the transmitting party." and "Because I clearly
know best, neither producer nor consumer should *attempt* to indicate
any intent."

Yes, there are problems with the present system. Your suggestion of
"everyone should just plan for ambiguous sniffing and we should
deprecate declarative intent" appears to be removing publisher choice
without any proposed replacement.

How does it then follow that the recommendation of the W3C should
change from "don't lie" to "it doesn't matter if you lie or tell the
truth"?

> No one uses sniffing because they find it fun. Sniffing is there because
> media types, as a technical and social construct, are inherently brittle.

Once again, correcting incorrect media types is only one application
of these heuristics. Synthesizing a media type is another application.
Just because you have a function

sniff : blob -> media_type

doesn't mean that media_type is worthless and should be globally
replaced by blob. By advocating for global replacement, you are
advocating for the institutionalization of the ambiguous status quo
instead of allowing multiple methods to co-exist with clear indication
of which method is unambiguous.
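A minimal Python sketch of that co-existence, assuming a deliberately
tiny signature table (real sniffers apply many more rules). The names
sniff and media_type_for are illustrative only; the point is that the
heuristic synthesizes a type when none was declared and does not
replace a declared one.

```python
from typing import Optional

# Hypothetical, deliberately small table of leading byte signatures.
_MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff(blob: bytes) -> str:
    """sniff : blob -> media_type, by matching leading bytes."""
    for prefix, media_type in _MAGIC:
        if blob.startswith(prefix):
            return media_type
    return "application/octet-stream"

def media_type_for(declared: Optional[str], blob: bytes) -> str:
    """Use the declared type when present; synthesize one only when absent."""
    if declared:
        return declared
    return sniff(blob)
```

With this split, a missing Content-Type falls back to sniff, while a
declared type remains the unambiguous statement of intent.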

>> The alternative to this behaviour is respecting the media type as
>> transmitted.
>
> And how does that help anyone when it's wrong?

You can respect the media type as transmitted by telling the user that
you have taken the liberty of interpreting the content in a way that
was not indicated by the producer. This helps both the user and the
publisher understand what has occurred. If the publisher is
transmitting messages with extremely wrong types, this is quite
suspicious activity and everyone involved should know.
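As a rough sketch of such surfacing in Python, assuming the agent has
already decided which type it will use: the function below only
produces the notice to show. The names essence and override_notice and
the wording of the messages are hypothetical.

```python
from typing import Optional

def essence(media_type: str) -> str:
    """Drop parameters such as charset and normalize case."""
    return media_type.split(";", 1)[0].strip().lower()

def override_notice(declared: Optional[str], used: str) -> Optional[str]:
    """Return text to surface to the user, or None when nothing was overridden."""
    if declared is None:
        return f"No media type was declared; content was interpreted as {used}."
    if essence(declared) == essence(used):
        return None
    return (
        f"Publisher declared {essence(declared)}, "
        f"but content was interpreted as {essence(used)}."
    )
```

A declared text/html that is rendered as text/html yields no notice; a
declared text/html that is handled as application/json yields a
message both the user and the publisher can act on.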

>> How is sniffing a consequence of following the protocol?
>
> Two primary aspects contribute to this:
>
> • The information essential to the processing of the payload is made to be
> volatile, such that in many if not most cases it exists only during
> transmission but not before or after. In some cases, it can in fact be
> difficult to keep it (the typical case being maintaining content coding and
> media type while storing in a regular file system). This volatility leads to
> information loss and errors. When two pieces of information can easily go
> out of synch, they will.

I concur. HTTP semantics are a superset of filesystem semantics. This
is a difficult tension to resolve but necessary due to HTTP's more
general use cases. In situations like this it seems prudent to try to
achieve harmony while maintaining semantics rather than cutting HTTP
down to fit into a world of desktop file systems.

At least two mitigations exist:

1. Browsers have access to persistence for metadata.
2. Saving payloads that would be sniffed as a type that they weren't
interpreted as should trigger user notification.

I believe some operating systems include warnings about changing file
extensions.
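The first mitigation can be sketched in Python as a sidecar file that
keeps selected response headers next to the saved payload. This is one
possible approach under stated assumptions, not how any particular
browser works; the ".meta.json" suffix and both function names are
hypothetical.

```python
import json
from pathlib import Path

def _sidecar(path: Path) -> Path:
    # Hypothetical naming convention: "<file>.meta.json" beside the payload.
    return path.with_name(path.name + ".meta.json")

def save_with_metadata(path: Path, payload: bytes, headers: dict) -> Path:
    """Write the payload, and keep its transmitted metadata beside it."""
    path.write_bytes(payload)
    sidecar = _sidecar(path)
    sidecar.write_text(json.dumps(headers, indent=2), encoding="utf-8")
    return sidecar

def load_metadata(path: Path) -> dict:
    """Return the saved headers, or an empty dict when none were kept."""
    sidecar = _sidecar(path)
    if not sidecar.exists():
        return {}
    return json.loads(sidecar.read_text(encoding="utf-8"))
```

Saving the media type this way means the information survives the
trip to a regular file system instead of existing only during
transmission.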

> • The cost of error is borne by the receiver, not the sender. In any such
> system you are guaranteed to see receivers perform error correction, and
> those will dominate over time (simply by virtue of being better for their
> users).

That's fine. Error correction is very important.

It doesn't follow that declaring intent should be deemed an
antipattern. It certainly doesn't follow that error correction should
be *assumed*. I believe this argument suffers from the fallacy of
appeal to the market.

>>> Further, I think that the TAG should take this occasion to issue a
>>> recommendation to people building formats that they include format
>>> identifying information as essential, typically with a magic number,
>>> first
>>> non-blank line, etc.
>>
>> What occasion would that be?
>
> The aforementioned revisiting of this issue.

"On the occasion of my raising the issue, the issue should be settled." ?

I add, here, that application/json does not guarantee dictionary
ordering nor supply any higher-level namespace mechanism. How do you
suggest JSON messages be transmitted in this New World? Specifier A
uses application/a+json with a top-level dictionary with magic key,
"a", and an open namespace. Specifier B uses application/b+json with a
top-level dictionary with magic key, "b", and an open namespace.
Should {"a": "1.0.0", "b": "1.0.0", "execute": "..."} be interpreted
as application/a+json or application/b+json?
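A short Python sketch of the ambiguity, assuming each specifier's
in-band identification is simply the presence of its magic key at the
top level. MAGIC_KEYS and candidate_types are hypothetical names.

```python
import json

# Hypothetical in-band rules: a top-level magic key identifies the format.
MAGIC_KEYS = {"a": "application/a+json", "b": "application/b+json"}

def candidate_types(text: str) -> list:
    """Return every media type whose magic key appears in the document."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        return []
    return sorted(mt for key, mt in MAGIC_KEYS.items() if key in doc)

message = '{"a": "1.0.0", "b": "1.0.0", "execute": "..."}'
matches = candidate_types(message)
```

Both types match, and because JSON object members are unordered there
is no "first key" to break the tie. An out-of-band Content-Type would
settle it.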

>> Here's how you can tell you are receiving an HTTP message:
>> <http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1>
>
> Yeah, but transmission is just a small part of the data lifecycle.

Yeah, but receipt from a remote publishing authority is the most
salient indicator of intended interpretation. Instead of
(unsuccessfully) convincing the world to adopt consistent and useful
magic numbers for every content type, why not standardize this in a
common protocol that carries any type of content?
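For reference, recognizing an HTTP response by its Status-Line (the
section of RFC 2616 linked above) takes only a few lines. This is a
minimal Python sketch; the function name is hypothetical and it does
not parse the headers that follow.

```python
import re
from typing import Optional, Tuple

# Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
STATUS_LINE = re.compile(rb"^HTTP/(\d+)\.(\d+) (\d{3}) ([^\r\n]*)\r\n")

def parse_status_line(data: bytes) -> Optional[Tuple[str, int, str]]:
    """Return (version, status code, reason phrase), or None if not HTTP."""
    m = STATUS_LINE.match(data)
    if m is None:
        return None
    version = f"{int(m.group(1))}.{int(m.group(2))}"
    return version, int(m.group(3)), m.group(4).decode("latin-1")
```

One recognizer of this kind covers every content type the protocol
carries, whereas magic numbers have to be agreed on format by format.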

It doesn't have to be either/or.

Regards,

David
Received on Tuesday, 26 February 2013 04:15:27 UTC