Re: Approved TAG finding: Authoritative Metadata from noah_mendelsohn@us.ibm.com on 2006-08-08 (www-tag@w3.org from August 2006)

From: <noah_mendelsohn@us.ibm.com>
Date: Tue, 8 Aug 2006 10:41:43 -0400
To: "Anne van Kesteren" <annevk@opera.com>
Cc: www-tag@w3.org, Ian Hickson <ian@hixie.ch>
Message-ID: <OF18F42B46.9FCF25DF-ON852571C4.004B5E6E-852571C4.0050BA93@lotus.com>
This thread has referenced the very interesting blog entry at [1], and 
makes the case that the TAG is off base in pushing the Web community [2] 
quite strongly to give precedence to the HTTP Content-Type header over 
content sniffing, keying on the URI suffix, etc.  I think it's a fair 
concern to raise:  there's a tremendous amount of content being served 
with bogus content types, and at the moment browsers are getting good mileage 
out of doing what their users need, etc.  The claim is that the TAG is 
therefore being unrealistic in promulgating a finding that ignores these 
"facts on the ground".

Writing for myself and not formally for the TAG, let's look at the 
consequences of taking this seemingly more "practical" approach.  Taken to 
its logical conclusion, this line of reasoning seems to yield a Web in 
which content providers have little or no reliable means of labeling the 
intended interpretation of content that they source  -- or more 
specifically, they can label it, but the label is to be either ignored, or 
used only as input to a client-side heuristic.  For this to work, clients 
depend on the actual types (i.e., the set of legal octet sequences in each 
type) being disjoint, in the sense that you'll be able to assign a type to 
any stream by looking at it, or maybe also by looking at the URI from 
which it was sourced.  If two types naturally include the same octet 
sequence, then only one of them will be successfully transmissible to a 
client.  If the start of a text file looks like an XML file, then it's 
XML.  That ultimately puts a constraint on all designers of formats to 
keep their content disjoint, which in principle at least means knowing 
about all the others.   In practice, there is a set of somewhat ad-hoc 
rules that do involve the content type, but in a less clean way than Web 
architecture would suggest.  application/soap+xml is never a movie, but 
text/plain might be;  image/png is a hint that it's graphics, but not 
necessarily PNG.  So, the MIME type is triggering a set of heuristics, but 
the rules for these are determined by evolving common practice rather than 
in any more carefully controlled way. 
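
To make the disjointness problem concrete, here is a minimal sketch in 
Python of a content sniffer.  The function name and its three rules are 
hypothetical, chosen only for illustration, and are not taken from any 
real browser or from the finding.  It shows how a plain-text note that 
merely begins like XML can no longer be delivered as text once the label 
is ignored.

```python
# Hypothetical sniffing rules, for illustration only.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature


def sniff_type(body: bytes) -> str:
    """Guess a media type purely from the leading octets of the body."""
    if body.startswith(PNG_MAGIC):
        return "image/png"
    if body.lstrip().startswith(b"<?xml"):
        return "application/xml"
    return "text/plain"


# A plain-text note about XML, which the author intends to be read as text.
note = b'<?xml version="1.0"?> is how an XML document may begin.'

print(sniff_type(note))  # prints "application/xml"
```

Under these assumed rules the server has no way to say "this really is 
text":  the octets alone decide, so the two types cannot share a prefix.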

Question:  is this a good long term design for the Web?  It seems to mean 
that you can't transmit a bit of text that happens to resemble XML at its 
start, for example.  It's quite fragile in that respect.  What was a nice, 
simple orthogonal text/plain media type (text is any sequence of Unicode 
characters) becomes much more ad hoc (all text sequences except those that 
happen to look like a continually changing set of other things).  So, as 
noted above, you can't design a new format in isolation.  In contrast, the 
finding implies that all you have to do when inventing a new format is to 
find and register a new media type name;  the sniffing approach suggests 
you have to know the details of any other format that might de facto share 
a media type with you.  Furthermore, success in designing a new client 
will be harder to achieve, insofar as it will take a lot of research into 
common practice to figure out just what rules to implement, and debugging 
that new client will presumably be very tricky (as I'm sure vendors of 
modern browsers would attest).  The finding also discusses security 
issues, and so on. 

Speaking for myself and not formally for the TAG, the above seem to me 
like pretty severe drawbacks in the long run, and I think that the TAG is 
doing its job in warning the Web community of them.  Frankly, I think the 
finding has the balance on the other side just about right too, if you 
read it carefully.  In particular, I think it's quite careful in dealing 
with the practical realities as well as the theoretical ideal.  See 
Chapter 4, and especially section 4.3 [3]:

"4.3 Avoiding silent recovery

"As described above, inconsistency between representation data and 
metadata is an error. However, the tendency for some agents to attempt 
silent recovery from such errors is also an error. Silent recovery from 
error perpetuates what could be easily fixed if the resource owner is 
simply informed of that error during their own testing of the resource.

"Good Practice:  Web agents SHOULD have a configuration option that 
enables the display or logging of detected errors.

">>Revealing errors when they occur need not be disruptive of the user 
experience.<< For example, a graphical browser might display a small "bug" 
button in the user interface to indicate a detected error so that an 
interested user (i.e., the resource owner) can select the button, inspect 
the error, and perhaps modify the agent's choice on how to recover from 
that error. Naturally, the appropriate mechanism will be unique to each 
type of receiving agent and application context.

"Some applications of the Web cannot tolerate error. For example, medical 
information systems must be designed so as to detect errors that might 
cause relevant information to be rendered invisible. In general, it is 
better to design Web systems that are capable of fulfilling more stringent 
requirements, even if their default configuration is to be lenient."

I read this as saying:  "You can break the rules when you really need to, 
just not entirely silently.  That doesn't have to be disruptive to your 
users.  Be aware too that your software may be used in critical situations 
(medical systems) where the consequences of a bad inference may be 
particularly severe, and as necessary provide options to ensure that those 
users can get the strict checking they need."
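
One way to picture that reading is the following minimal Python sketch. 
The class, the exception, and the single PNG check are all assumptions 
made for illustration;  the finding does not prescribe any particular 
mechanism.  The agent treats the declared type as authoritative, records 
any inconsistency it detects so that it can be shown to the resource 
owner, and offers a strict option for users who cannot tolerate guessing.

```python
from dataclasses import dataclass, field
from typing import List

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature


class MetadataMismatch(Exception):
    """Raised in strict mode when the body contradicts the declared type."""


@dataclass
class Agent:
    """Hypothetical user agent; 'strict' models the configuration option."""
    strict: bool = False
    errors: List[str] = field(default_factory=list)

    def interpret(self, declared: str, body: bytes) -> str:
        # Only one inconsistency is detected here: PNG octets that are
        # labeled as something other than image/png.
        if body.startswith(PNG_MAGIC) and declared != "image/png":
            message = f"declared {declared!r} but body looks like image/png"
            self.errors.append(message)  # never recover silently
            if self.strict:
                raise MetadataMismatch(message)
            return "image/png"  # lenient recovery, but on the record
        return declared  # otherwise the label is authoritative


agent = Agent()
print(agent.interpret("text/plain", PNG_MAGIC + b"..."))  # prints "image/png"
print(len(agent.errors))                                  # prints 1
```

The recorded errors are what a "bug" button or a log would surface;  with 
strict=True the same mismatch is refused rather than guessed at.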

Crucially, section 4.2 [4] goes to some length to encourage server owners 
to do the right thing, and if they do the pressures to guess on the client 
will gradually be reduced. 

Going back to the blog entry [1] and especially the earlier one at [5], 
there are some aspects I don't quite understand.  On the one hand, it 
quite reasonably says that there are a few specific situations in which 
current practice is so far off from the TAG's finding that we should 
acknowledge that user agents will need to accommodate in practice.  Fair 
enough.  It goes on, though, to suggest that: 

"I think it may be time to retire the Content-Type header, putting to 
sleep the myth that it is in any way authoritative, and instead have 
well-defined content-sniffing rules for Web content."

I'm afraid I just don't get that.  I would think the right answer would 
be:  let's not perpetuate these mistakes as new types spring up on the 
Web.  Let's work hard to get them sourced with proper media types, so that 
we can have a pretty clean Web that scales well, albeit with a few 
historical warts, rather than a free for all in which there's no reliable 
way to establish a new type, or to reliably signal its use from the 
server.  So, the normative rule is:  use Content-Type.  The accommodation 
is:  cheat where already deployed content requires you to.
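
As a rough sketch of that rule and its accommodation, assume a small, 
closed list of legacy labels for which sniffing is tolerated.  The list 
contents and the function name below are hypothetical, not drawn from the 
finding or from any deployed browser.

```python
# Hypothetical closed set of labels that deployed content forces us to
# second-guess; every other type, including any new one, is authoritative.
LEGACY_SNIFFABLE = frozenset({"text/plain"})


def effective_type(declared: str, sniffed: str) -> str:
    """Use Content-Type, cheating only for the listed legacy labels."""
    base = declared.split(";", 1)[0].strip().lower()  # drop parameters
    if base in LEGACY_SNIFFABLE:
        return sniffed
    return base


print(effective_type("application/soap+xml; charset=utf-8", "video/mpeg"))
# prints "application/soap+xml"
print(effective_type("text/plain", "video/mpeg"))
# prints "video/mpeg"
```

Because the list is closed, a newly registered type is never sniffed, so 
its designer does not need to know about any other format.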

Again speaking for myself, I think this is a very good finding, and I'm 
mostly unconvinced by the arguments in this thread that it does not 
appropriately balance practical and theoretical concerns.  I don't mean to 
be inflammatory in saying this;  it's just that in weighing the 
architectural consequences of the two paths I think the finding is right 
to push hard for clear server-side labeling of types.  I do suggest that 
all who are concerned about the practicality of this finding carefully 
read Chapter 4 as well as the earlier ones, as I think it does a good job of 
striking a balance.  In case it's not clear, I played very little role in 
the drafting of the finding, or else I wouldn't feel free to praise it 
quite so unreservedly.  I do like it a lot.

Noah

[1] http://ln.hixie.ch/?start=1154950069&count=1
[2] http://www.w3.org/2001/tag/doc/mime-respect-20060412
[3] http://www.w3.org/2001/tag/doc/mime-respect-20060412#silent-recovery
[4] http://www.w3.org/2001/tag/doc/mime-respect-20060412#reducing-inconsistency
[5] http://ln.hixie.ch/?start=1144794177&count=1


--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Tuesday, 8 August 2006 14:41:56 UTC