
Re: review of content type rules by IETF/HTTP community

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 23 Aug 2007 21:12:36 +0000 (UTC)
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: public-html@w3.org
Message-ID: <Pine.LNX.4.64.0708232030530.22616@dhalsim.dreamhost.com>

On Wed, 22 Aug 2007, Roy T. Fielding wrote:
> On Aug 21, 2007, at 11:16 PM, Ian Hickson wrote:
> > On Tue, 21 Aug 2007, Roy T. Fielding wrote:
> > > > 
> > > > I entirely agree with the above and in fact with most of what you
> > > > wrote, but just for the record, could you list some of the Web-based
> > > > but just for the record, could you list some of the Web-based
> > > > functionality you mention that depends on Content-Type to control
> > > > behaviour?
> > > 
> > > Mostly intermediaries, spiders, and the analysis engines that perform
> > > operations on the results of both.
> > 
> > Could you name any specific intermediaries, spiders, and analysis engines
> > that honour Content-Type headers with no sniffing? I would really like to
> > study some of these in more detail.
> 
> MOMspider

MOMspider only looks at HTML files. For pages that have Content-Type 
headers that aren't unknown/unknown or application/unknown, it does the 
same as the spec (except that it doesn't look for feeds). For pages that 
_don't_ have Content-Type information, MOMspider could benefit greatly 
from following the HTML5 spec, as currently it uses a heuristic based on 
the file extension rather than checking the file's content (the latter 
being a far more reliable indicator of file type).
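To make the contrast concrete, here is an illustrative sketch (in Python, and emphatically not MOMspider's actual code) of guessing a type from a file's first bytes, in the spirit of the HTML5 draft's signature table, rather than from its extension:

```python
# Illustrative sketch (not MOMspider's code): guess a MIME type from the
# resource's first bytes, as the HTML5 draft does, instead of its extension.
SIGNATURES = [
    (b'%PDF-', 'application/pdf'),
    (b'GIF87a', 'image/gif'),
    (b'GIF89a', 'image/gif'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'\xff\xd8\xff', 'image/jpeg'),
]

def sniff(first_bytes: bytes) -> str:
    """Return a guessed MIME type for a resource with no Content-Type."""
    for magic, mime in SIGNATURES:
        if first_bytes.startswith(magic):
            return mime
    # A leading <html> or <!DOCTYPE html tag suggests an HTML document.
    stripped = first_bytes.lstrip().lower()
    if stripped.startswith((b'<!doctype html', b'<html')):
        return 'text/html'
    return 'application/octet-stream'
```

The real draft algorithm has a longer table and more cases; the point is only that the input is the content itself, not the file name.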


> W3CRobot

W3CRobot has quite an extensive content-type sniffing algorithm; for 
security (to prevent different tools from sniffing the same content as 
different types), it would be good if all tools that do this used the same 
algorithm, such as the one in the HTML5 draft. (I couldn't actually work 
out what part of W3CRobot consumed the content, and therefore couldn't 
verify that it really does honour the MIME type any more than browsers 
do. Can you give me a pointer?)
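To spell out the security point, here is a hypothetical pair of classifiers (my own toy sketch, not taken from any real tool) disagreeing about the same bytes: a filter that trusts the header sees inert text, while a sniffing browser sees active HTML:

```python
# Hypothetical sketch of the smuggling risk when tools sniff differently:
# the same bytes, served as text/plain, are classified two different ways.
payload = b'<html><script>do_something_evil()</script></html>'

def filter_classify(declared_type: str, body: bytes) -> str:
    # A content filter that honours the Content-Type header verbatim.
    return declared_type

def browser_classify(declared_type: str, body: bytes) -> str:
    # A browser-style sniffer that promotes HTML-looking text/plain to HTML.
    if declared_type == 'text/plain' and body.lstrip().lower().startswith(b'<html'):
        return 'text/html'
    return declared_type

# The filter waves the payload through as harmless text; the browser runs it.
assert filter_classify('text/plain', payload) != browser_classify('text/plain', payload)
```

If both sides ran the same algorithm, this gap would close, which is the argument for a single shared sniffing spec.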


> several hundred scripts based on libwww-perl or LWP

I use these myself, and my experience is that unless you go out of your 
way to check MIME types, which most authors do not, they do not honour 
Content-Types. Do you have any examples of tools that use these that do 
honour Content-Type headers?
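The check such scripts would need is tiny, which is partly why its absence goes unnoticed. A sketch of it (in Python rather than Perl, purely for illustration) might be:

```python
# Illustrative sketch: refuse to treat a response as HTML unless the
# server actually said it was HTML. Most ad-hoc scraping scripts skip this.
from email.message import Message

def is_html(content_type_header: str) -> bool:
    # Parse the header properly, ignoring parameters such as charset.
    msg = Message()
    msg['Content-Type'] = content_type_header
    return msg.get_content_type() == 'text/html'
```

The subtlety is that a naive string comparison against "text/html" breaks on parameters like "; charset=utf-8", so even the scripts that do check often check wrongly.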


> PhpDig

This, again, could benefit from the algorithm in the spec as it has a 
simple default of HTML when the header is absent. In general, though, this 
code actually follows the HTML5 spec pretty closely; the only things it 
doesn't do are checking that the text data it received isn't actually 
binary, and trying to detect Atom or RSS feeds when it thinks it has HTML, 
neither of which really matter for this spider.
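For completeness, the two omitted checks are both simple; roughly (my own paraphrase of the draft's intent, not PhpDig code):

```python
# Rough paraphrase of the two checks PhpDig omits (not its actual code).

def looks_binary(data: bytes) -> bool:
    """Control bytes outside the usual whitespace/escape set suggest the
    'text' label is wrong and the data is really binary."""
    allowed_controls = {0x09, 0x0a, 0x0c, 0x0d, 0x1b}
    return any(b < 0x20 and b not in allowed_controls for b in data)

def looks_like_feed(data: bytes) -> bool:
    """A root element of <rss or <feed at the top of a supposed HTML
    document suggests it is actually an RSS or Atom feed."""
    head = data.lstrip().lower()
    if head.startswith(b'<?xml'):
        head = head.split(b'?>', 1)[-1].lstrip()
    return head.startswith((b'<rss', b'<feed'))
```

As noted above, a spider that only indexes text can get away without either check; a renderer cannot.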


> and probably others listed at
> 
>   http://en.wikipedia.org/wiki/Web_crawler

It's not clear to me that those crawlers are any more likely to honour 
Content-Type headers than the tools discussed above.


> I don't use much intermediary code, but you can see the features on 
> something like
> 
>   http://dansguardian.org/?page=introduction
> 
> is pretty standard.

This kind of software, if it actually honours MIME types, would just be 
trivially bypassed. It seems unlikely that authors of this kind of 
software would honour MIME types instead of sniffing. Unfortunately the 
demo server was not functioning when I tried to use it.


> > There is currently *no way*, given an actual MIME type, for the 
> > algorithm in the HTML5 spec to sniff content not labelled explicitly 
> > as HTML to be treated as HTML. The only ways for the algorithms in the 
> > spec to detect a document as HTML is if it has no Content-Type header 
> > at all, or if it has a header whose value is unknown/unknown or 
> > application/unknown.
> 
> Not even <embed src="myscript.txt" type="text/html">?

<embed> does no sniffing whatsoever, it just honours the type="" attribute 
instead of the Content-Type header. However, <embed>ing an HTML file is 
unlikely to work since you are unlikely to have a plugin configured to 
read HTML files.


> I suggest restructuring the paragraphs into some sort of decision table 
> or structured diagram, since all the "goto section" bits make it 
> difficult to understand.

Yeah, this will probably be rewritten to be easier to read in due course. 
The sections are referred to from other parts of the spec, though, so it's 
not just a matter of making it one section.


> > Sadly, it is. Authors rely on UAs handling the URIs in <img> elements 
> > as images regardless of Content-Type and HTTP response codes. Authors 
> > rely on <script> elements parsing their resources as scripts 
> > regardless of the type of the remote resource. And so forth. These are 
> > behaviours that are tightly integrated with the markup language.
> 
> They don't rely on them -- they are simply not aware of the error.

Ok, let's phrase it this way then. Users rely on UAs handling the URIs in 
<img> elements as images regardless of Content-Type and HTTP response 
codes, so that pages they visit that are erroneous still render usefully.


> > Furthermore, to obtain interoperable *and secure* behaviour when 
> > navigating across browsing contexts, be they top-level pages (windows 
> > or tabs), or frames, iframes, or HTML <object> elements, we have to 
> > define how browsers are to handle navigation of content for those 
> > cases.
> 
> Yes, but why can't that definition be in terms of the end-result of type 
> determination?  Are we talking about a procedure for sniffing in which 
> context is a necessary parameter, or just a procedure for handling the 
> results of sniffing per context.

I don't understand the question. Could you elaborate? How does the spec 
not define navigation (4.6. Navigating across documents) in terms of type 
determination (4.7. Determining the type of a new resource in a browsing 
context)?


> > > Orthogonal standards deserve orthogonal specs.  Why don't you put it 
> > > in a specification that is specifically aimed at Web Browsers and 
> > > Content Filters?
> > 
> > The entire section in question is entitled "Web browsers". Browsers 
> > are one of the most important conformance classes that HTML5 targets 
> > (the other most important one being authors). We would be remiss if we 
> > didn't define how browsers should work!
> 
> Everything does not need to be defined in the same document.

If you just want the content to be in a different file, you could use the 
multipage version of the spec:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/section-content-type-sniffing.html#nav-bar

...but I don't see how that really changes anything. We can't really split 
it into independent documents, since the content is all interrelated. (We 
learnt with DOM2 HTML and HTML4 how it was a mistake to split the related 
parts into separate specs -- you end up with things falling between the 
cracks as spec writers define their scope in ways that don't quite line up 
seamlessly.)


> > > I agree that a single sniffing algorithm would help, eventually, but 
> > > it still needs to be clear that overriding the server-supplied type 
> > > is an error by somebody, somewhere, and not part of "normal" Web 
> > > browsing.
> > 
> > Of course it is. Who said otherwise?
> 
> Where is the error handling specified for sniffed-type != media-type?

The error _handling_, from a UA perspective, is defined in "4.7. 
Determining the type of a new resource in a browsing context". If, on the 
other hand, you are asking where it says that it is an error for the 
author in the first place, then the answer is presumably in the HTTP 
specification, though I actually couldn't find any MUST requirements there 
saying that the given Content-Type must match the actual type of the 
content. As far as _HTML_ goes, the HTML5 spec says:

# HTML documents, if they are served over the wire (e.g. by HTTP) must be 
# labelled with the text/html MIME type.

...in the "1.3. Conformance requirements" section.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 23 August 2007 21:13:03 GMT