- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 23 Aug 2007 21:12:36 +0000 (UTC)
- To: "Roy T. Fielding" <fielding@gbiv.com>
- Cc: public-html@w3.org
On Wed, 22 Aug 2007, Roy T. Fielding wrote:
> On Aug 21, 2007, at 11:16 PM, Ian Hickson wrote:
> > On Tue, 21 Aug 2007, Roy T. Fielding wrote:
> > > > I entirely agree with the above and in fact with most of what you
> > > > wrote, but just for the record, could you list some of the
> > > > Web-based functionality you mention that depends on Content-Type
> > > > to control behaviour?
> > >
> > > Mostly intermediaries, spiders, and the analysis engines that
> > > perform operations on the results of both.
> >
> > Could you name any specific intermediaries, spiders, and analysis
> > engines that honour Content-Type headers with no sniffing? I would
> > really like to study some of these in more detail.
>
> MOMspider

MOMspider only looks at HTML files. For pages that have Content-Type
headers that aren't unknown/unknown or application/unknown, it does the
same as the spec (except that it doesn't look for feeds).

For pages that _don't_ have Content-Type information, MOMspider could
benefit greatly from following the HTML5 spec, as it currently uses a
heuristic based on the file extension rather than checking the file's
content (the latter being a far more reliable indicator of file type).

> W3CRobot

W3CRobot has a quite extensive content-type sniffing algorithm; for
security (to prevent different tools from sniffing the same content as
different types), it would be good if all the tools that do this used
the same algorithm, such as the one in the HTML5 draft.

(I couldn't actually work out what part of W3CRobot consumed the
content, and therefore couldn't verify that it really does honour the
MIME type any more than browsers do. Can you give me a pointer?)

> several hundred scripts based on libwww-perl or LWP

I use these myself, and my experience is that unless you go out of your
way to check MIME types, which most authors do not, they do not honour
Content-Types. Do you have any examples of tools built on these that do
honour Content-Type headers?
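To make the contrast concrete, here is a simplified, illustrative sketch (in Python) of content-based detection in the spirit of the HTML5 draft's sniffing table. It is not the draft's exact algorithm; the pattern list is an abbreviated assumption for illustration only.

```python
# Hedged sketch: detect HTML by inspecting leading bytes rather than
# trusting a file extension. The pattern list below is an abbreviated
# stand-in for the HTML5 draft's full sniffing table, not the real thing.

HTML_PATTERNS = [b"<!DOCTYPE HTML", b"<HTML", b"<HEAD", b"<SCRIPT"]

def looks_like_html(first_bytes: bytes) -> bool:
    """Return True if the leading bytes resemble an HTML document."""
    stripped = first_bytes.lstrip(b" \t\r\n").upper()
    return any(stripped.startswith(p) for p in HTML_PATTERNS)

print(looks_like_html(b"  <!doctype html><html>..."))  # True
print(looks_like_html(b"%PDF-1.4"))                    # False
```

The point is that a `.html` extension says nothing reliable about the bytes, whereas the bytes themselves do; a spider using an extension heuristic and a browser using a byte-pattern table will disagree on exactly the files where it matters.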
> PhpDig

This, again, could benefit from the algorithm in the spec, as it has a
simple default of HTML when the header is absent. In general, though,
this code actually follows the HTML5 spec pretty closely; the only
things it doesn't do are checking that the text data it received isn't
actually binary, and trying to detect Atom or RSS feeds when it thinks
it has HTML, neither of which really matters for this spider.

> and probably others listed at
>
>    http://en.wikipedia.org/wiki/Web_crawler

It's not clear to me that it is probable that they follow the
Content-Type headers.

> I don't use much intermediary code, but you can see the features on
> something like
>
>    http://dansguardian.org/?page=introduction
>
> is pretty standard.

This kind of software, if it actually honoured MIME types, would be
trivially bypassed. It seems unlikely that authors of this kind of
software would honour MIME types instead of sniffing. Unfortunately the
demo server was not functioning when I tried to use it.

> > There is currently *no way*, given an actual MIME type, for the
> > algorithm in the HTML5 spec to sniff content not labelled explicitly
> > as HTML to be treated as HTML. The only ways for the algorithms in
> > the spec to detect a document as HTML are if it has no Content-Type
> > header at all, or if it has a header whose value is unknown/unknown
> > or application/unknown.
>
> Not even <embed src="myscript.txt" type="text/html">?

<embed> does no sniffing whatsoever; it just honours the type=""
attribute instead of the Content-Type header. However, <embed>ing an
HTML file is unlikely to work, since you are unlikely to have a plugin
configured to read HTML files.

> I suggest restructuring the paragraphs into some sort of decision
> table or structured diagram, since all the "goto section" bits make it
> difficult to understand.

Yeah, this will probably be rewritten to be easier to read in due
course.
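For the text-vs-binary check that PhpDig skips, a minimal sketch of the idea follows. The HTML5 draft's check works by scanning the first bytes of the resource for control characters that never appear in plain text; the exact byte set below is my reading of the draft's control-character list and should be treated as an assumption, not a quotation of the spec.

```python
# Hedged sketch of a text-vs-binary check in the style of the HTML5
# draft: if any of these control bytes appear early in the resource,
# treat it as binary rather than text. The byte set is an assumption
# based on the draft's list, not copied from it verbatim.

BINARY_BYTES = (
    set(range(0x00, 0x09))   # NUL .. BS
    | {0x0B}                 # VT (but not TAB, LF, CR)
    | set(range(0x0E, 0x1B)) # SO .. SUB
    | set(range(0x1C, 0x20)) # FS .. US (but not ESC)
)

def is_probably_binary(data: bytes) -> bool:
    """Scan the first 512 bytes for control characters absent from text."""
    return any(b in BINARY_BYTES for b in data[:512])

print(is_probably_binary(b"\x89PNG\r\n\x1a\n"))      # True
print(is_probably_binary(b"plain text, line 1\n"))   # False
```

A spider that defaults to "HTML" on a missing header, without a check like this, will happily feed PNGs and ZIPs to its HTML parser.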
The sections are referred to from other parts of the spec, though, so
it's not just a matter of making it one section.

> > Sadly, it is. Authors rely on UAs handling the URIs in <img>
> > elements as images regardless of Content-Type and HTTP response
> > codes. Authors rely on <script> elements parsing their resources as
> > scripts regardless of the type of the remote resource. And so forth.
> > These are behaviours that are tightly integrated with the markup
> > language.
>
> They don't rely on them -- they are simply not aware of the error.

Ok, let's phrase it this way, then: users rely on UAs handling the URIs
in <img> elements as images regardless of Content-Type and HTTP
response codes, so that pages they visit that are erroneous still
render usefully.

> > Furthermore, to obtain interoperable *and secure* behaviour when
> > navigating across browsing contexts, be they top-level pages
> > (windows or tabs), or frames, iframes, or HTML <object> elements, we
> > have to define how browsers are to handle navigation of content for
> > those cases.
>
> Yes, but why can't that definition be in terms of the end-result of
> type determination? Are we talking about a procedure for sniffing in
> which context is a necessary parameter, or just a procedure for
> handling the results of sniffing per context?

I don't understand the question; could you elaborate? How does the spec
not define navigation (4.6. Navigating across documents) in terms of
type determination (4.7. Determining the type of a new resource in a
browsing context)?

> > > Orthogonal standards deserve orthogonal specs. Why don't you put
> > > it in a specification that is specifically aimed at Web Browsers
> > > and Content Filters?
> >
> > The entire section in question is entitled "Web browsers". Browsers
> > are one of the most important conformance classes that HTML5 targets
> > (the other most important one being authors). We would be remiss if
> > we didn't define how browsers should work!
> Everything does not need to be defined in the same document.

If you just want the content to be in a different file, you could use
the multipage version of the spec:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/section-content-type-sniffing.html#nav-bar

...but I don't see how that really changes anything. We can't really
split it into independent documents, since the content is all
interrelated. (We learnt with DOM2 HTML and HTML4 how it was a mistake
to split the related parts into separate specs -- you end up with
things falling between the cracks as spec writers define their scope in
ways that don't quite line up seamlessly.)

> > > I agree that a single sniffing algorithm would help, eventually,
> > > but it still needs to be clear that overriding the server-supplied
> > > type is an error by somebody, somewhere, and not part of "normal"
> > > Web browsing.
> >
> > Of course it is. Who said otherwise?
>
> Where is the error handling specified for sniffed-type != media-type?

The error _handling_, from a UA perspective, is defined in "4.7.
Determining the type of a new resource in a browsing context". If, on
the other hand, you are asking where it says that it is an error for
the author in the first place, then the answer is presumably in the
HTTP specification, though I actually couldn't find any MUST
requirements there saying that the given Content-Type must match the
actual type of the content.

As far as _HTML_ goes, the HTML5 spec says:

# HTML documents, if they are served over the wire (e.g. by HTTP) must
# be labelled with the text/html MIME type.

...in the "1.3. Conformance requirements" section.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 23 August 2007 21:13:03 UTC