- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Thu, 23 Aug 2007 17:05:35 -0700
- To: Ian Hickson <ian@hixie.ch>
- Cc: public-html@w3.org
On Aug 23, 2007, at 2:12 PM, Ian Hickson wrote:

> MOMspider only looks at HTML files. For pages that have Content-Type
> headers that aren't unknown/unknown or application/unknown, it does the
> same as the spec (except that it doesn't look for feeds). For pages that
> _don't_ have Content-Type information, MOMspider could benefit greatly
> from following the HTML5 spec, as currently it uses a heuristic based on
> the file extension rather than checking the file's content (the latter
> being a far more reliable indicator of file type).

I think you missed the point. MOMspider uses a variety of mechanisms to
trim its traversal space to only those resources for which the type is
known to be hypertext and understood by its parser. One of those
mechanisms is the result of a HEAD request that tells it, among other
things, the Content-Type. If the Content-Type indicates text/plain,
MOMspider will never see the content and thus never be able to sniff.
It can't do otherwise without causing all of its other checks to use
GET, which would create an unacceptable bandwidth and load issue on
tested servers and eventually lead to it being banned.

MOMspider is over twelve years old at this point, but I am sure that the
same types of behavior are present in today's link checkers. Therefore,
MOMspider (and its ilk) are affected by the accuracy of content-type
headers on the Web. That was never an issue until MSIE added sniffing
without reporting errors, after which the mismatch errors got steadily
worse as the older browsers got replaced.

There is no way to compensate for this problem by causing all clients to
use the same sniffing algorithm -- some clients never see the content,
on purpose. The solution is to require that compliant sniffing be
combined with compliant error reporting. It is not a perfect solution,
but it will at least give us a chance to reintroduce feedback in the
loop.
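[Editor's note: the HEAD-based pruning described above can be sketched as
follows. MOMspider itself is Perl; this is an illustrative Python sketch
with hypothetical names, showing only the decision structure.]

```python
# Hypothetical sketch of how a link checker like MOMspider prunes its
# traversal space: it issues a HEAD request first and only follows up
# with a GET when the declared Content-Type says the body is hypertext
# it can parse. A resource mislabelled text/plain is pruned here, so no
# sniffing algorithm could ever run on its content.

PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}

def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' from a Content-Type value."""
    return content_type_header.split(";", 1)[0].strip().lower()

def should_fetch_body(head_response_headers):
    """Decide, from HEAD metadata alone, whether a GET is worthwhile."""
    declared = head_response_headers.get("Content-Type")
    if declared is None:
        # No metadata at all: fall back to a heuristic (MOMspider used
        # the file extension) or, per HTML5, to content sniffing.
        return True
    return media_type(declared) in PARSEABLE_TYPES

print(should_fetch_body({"Content-Type": "text/plain"}))                # False
print(should_fetch_body({"Content-Type": "text/html; charset=utf-8"}))  # True
```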
>>> There is currently *no way*, given an actual MIME type, for the
>>> algorithm in the HTML5 spec to sniff content not labelled explicitly
>>> as HTML to be treated as HTML. The only ways for the algorithms in
>>> the spec to detect a document as HTML are if it has no Content-Type
>>> header at all, or if it has a header whose value is unknown/unknown
>>> or application/unknown.
>>
>> Not even <embed src="myscript.txt" type="text/html">?
>
> <embed> does no sniffing whatsoever, it just honours the type=""
> attribute instead of the Content-Type header. However, <embed>ing an
> HTML file is unlikely to work since you are unlikely to have a plugin
> configured to read HTML files.

That assumes an awful lot about a specific implementation of embed. The
spec seems to imply an implementation could have built-in support, and
"typically non-HTML" would lead me to believe that HTML is allowed.
Combine that with the big red box about sniffing.

   http://www.whatwg.org/specs/web-apps/current-work/#embed

Maybe it is just my imagination, but the on-list discussion seemed to be
leaning toward adding more sniffing to HTML5 for embed and object, not
less. My preference would be for the box to be prevented from activating
unless the types match or the user overrides, even if that preference is
only available under a non-default test/security mode.

>> I suggest restructuring the paragraphs into some sort of decision
>> table or structured diagram, since all the "goto section" bits make
>> it difficult to understand.
>
> Yeah, this will probably be rewritten to be easier to read in due
> course. The sections are referred to from other parts of the spec,
> though, so it's not just a matter of making it one section.
>
>>> Sadly, it is. Authors rely on UAs handling the URIs in <img>
>>> elements as images regardless of Content-Type and HTTP response
>>> codes.
>>> Authors rely on <script> elements parsing their resources as scripts
>>> regardless of the type of the remote resource. And so forth. These
>>> are behaviours that are tightly integrated with the markup language.
>>
>> They don't rely on them -- they are simply not aware of the error.
>
> Ok, let's phrase it this way then. Users rely on UAs handling the URIs
> in <img> elements as images regardless of Content-Type and HTTP
> response codes, so that pages they visit that are erroneous still
> render usefully.

Why? Users don't rely on that -- browser vendors do because they'd
rather whitewash errors than deal with questions.

>>> Furthermore, to obtain interoperable *and secure* behaviour when
>>> navigating across browsing contexts, be they top-level pages
>>> (windows or tabs), or frames, iframes, or HTML <object> elements, we
>>> have to define how browsers are to handle navigation of content for
>>> those cases.
>>
>> Yes, but why can't that definition be in terms of the end-result of
>> type determination? Are we talking about a procedure for sniffing in
>> which context is a necessary parameter, or just a procedure for
>> handling the results of sniffing per context.
>
> I don't understand the question. Could you elaborate? How does the
> spec not define navigation (4.6. Navigating across documents) in terms
> of type determination (4.7. Determining the type of a new resource in
> a browsing context)?

I mean: can the algorithm be specified without every single use of that
algorithm being aware of its internal details? Specs are just another
form of programming. The procedure currently has several entry points
and a dozen exit points, and I am asking whether

  a) the sniffing procedure needs to be aware of the context to
     determine the sniffed type; or,

  b) the sniffing procedure is the same for all contexts, but how the
     result of sniffing is used/discarded changes by context.
If it is the former, then defining the procedure with a context
parameter makes sense (although it would be a lot easier to read if each
context value was dealt with individually as an outer case). If it is
the latter, then the context should only be discussed where the result
is used, not within the sniffing algorithm. That would simplify the
algorithm and place discussion about when the result is used (or
reported as an error) back in the sections on the individual
elements/actions that might sniff.

>>>> Orthogonal standards deserve orthogonal specs. Why don't you put it
>>>> in a specification that is specifically aimed at Web Browsers and
>>>> Content Filters?
>>>
>>> The entire section in question is entitled "Web browsers". Browsers
>>> are one of the most important conformance classes that HTML5 targets
>>> (the other most important one being authors). We would be remiss if
>>> we didn't define how browsers should work!
>>
>> Everything does not need to be defined in the same document.
>
> If you just want the content to be in a different file, you could use
> the multipage version of the spec:
>
>    http://www.whatwg.org/specs/web-apps/current-work/multipage/
>    section-content-type-sniffing.html#nav-bar
>
> ...but I don't see how that really changes anything. We can't really
> split it into independent documents, since the content is all
> interrelated. (We learnt with DOM2 HTML and HTML4 how it was a mistake
> to split the related parts into separate specs -- you end up with
> things falling between the cracks as spec writers define their scope
> in ways that don't quite line up seamlessly.)

Well, that is an editorial issue. The reason for placing it in different
specs is so that implementers of HTML-generating applications would not
have to read it. YMMV.
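[Editor's note: the distinction between options (a) and (b) above can be
made concrete. The sketch below, with entirely hypothetical names and a
deliberately trivial sniffing heuristic, shows option (b): a single
context-free sniffing procedure, with each context deciding only what to
do with the result.]

```python
# Hypothetical sketch of option (b): the sniffing procedure itself knows
# nothing about browsing contexts; each context decides separately what
# to do with the sniffed result (use it, discard it, report an error).

def sniff(declared_type, first_bytes):
    """Context-free sniffing: return a single effective media type."""
    if declared_type not in (None, "unknown/unknown", "application/unknown"):
        return declared_type          # honour explicit server metadata
    if first_bytes.lstrip().lower().startswith(b"<!doctype html"):
        return "text/html"
    if first_bytes.startswith(b"\x89PNG"):
        return "image/png"
    return "application/octet-stream"

def handle_in_img_context(declared_type, first_bytes):
    """An <img> context discards any result that is not an image."""
    sniffed = sniff(declared_type, first_bytes)
    return sniffed if sniffed.startswith("image/") else None

def handle_in_navigation_context(declared_type, first_bytes):
    """Top-level navigation accepts whatever type was determined."""
    return sniff(declared_type, first_bytes)
```

Under this structure, the sections on individual elements would discuss
only their own handler, and the sniffing algorithm would need no context
parameter at all.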
Personally, I would prefer that HTML be defined according to its ideal
data definition (as if the world were a perfect place and all generators
produced exactly what we want to parse) and then separately define a
data transformation algorithm that, given any tag soup, will
consistently transform it to a valid HTML instance. Such a thing is far
easier to read, and test, than a specification that tries to enshrine
all of the special-case legacy handling while at the same time defining
the data model. Again, YMMV.

>>>> I agree that a single sniffing algorithm would help, eventually,
>>>> but it still needs to be clear that overriding the server-supplied
>>>> type is an error by somebody, somewhere, and not part of "normal"
>>>> Web browsing.
>>>
>>> Of course it is. Who said otherwise?
>>
>> Where is the error handling specified for sniffed-type != media-type?
>
> The error _handling_, from a UA perspective, is defined in "4.7.
> Determining the type of a new resource in a browsing context". If, on
> the other hand, you are asking where it says that it is an error for
> the author in the first place, then the answer is presumably in the
> HTTP specification, though I actually couldn't find any MUST
> requirements there saying that the given Content-Type must match the
> actual type of the content. As far as _HTML_ goes, the HTML5 spec
> says:
>
> # HTML documents, if they are served over the wire (e.g. by HTTP) must
> # be labelled with the text/html MIME type.
>
> ...in the "1.3. Conformance requirements" section.

I mean that it is missing:

  4.7.6 When sniffed type disagrees with Content-Type metadata

  If Content-Type metadata is present but differs from the sniffed
  type, then this discrepancy SHOULD be reported to the user as a
  content error unless such reporting has been turned off by
  configuration. [... perhaps also disable script handling within the
  context of such a discrepancy ...]

....Roy
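[Editor's note: the behaviour the proposed 4.7.6 text would require of a
UA is small enough to sketch. All names below are hypothetical, and the
reporting mechanism itself is left abstract.]

```python
# Hypothetical sketch of the proposed 4.7.6 requirement: when
# Content-Type metadata is present but disagrees with the sniffed type,
# the UA SHOULD surface the discrepancy to the user as a content error,
# unless that reporting has been disabled by configuration.

def check_type_mismatch(declared_type, sniffed_type, report_errors=True):
    """Return a list of content-error messages (empty when types agree)."""
    errors = []
    if declared_type is not None and declared_type != sniffed_type:
        if report_errors:
            errors.append(
                f"content error: served as {declared_type!r}, "
                f"sniffed as {sniffed_type!r}"
            )
        # Per the bracketed suggestion in the proposal, a UA might also
        # disable script handling in this context; that is not shown here.
    return errors
```

Note that when no Content-Type metadata is present at all, there is no
discrepancy to report: sniffing is then the only source of type
information, which matches the feedback-loop argument earlier in the
message.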
Received on Friday, 24 August 2007 00:05:44 UTC