Re: review of content type rules by IETF/HTTP community

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 24 Aug 2007 09:26:47 +0000 (UTC)
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: public-html@w3.org
Message-ID: <Pine.LNX.4.64.0708240835300.7485@dhalsim.dreamhost.com>

On Thu, 23 Aug 2007, Roy T. Fielding wrote:
> On Aug 23, 2007, at 2:12 PM, Ian Hickson wrote:
> >
> > MOMspider only looks at HTML files. For pages that have Content-Type 
> > headers that aren't unknown/unknown or application/unknown, it does 
> > the same as the spec (except that it doesn't look for feeds). For 
> > pages that _don't_ have Content-Type information, MOMspider could 
> > benefit greatly from following the HTML5 spec, as currently it uses a 
> > heuristic based on the file extension rather than checking the file's 
> > content (the latter being a far more reliable indicator of file type).
> I think you missed the point.  MOMspider uses a variety of mechanisms to 
> trim its traversal space to only those resources for which the type is 
> known to be hypertext and understood by its parser.  One of those 
> mechanisms is the result of a HEAD request that tells it, among other 
> things, the Content-Type.  If the Content-Type indicates text/plain, 
> MOMspider will never see the content and thus never be able to sniff.

Which is fine (and per spec), since text/plain will never get sniffed as 
HTML by the HTML5 algorithm.

> It can't do otherwise without causing all of its other checks to use 
> GET, which would create an unacceptable bandwidth and load issue on 
> tested servers and eventually lead to it being banned.

No, this is not true. To change from what it does now to being 100% 
compatible with what the HTML5 spec does, it just needs to change its 
current fallback code in the following ways:

 * Treat "unknown/unknown" and "application/unknown" types the same as the 
   lack of a Content-Type header.

 * Instead of sniffing the content type from the filename when there's no 
   Content-Type header, use a GET request and examine the first few bytes 
   of the file.

This would not cause an "unacceptable bandwidth and load issue"; in fact, it 
would hardly increase the load at all.
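To make the two bullet points concrete, here is a minimal sketch of the kind of fallback a link checker could adopt. This is illustrative only, not the HTML5 sniffing algorithm itself; the function names and the signature list are hypothetical, chosen to show the shape of the check, not its exact spec-defined byte patterns.

```python
# Hypothetical sketch of a link checker's fallback logic, assuming the two
# changes described above. Not the actual HTML5 algorithm.

# A few byte prefixes that strongly suggest an HTML document (illustrative
# subset; the real spec enumerates its patterns precisely).
HTML_SIGNATURES = (
    b"<!doctype html",
    b"<html",
    b"<head",
    b"<body",
    b"<!--",
)

def should_sniff(content_type):
    """Treat the 'unknown' pseudo-types the same as a missing header."""
    return content_type in (None, "unknown/unknown", "application/unknown")

def looks_like_html(first_bytes):
    """Examine the first few bytes of a GET response for HTML markers."""
    sample = first_bytes.lstrip(b" \t\r\n").lower()
    return any(sample.startswith(sig) for sig in HTML_SIGNATURES)
```

A checker would still use HEAD for every resource with a usable Content-Type, and fall back to a single ranged or truncated GET only when `should_sniff` returns true, which is why the extra load stays small.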

> MOMspider is over twelve years old at this point, but I am sure that the 
> same types of behavior are present in today's link checkers.

Conformance checkers that want to catch errors in Content-Type headers are 
going to have to do a lot more sniffing than the spec requires, 
ironically. Link checkers, though, including MOMspider if that's what it's 
used for, can assume the MIME types are conformant (especially if used as 
part of a validation pipeline) and don't need to do any sniffing. There's 
a distinction between an end-user tool or user agent working on arbitrary 
content from other authors, and a development or authoring tool used by a 
web designer to create or test their own sites.

> Therefore, MOMspider (and its ilk) are affected by the accuracy of
> content-type headers on the Web.  That was never an issue until MSIE
> added sniffing without reporting errors, after which the mismatch
> errors got steadily worse as the older browsers got replaced.

MSIE was far from the first browser to do sniffing, though they did add 
more heuristics.

> The solution is to require that compliant sniffing be combined with 
> compliant error reporting.  It is not a perfect solution, but it will at 
> least give us a chance to reintroduce feedback in the loop.

Telling the end user that the content is broken will not do anything. 
Users wouldn't understand why the UA kept saying it, especially since it 
would say it for most pages on the Web. Making the errors only appear in 
error consoles is already done in many cases, and could be done in more, 
but that's a UA issue, not an interoperability issue, and thus out of 
scope for a specification. (Mozilla already reports Content-Type errors 
for stylesheets, but nobody cares.)

> > > Not even <embed src="myscript.txt" type="text/html">?
> > 
> > <embed> does no sniffing whatsoever, it just honours the type="" 
> > attribute instead of the Content-Type header. However, <embed>ing an 
> > HTML file is unlikely to work since you are unlikely to have a plugin 
> > configured to read HTML files.
> That assumes an awful lot about a specific implementation of embed. The 
> spec seems to imply an implementation could have built-in support, and 
> "typically non-HTML" would lead me to believe that HTML is allowed.

Yeah, maybe we should just disallow HTML.

> Combine that with the big red box about sniffing.
>  http://www.whatwg.org/specs/web-apps/current-work/#embed
> Maybe it is just my imagination, but the on-list discussion seemed to be 
> leaning toward adding more sniffing to HTML5 for embed and object, not 
> less.

<object> has some rather constrained requirements due to legacy issues. I 
don't think <embed> has any though; as far as I can tell, the type="" 
attribute always overrides the server, and the server otherwise is always 
honoured. I could be wrong though, which is what the aforementioned red 
box is about.
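The precedence described above is simple enough to state as a one-liner. A hedged sketch, with a hypothetical function name (this is a reading of the behaviour described in this thread, not text lifted from the spec):

```python
# Illustrative sketch: for <embed>, the type="" attribute, when present,
# overrides the server's Content-Type; otherwise the server is honoured.
def embed_effective_type(type_attr, content_type_header):
    """Return the MIME type an <embed> would be handled as."""
    if type_attr:
        return type_attr
    return content_type_header
```

So `<embed src="myscript.txt" type="text/html">` would be handed to whatever handler is registered for text/html, regardless of the server saying text/plain.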

> My preference would be for the box to be prevented from activating 
> unless the types match or the user overrides, even if that preference is 
> only available under a non-default test/security mode.

User agents typically don't find "[x] Make my browser less able to 
usefully render Web pages" a popular preference, and so they rarely find 
it a useful use of their (very) limited resources to implement and test 
such a feature.

But, I encourage you to convince the browser vendors to do this. If they 
agree, maybe we can add it to the spec after all.

> > Ok, let's phrase it this way then. Users rely on UAs handling the URIs 
> > in <img> elements as images regardless of Content-Type and HTTP 
> > response codes, so that pages they visit that are erroneous still 
> > render usefully.
> Why?  Users don't rely on that -- browser vendors do because they'd 
> rather whitewash errors than deal with questions.

What questions? Users aren't going to ask their browser vendor why the new 
version of their browser doesn't render the pages they visit correctly 
anymore -- they're just going to use another browser, which does.

> I mean: can the algorithm be specified without every single use of that 
> algorithm being aware of its internal details?  Specs are just another 
> form of programming.  The procedure currently has several entry points 
> and a dozen exit points, and I am asking whether
>  a) the sniffing procedure needs to be aware of the context to
>     determine the sniffed type; or,
>  b) the sniffing procedure is the same for all contexts, but how
>     the result of sniffing is used/discarded changes by context.

Your question implies that parts of the spec are in fact aware of the 
internal details of the sniffing, but I can't think of which parts that 
could be. Could you give me some pointers? I don't really follow.

> The reason for placing it in different specs is so that implementations 
> of HTML-generating applications would not have to read it.  YMMV.

HTML-generating applications already don't have to read huge chunks of the 
spec. I don't think this is an especially interesting section to extract.

> Personally, I would prefer that HTML be defined according to its ideal 
> data definition (as if the world were a perfect place and all generators 
> produced exactly what we want to parse) and then separately define a 
> data transformation algorithm that, given any tag soup, will 
> consistently transform it to a valid HTML instance. Such a thing is far 
> easier to read, and test, than a specification that tries to enshrine 
> all of the special-case legacy handling while at the same time defining 
> the data model.  Again, YMMV.

My mileage does indeed vary. :-)

> I mean that it is missing:
>   4.7.6  When sniffed type disagrees with Content-Type metadata
>   If Content-Type metadata is present but differs from the sniffed
>   type, then this discrepancy SHOULD be reported to the user as a
>   content error unless such reporting has been turned off by
>   configuration.  [... perhaps also disable script handling within
>   the context of such a discrepancy ...]

Such user interface issues are out of scope of a specification, IMHO. They 
are not required for interoperability.

Furthermore, that requirement would be ignored. If you can get a browser 
vendor to actually implement the above in a way you consider acceptable, 
then I'd consider putting it in the spec -- but there's no point putting 
something in which every single browser vendor has repeatedly told me and 
others that they would never do.

Browser vendors don't implement specs they disagree with. As spec authors, 
we only have power over user agent implementors so long as we tell them to 
do things they want to do anyway.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 24 August 2007 09:27:15 UTC