Re: review of content type rules by IETF/HTTP community

On Aug 23, 2007, at 2:12 PM, Ian Hickson wrote:
> MOMspider only looks at HTML files. For pages that have Content-Type
> headers that aren't unknown/unknown or application/unknown, it does
> the same as the spec (except that it doesn't look for feeds). For
> pages that _don't_ have Content-Type information, MOMspider could
> benefit greatly from following the HTML5 spec, as currently it uses a
> heuristic based on the file extension rather than checking the file's
> content (the latter being a far more reliable indicator of file type).

I think you missed the point.  MOMspider uses a variety of mechanisms
to trim its traversal space to only those resources for which the type
is known to be hypertext and understood by its parser.  One of those
mechanisms is the result of a HEAD request that tells it, among other
things, the Content-Type.  If the Content-Type indicates text/plain,
MOMspider will never see the content and thus never be able to sniff.
It can't do otherwise without causing all of its other checks to use
GET, which would create an unacceptable bandwidth and load issue on
tested servers and eventually lead to it being banned.  MOMspider is
over twelve years old at this point, but I am sure that the same types
of behavior are present in today's link checkers.
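
A minimal sketch of that HEAD-based pruning, in Python.  This is not
MOMspider's actual code; the names HYPERTEXT_TYPES, media_type,
should_traverse and head_content_type are hypothetical, and the set of
parseable types is an assumption.  The point it illustrates is that the
decision is made from metadata alone, so a body mislabeled text/plain
is never fetched and never sniffed.

```python
from urllib.request import Request, urlopen

# Assumption: the only media type this checker's parser understands.
HYPERTEXT_TYPES = {"text/html"}


def media_type(content_type_header):
    """Return the bare, lowercased media type from a Content-Type value."""
    if not content_type_header:
        return None
    # Drop parameters such as "; charset=utf-8".
    return content_type_header.split(";", 1)[0].strip().lower()


def should_traverse(content_type_header):
    """Decide from HEAD metadata alone whether to GET and parse the body."""
    return media_type(content_type_header) in HYPERTEXT_TYPES


def head_content_type(url):
    """Issue a HEAD request (network I/O) and return the Content-Type header."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")
```

A checker built this way would call head_content_type(url) for every
link and issue a GET only when should_traverse(...) is true.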

Therefore, MOMspider (and its ilk) are affected by the accuracy of
content-type headers on the Web.  That was never an issue until MSIE
added sniffing without reporting errors, after which the mismatch
errors got steadily worse as the older browsers got replaced.
There is no way to compensate for this problem by causing all clients
to use the same sniffing algorithm -- some clients never see the
content, on purpose.

The solution is to require that compliant sniffing be combined with
compliant error reporting.  It is not a perfect solution, but it will
at least give us a chance to reintroduce feedback in the loop.

>>> There is currently *no way*, given an actual MIME type, for the
>>> algorithm in the HTML5 spec to sniff content not labelled explicitly
>>> as HTML to be treated as HTML. The only ways for the algorithms in
>>> the spec to detect a document as HTML is if it has no Content-Type
>>> header at all, or if it has a header whose value is unknown/unknown
>>> or application/unknown.
>>
>> Not even <embed src="myscript.txt" type="text/html">?
>
> <embed> does no sniffing whatsoever, it just honours the type=""
> attribute instead of the Content-Type header. However, <embed>ing an
> HTML file is unlikely to work since you are unlikely to have a plugin
> configured to read HTML files.

That assumes an awful lot about a specific implementation of embed.
The spec seems to imply an implementation could have built-in support,
and "typically non-HTML" would lead me to believe that HTML is allowed.
Combine that with the big red box about sniffing.

   http://www.whatwg.org/specs/web-apps/current-work/#embed

Maybe it is just my imagination, but the on-list discussion seemed to
be leaning toward adding more sniffing to HTML5 for embed and object,
not less.  My preference would be for the box to be prevented from
activating unless the types match or the user overrides, even if that
preference is only available under a non-default test/security mode.

>> I suggest restructuring the paragraphs into some sort of decision
>> table or structured diagram, since all the "goto section" bits make
>> it difficult to understand.
>
> Yeah, this will probably be rewritten to be easier to read in due
> course. The sections are referred to from other parts of the spec,
> though, so it's not just a matter of making it one section.
>
>
>>> Sadly, it is. Authors rely on UAs handling the URIs in <img>
>>> elements as images regardless of Content-Type and HTTP response
>>> codes. Authors rely on <script> elements parsing their resources as
>>> scripts regardless of the type of the remote resource. And so forth.
>>> These are behaviours that are tightly integrated with the markup
>>> language.
>>
>> They don't rely on them -- they are simply not aware of the error.
>
> Ok, let's phrase it this way then. Users rely on UAs handling the
> URIs in <img> elements as images regardless of Content-Type and HTTP
> response codes, so that pages they visit that are erroneous still
> render usefully.

Why?  Users don't rely on that -- browser vendors do because they'd
rather whitewash errors than deal with questions.

>>> Furthermore, to obtain interoperable *and secure* behaviour when
>>> navigating across browsing contexts, be they top-level pages
>>> (windows or tabs), or frames, iframes, or HTML <object> elements, we
>>> have to define how browsers are to handle navigation of content for
>>> those cases.
>>
>> Yes, but why can't that definition be in terms of the end-result of
>> type determination?  Are we talking about a procedure for sniffing
>> in which context is a necessary parameter, or just a procedure for
>> handling the results of sniffing per context?
>
> I don't understand the question. Could you elaborate? How does the
> spec not define navigation (4.6. Navigating across documents) in
> terms of type determination (4.7. Determining the type of a new
> resource in a browsing context)?

I mean: can the algorithm be specified without every single use of
that algorithm being aware of its internal details?  Specs are just
another form of programming.  The procedure currently has several
entry points and a dozen exit points, and I am asking whether

   a) the sniffing procedure needs to be aware of the context to
      determine the sniffed type; or,

   b) the sniffing procedure is the same for all contexts, but how
      the result of sniffing is used/discarded changes by context.

If it is the former, then defining the procedure with a context
parameter makes sense (although it would be a lot easier to read
if each context value was dealt with individually as an outer case).

If it is the latter, then the context should only be discussed
where the result is used, not within the sniffing algorithm.
That would simplify the algorithm and place discussion about
when the result is used (or reported as an error) back in the
sections on the individual elements/actions that might sniff.
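
To make option (b) concrete, here is a rough Python sketch in which the
sniffing procedure takes no context parameter and each context decides
separately what to do with the result.  The function names and the toy
byte-pattern heuristics are assumptions for illustration only; they are
not the algorithm in the HTML5 draft.

```python
def sniff(body):
    """Context-free sniff: inspects only the bytes (toy heuristic)."""
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    head = body[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


# Values the draft treats as "no usable type" (taken from the thread above).
UNKNOWN_TYPES = (None, "unknown/unknown", "application/unknown")


def type_for_navigation(declared, body):
    """Navigation context: trust the declared type, sniff only if unknown."""
    if declared in UNKNOWN_TYPES:
        return sniff(body)
    return declared


def type_for_img(declared, body):
    """<img> context: ignore the declared type, accept only image results."""
    sniffed = sniff(body)
    return sniffed if sniffed.startswith("image/") else None
```

Here sniff() never sees the context; only type_for_navigation() and
type_for_img() know where the result will be used, which is where any
error reporting would also live.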

>>>> Orthogonal standards deserve orthogonal specs.  Why don't you put
>>>> it in a specification that is specifically aimed at Web Browsers
>>>> and Content Filters?
>>>
>>> The entire section in question is entitled "Web browsers". Browsers
>>> are one of the most important conformance classes that HTML5 targets
>>> (the other most important one being authors). We would be remiss if
>>> we didn't define how browsers should work!
>>
>> Everything does not need to be defined in the same document.
>
> If you just want the content to be in a different file, you could
> use the multipage version of the spec:
>
>    http://www.whatwg.org/specs/web-apps/current-work/multipage/section-content-type-sniffing.html#nav-bar
>
> ...but I don't see how that really changes anything. We can't really
> split it into independent documents, since the content is all
> interrelated. (We learnt with DOM2 HTML and HTML4 how it was a
> mistake to split the related parts into separate specs -- you end up
> with things falling between the cracks as spec writers define their
> scope in ways that don't quite line up seamlessly.)

Well, that is an editorial issue.  The reason for placing it in
different specs is so that implementations of HTML-generating
applications would not have to read it.  YMMV.

Personally, I would prefer that HTML be defined according to its
ideal data definition (as if the world were a perfect place and
all generators produced exactly what we want to parse) and then
separately define a data transformation algorithm that, given any
tag soup, will consistently transform it to a valid HTML instance.
Such a thing is far easier to read, and test, than a specification
that tries to enshrine all of the special-case legacy handling
while at the same time defining the data model.  Again, YMMV.

>>>> I agree that a single sniffing algorithm would help, eventually,
>>>> but it still needs to be clear that overriding the server-supplied
>>>> type is an error by somebody, somewhere, and not part of "normal"
>>>> Web browsing.
>>>
>>> Of course it is. Who said otherwise?
>>
>> Where is the error handling specified for sniffed-type != media-type?
>
> The error _handling_, from a UA perspective, is defined in "4.7.
> Determining the type of a new resource in a browsing context". If, on
> the other hand, you are asking where it says that it is an error for
> the author in the first place, then the answer is presumably in the
> HTTP specification, though I actually couldn't find any MUST
> requirements there saying that the given Content-Type must match the
> actual type of the content. As far as _HTML_ goes, the HTML5 spec
> says:
>
> # HTML documents, if they are served over the wire (e.g. by HTTP)
> # must be labelled with the text/html MIME type.
>
> ...in the "1.3. Conformance requirements" section.

I mean that it is missing:

    4.7.6  When sniffed type disagrees with Content-Type metadata

    If Content-Type metadata is present but differs from the sniffed
    type, then this discrepancy SHOULD be reported to the user as a
    content error unless such reporting has been turned off by
    configuration.  [... perhaps also disable script handling within
    the context of such a discrepancy ...]
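
As a rough illustration of what such a rule would ask of a user agent,
the following Python sketch compares the declared and sniffed types and
returns an error message unless reporting is switched off.  The function
name, the reporting_enabled flag, and the message text are all
hypothetical; nothing here is taken from the HTML5 draft.

```python
def content_type_discrepancy(declared, sniffed, reporting_enabled=True):
    """Return a content-error message, or None if there is nothing to report.

    declared: raw Content-Type header value, or None if absent.
    sniffed:  media type produced by whatever sniffing the UA performed.
    """
    if declared is None:
        return None  # no metadata present, so nothing to disagree with
    declared_type = declared.split(";", 1)[0].strip().lower()
    if declared_type == sniffed.strip().lower():
        return None  # metadata and content agree
    if not reporting_enabled:
        return None  # reporting turned off by configuration
    return (
        "content error: Content-Type says %r but the content looks like %r"
        % (declared_type, sniffed)
    )
```

A browser might surface the returned message in an error console, and
could use a non-None result as the trigger for disabling script
handling in that context.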

....Roy

Received on Friday, 24 August 2007 00:05:44 UTC