Re: Content sniffing, feed readers, etc. (was HTML interpreter vs. HTML user agent) from Adam Barth on 2009-05-29 (public-html@w3.org from May 2009)

From: Adam Barth <w3c@adambarth.com>
Date: Fri, 29 May 2009 13:28:46 -0700
To: Larry Masinter <masinter@adobe.com>
Cc: Sam Ruby <rubys@intertwingly.net>, Anne van Kesteren <annevk@opera.com>, Maciej Stachowiak <mjs@apple.com>, "Roy T. Fielding" <fielding@gbiv.com>, HTML WG <public-html@w3.org>
Message-ID: <7789133a0905291328u6395c2f3v6699ecf9436aa12e@mail.gmail.com>

Usually I try to stay out of these political discussions, but I thought
I should comment on a few of the technical points.

On Fri, May 29, 2009 at 12:46 PM, Larry Masinter <masinter@adobe.com> wrote:
> As an example of something in the document for which
> scope is relevant, the issue of "content type sniffing"
> was raised. Do the requirements for content-type
> sniffing only apply to "browsers", or to all HTML
> processors including feed readers?

The sniffing algorithm is useful to implementations that wish to
interoperate with existing web content.  I suspect this includes feed
readers, image editors, and others.

> a) Content-type sniffing of URIs within a HTML document
>  itself: for references to external content, and
>  processing rules which describe what those references are
>  intended to mean. So, for example, if I say
>  http://example.com/foo.gif in an <img>, I could define
>  img@src to say, "if the protocol of the URI
>  is http:, don't follow exactly the HTTP spec when
>  interpreting the URI, but instead do the following", and
>  describe HTML's own rules for content-type sniffing, and
>  for treating images that *say* they are GIF files but
>  *look* like they are JPEG files, well, as JPEG files.

I don't understand what point you're making.  There is a corpus of
HTML documents and HTTP resources deployed in the world.
Implementations can use the sniffing algorithm to make sense of the
bytes they contain.  This involves some esoteric rules for determining
whether or not something is a GIF or a JPEG.
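
To make that concrete, here is a minimal sketch of signature-based
image sniffing.  It is not the algorithm from the draft: the function
name sniff_image_type and the fall-back-to-declared-type behaviour are
assumptions for illustration.  It checks only the well-known leading
bytes of GIF ("GIF87a" / "GIF89a") and JPEG (FF D8 FF).

```python
from typing import Optional

# Well-known leading byte signatures (magic numbers) for the two formats.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_type(data: bytes, declared_type: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: guess an image type from its leading bytes.

    If the bytes match a known signature, that type wins over the
    declared Content-Type; otherwise the declared type is returned
    unchanged (an assumption of this sketch, not a rule from the draft).
    """
    if data.startswith(GIF_SIGNATURES):
        return "image/gif"
    if data.startswith(JPEG_SIGNATURE):
        return "image/jpeg"
    return declared_type
```

With this, a resource served as image/gif whose bytes begin with
FF D8 FF is treated as a JPEG, which is the mislabeled-image case from
the quoted text above.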

>  It's possible to do that. I don't like it much,

I don't think anyone *likes* content sniffing.  It's an unfortunate
reality of the world.

> b) Content type sniffing of HTML itself.
>
>  This is the part I have trouble with.

That's unfortunate, because this part is the most necessary for
compatibility and security.

>  If I have a
>  specification for a language, I could tell people how to
>  recognize instances of that language.
>
>  Let's say ISO defined "The Angle Bracket Language".  It
>  consists of "Any string of characters in any encoding
>  which contains angle brackets."
>
>  And I could give a rule -- "You should recognize any
>  document with angle brackets as if it were served as
>  text/angle-bracket, no matter what the MIME type is."

This doesn't seem useful because there aren't any angle bracket
documents on the web that need this rule in order to be recognized.
Put another way, such a requirement wouldn't reflect reality.

>  But-- what is the scope of applicability of this new rule?
>  Does it apply only to angle bracket processors?  Only web
>  browsers? To anything that wants to be an angle-bracket
>  processor but also wants to process HTML?

I suspect the scope is fairly narrow because not many folks care about
recognizing angle bracket documents.  If a large number of HTTP
resources couldn't be processed correctly without this algorithm, then
the scope would grow.

>  Does the organization that publishes this fine
>  new standard matter? If the W3C publishes it,
>  does it now apply to all W3C specs?

This is a political question, but from a technical point of view, I
don't think it matters.  Either it's useful or not.  If it's useful,
I'll use it.

Adam

Received on Friday, 29 May 2009 20:29:39 UTC