Content sniffing, feed readers, etc. (was HTML interpreter vs. HTML user agent) from Larry Masinter on 2009-05-29 (public-html@w3.org from May 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 29 May 2009 12:46:39 -0700
To: Sam Ruby <rubys@intertwingly.net>, Anne van Kesteren <annevk@opera.com>
CC: Maciej Stachowiak <mjs@apple.com>, "Roy T. Fielding" <fielding@gbiv.com>, HTML WG <public-html@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118CD95EB84@nambx04.corp.adobe.com>

I think there are a couple of issues that are worth
separating out, in the discussion labeled
"HTML interpreters vs. HTML user agents".


Scope of the document: does the document we're working on
  apply to all HTML applications, only HTML interpreters,
  only HTML User Agents with Users, etc.?

  I think the discussion forks into:
  (a) we could more easily reach consensus on the body 
      if the claimed scope were limited, by, for example,
      changing the title and abstract, or
  (b) the intent of the authors, the charter of the group, 
      and practical use, call for a language specification
      which is not narrowly scoped; we should fix the
      problems that would prevent its broad applicability.
  
  Does anyone see any other choice? I'd prefer (b),
  of course.
  
As an example of something in the document for which
scope is relevant, the issue of "content type sniffing"
was raised. Do the requirements for content-type 
sniffing only apply to "browsers", or to all HTML
processors including feed readers?

In this case, I think there are two separate situations
which have different perspectives:

a) Content-type sniffing of URIs within an HTML document
  itself: for references to external content, and
  processing rules which describe what those references are
  intended to mean. So, for example, if I say
  http://example.com/foo.gif in an <img>, I could define
  img@src to say, "if the protocol of the URI
  is http:, don't follow exactly the HTTP spec when
  interpreting the URI, but instead do the following", and
  describe HTML's own rules for content-type sniffing, and
  for treating images that *say* they are GIF files but
  *look* like they are JPEG files, well, as JPEG files.
 
  It's possible to do that. I don't like it much, and I
  certainly think that it needs to be documented,
  reviewed, and well understood by network intermediaries
  that couldn't care less about HTML and APIs and layout but
  want to scan JPEG images for security problems or naughty
  seditious images or whatever. So a separate document
  with external review seems really important, but at
  least it's something that HTML *can* do.
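
To make the GIF-versus-JPEG case concrete, here is a minimal
sketch of that kind of rule. It is not the sniffing algorithm
from any specification; the function name and the signature
table are assumptions made up for illustration. Only the byte
prefixes themselves (GIF87a/GIF89a for GIF, FF D8 FF for JPEG)
are real.

```python
# Hypothetical signature table covering only the two formats named above.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_image_type(declared_type, body):
    """Return the type suggested by the leading bytes, else the declared type."""
    for magic, mime in SIGNATURES:
        if body.startswith(magic):
            return mime
    # No known signature matched: trust what the server said.
    return declared_type


# A response that *says* image/gif but *looks* like a JPEG.
print(sniff_image_type("image/gif", b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # image/jpeg
```

An intermediary that trusts the declared type would scan this
body as a GIF, while a client using the rule above would decode
it as a JPEG, which is why the rule needs outside review.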

b) Content-type sniffing of HTML itself.

  This is the part I have trouble with. If I have a
  specification for a language, I could tell people how to
  recognize instances of that language.

  Let's say ISO defined "The Angle Bracket Language".  It
  consists of "Any string of characters in any encoding
  which contains angle brackets." 

  And I could give a rule -- "You should recognize any
  document with angle brackets as if it were served as
  text/angle-bracket, no matter what the MIME type is."
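
Written out as code, the rule is tiny, which is part of the
point. This is only a sketch of the made-up rule above; the
function name is an assumption, and "text/angle-bracket" is
the hypothetical type from this example, not a registered one.

```python
def effective_type(declared_type, text):
    """Apply the hypothetical Angle Bracket Language recognition rule."""
    # Any angle bracket wins, no matter what type was declared.
    if "<" in text or ">" in text:
        return "text/angle-bracket"
    return declared_type


print(effective_type("text/html", "<p>hello</p>"))    # text/angle-bracket
print(effective_type("text/plain", "if a < b: ..."))  # text/angle-bracket
print(effective_type("text/plain", "hello"))          # text/plain
```

The rule overrides text/html and text/plain alike, so whoever
adopts it is changing what those other types mean in practice.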

  But what is the scope of applicability of this new rule?
  Does it apply only to angle bracket processors?  Only web
  browsers? To anything that wants to be an angle-bracket
  processor but also wants to process HTML?

  Does the organization that publishes this fine
  new standard matter? If the W3C publishes it,
  does it now apply to all W3C specs?

  Does it apply to all web browsers, if it is a publication
  of W3C? To feed readers too?

  If it is published by ISO (oh, say, like ISO has published
  HTML4 https://www.cs.tcd.ie/15445/15445.HTML) can ISO
  define how other processors are to interpret HTTP
  results that say they are text/html but really --
  because they have angle brackets -- SHOULD be
  interpreted as text/angle-bracket?

I think the IETF delegated the authority to the W3C to
define what text/html and application/xhtml+xml "mean", and
the W3C membership, by their approval of the charter of this
working group, have delegated the authority to the W3C HTML
working group to come up with a proposal, for member approval,
which defines text/html; the W3C is working on deciding which
group(s) define application/xhtml+xml.

I don't see any authority or practical way in which this
working group could realistically define what anyone else
considers to be an instance of the language it is
defining.  Certainly the HTML specification can't redefine
"text/plain" to be anything other than "text/plain",
for references that are not themselves invoked from 
inside HTML.

Larry
-- 
http://larry.masinter.net

Received on Friday, 29 May 2009 19:47:50 UTC