W3C home > Mailing lists > Public > www-tag@w3.org > June 2002

Re: Proposed TAG Finding: Internet Media Type registration, consistency of use

From: Tantek Çelik <tantek@cs.stanford.edu>
Date: Sun, 09 Jun 2002 18:38:14 -0700
To: www-tag@w3.org
Cc: Tim Bray <tbray@textuality.com>, Al Gilman <asgilman@iamdigex.net>, Mike Dierken <mike@dataconcert.com>
Message-id: <B9295115.DFB1%tantek@cs.stanford.edu>

After some more consideration...


On 5/22/02 12:51 PM, "Mike Dierken" <mike@dataconcert.com> wrote:

>> -----Original Message-----
>> From: Tantek Çelik [mailto:tantek@cs.stanford.edu]
>> 
>>> Second guessing behavior, while useful for entrenched companies to
>>> reinforce dependencies on implementation quirks, has effects that are
>>> deleterious to the long-term health of the web.
>> 
>> This is nonsense.
>> 
>> There is nothing stopping web servers from fixing their
>> configuration files etc. to return proper mime types.
>> 
> 
> I think the issue is that the web server DOES have a correct configuration
> file and DOES correctly return a proper mime type - yet the browser does
> not obey that response header. This is especially true when 'text/plain' is
> used with data that has content sort of similar to HTML (there are probably
> other more clear examples).

Yes, 'text/plain' is most problematic from the perspective of supporting a
legacy portion of the web.  "text/html" documents are often incorrectly
served as "text/plain", and, as you point out, even properly served
"text/plain" documents are "upgraded" by UAs to "text/html" based on an
imperfect heuristic.

My point was that there is nothing stopping web servers from fixing the
former problem ("text/html" documents are often incorrectly served as
"text/plain").

But unfortunately, as long as that problem exists to any great degree, it
will be very difficult to fix the latter problem ("text/plain" documents are
"upgraded" by UAs to "text/html" based on an imperfect heuristic) without
orphaning some non-trivial subset of the web.
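The "imperfect heuristic" in question can be sketched roughly as follows. This is a minimal illustration, not any actual browser's algorithm; the function name and the particular markup tokens it looks for are assumptions made for the sake of the example:

```python
import re

def sniffed_type(declared_type: str, body: bytes) -> str:
    """Return the type a sniffing UA might actually use (illustrative only)."""
    if declared_type != "text/plain":
        return declared_type
    head = body[:512].lower()
    # Imperfect heuristic: markup-ish tokens near the top of the file.
    if re.search(rb"<!doctype\s+html|<html|<title", head):
        return "text/html"  # the "upgrade" that breaks genuine plain text
    return declared_type

# A plain-text file that merely contains markup gets mis-"upgraded":
print(sniffed_type("text/plain", b"<html><body>served as plain text"))
```

The false-positive case shown at the end is exactly why properly served "text/plain" documents get caught in the net.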

By orphaning I mean just that.  If UAs all of a sudden treated all those
"text/html" pages that are served as "text/plain" as "text/plain", then any
documents that those pages linked to would become unlinked from the web.
A lot of these pages tend to be on old or low-end dilapidated servers, often
in a heavily shared resource situation where the authors do not even control
the server - a "web ghetto" as it were.

Serving these pages as "text/plain" is equivalent to pruning out the
portion of the web referenced and served by such misbehaving/misconfigured servers,
since the hyperlinks on those pages will appear as so much angle-bracket
gibberish to the user.

I think it is inappropriate for a UA to blindly perform a technical action
which non-trivially reduces the effective size of the web from the user's
perspective, and which, for that matter, may inadvertently discriminate
against those in the "web ghetto".

This is why the "text/html" as "text/plain" example is a poor example in the
TAG finding.

I say poor example, because, as I said in my previous posting[1], I do think
avoiding sniffing is in general a good policy.  There are many other
examples of sniffing related misbehaviors that the TAG could have chosen
that would be easier to justify, and less damaging to the web.


For example, take the various "image/*" content types.

Since there aren't many images that contain hyperlinks, treating images as
they are literally typed by the server has only the downside of showing a
few more "broken image" icons - not having whole sections of the web cut off.


Another example: the "text/css" content type.

A UA that only supports "text/css" as a styling language is supposed to
ignore style sheets that are typed as anything else, _even_if_ they "look"
like CSS.

Here is an excellent test page by Ian Hickson which demonstrates the (lack
of) conformance in most current UAs:

 http://www.bath.ac.uk/~py8ieh/internet/importtest/extra/linklanguage.html

Both 68a and 68b should be unstyled, because what matters is not only the
'type' attribute in the LINK tag, but the actual mime type as returned by
the server.
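The conformance rule can be sketched like this. The helper name is hypothetical (no real UA exposes such an API); the point is only that for a UA supporting just "text/css", both the LINK tag's advisory 'type' attribute and the authoritative server-sent Content-Type must say "text/css":

```python
def should_apply_stylesheet(link_type_attr: str, served_type: str) -> bool:
    """Illustrative check for a UA that supports only "text/css"."""
    def norm(t: str) -> str:
        # Strip parameters like "; charset=utf-8" and normalize case.
        return t.split(";")[0].strip().lower()
    # The LINK 'type' attribute alone is not enough; the HTTP
    # Content-Type returned by the server must also be "text/css".
    return norm(link_type_attr) == "text/css" and norm(served_type) == "text/css"

# The real-world case below: LINK claims "text/css", server says otherwise,
# so a conforming UA ignores the sheet.
should_apply_stylesheet("text/css", "application/x-pointplus")  # False
```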

Here is a real world example.  This page:

 http://fyi.cnn.com/fyi/teachers.ednews

references this style sheet:

 http://fyi.cnn.com/virtual/2001/style/fyi.css

using a LINK tag which claims the destination is "text/css".  However, the
server returns data of type "application/x-pointplus":


x.x.x.x: GET http://fyi.cnn.com/virtual/2001/style/fyi.css HTTP/1.0
Host: fyi.cnn.com
From: 157.57.210.139
User-Agent: MacNL/1.1
UA-OS: MacOS
UA-CPU: PPC
Accept: */*
Accept-Language: en
Referer: http://fyi.cnn.com/fyi/teachers.ednews/
Accept-Encoding: gzip

x.x.x.x: HTTP/1.1 200 OK
Content-Length: 1964
Age: 535
Date: Mon, 03 Jun 2002 19:15:32 GMT
Content-type: application/x-pointplus
 


Again, in contrast to the "text/plain" example, ignoring such mistyped style
sheets simply results in the user seeing a default-styled view of the
document - the user can still access the various hyperlinks in the document,
and there is no resultant pruning of the web.


E.g. this could have been in the TAG finding instead:

"An example of incorrect behavior is a user-agent that reads some part of
the body of a response and decides to treat it as CSS based on its
containing a ':link' style rule, or based on being referenced as type
'text/css', when it was served as 'application/x-pointplus' or some other
non-CSS type."

[I omitted the word "dangerous" because I still think it is irresponsible to
make such a claim without a specific reference.]


> Another point, if the user agent does not obey a specified content-type -
> possibly displaying something as plain text even though angle brackets exist
> in the content - how will authors know that their configuration files need
> fixing? It will appear to be working fine and the configuration file will
> remain broken.

All validators are user agents, but not all user agents are validators, nor
should they be expected to be.

So, to answer your question of "how ... ?" - authors should _always_
validate their content using one or more of the free validators that exist,
and validators should be very strict about reporting incorrect mime typing,
as for example, the W3C CSS validator does with the above-mentioned style
sheet:

<http://jigsaw.w3.org/css-validator/validator?uri=+http%3A%2F%2Ffyi.cnn.com%
2Fvirtual%2F2001%2Fstyle%2Ffyi.css&warning=2&profile=css2>

"I/O Error: Unknown mime type : application/x-pointplus"
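The kind of strict check a validator performs can be sketched as follows. This is illustrative only, not the W3C validator's actual code; the error string is merely modeled on the validator output quoted above:

```python
def check_mime(expected: str, content_type_header: str) -> str:
    """Compare the served Content-Type against what the author expected
    (illustrative sketch of a validator-style strictness check)."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    served = content_type_header.split(";")[0].strip().lower()
    if served == expected:
        return "OK"
    return "I/O Error: Unknown mime type : " + served

# The fyi.cnn.com style sheet case:
print(check_mime("text/css", "application/x-pointplus"))
```

Running a check like this against one's own server catches the misconfiguration before any UA has to decide whether to sniff around it.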


continuing onward....



On 5/23/02 2:58 PM, "Tim Bray" <tbray@textuality.com> wrote:

> Tantek Çelik wrote:
>>> http://www.w3.org/2001/tag/2002/0129-mime
> 
>> While it is a laudable goal to avoid and/or limit sniffing when at all
>> possible, unsubstantiated comments like these are inflammatory at best, and
>> horribly naive at worst - given how many HTML (.html etc.) pages are still
>> served as text/plain. (Nevermind GIFs and other images served as
>> text/plain).
> 
> Tantek's comments ought to be taken seriously if for no other reason
> than that he's one of the people behind arguably the world's most
> standards-compliant user-agent, namely IE for the Mac.

Thanks for the kind words Tim.


> I do sense a 
> slight tension between his admirable track record of Doing the Right
> Thing and Taking the Consequences and his position here...

I certainly do believe in Doing the Right Thing and Taking the Consequences,
but stopping when it does more harm than good.  At that point I stop and ask
why.  That is what I am doing in this thread.

Hopefully that threshold moves further and further out over the course of
time, and with each new release, developers can do more and more of "the
Right Thing" without adverse consequences.

> Anyhow, for the moment I stand by the position that sniffing is always
> without exception bad when you're figuring out how to do top-level
> dispatch.

In an ideal world yes.  On today's web no.  Again, unless you want to
discriminate against the "web ghetto".


> It opens horrible security holes

Please provide a reference or documentation about how interpreting a
"text/plain" resource as "text/html" opens a security hole.


> and when breakage does
> occur, it focuses the blame away from where it belongs, namely people
> who screw up in configuring their webservers.

Is it the responsibility of a conforming UA to properly focus blame?  I must
have missed that conformance requirement. 1/2 ;-)


> I think if someone serves
> a gzipped file or an XML file or a plain text file as text/html, or an
> HTML file as text/plain, then this *should* break visibly in users'
> browsers.  Similarly if they serve an HTML file as text/plain.

Both "text/plain" and "text/html" are horribly polluted mime types as far as
today's web is concerned.  Pretending otherwise doesn't all of a sudden
magically make them work.


> Why? 
> Well, that's Tantek's next point.
> 
>>   "Web software SHOULD NOT attempt to recover from such errors by guessing,
>> but SHOULD report the error to the user to allow intelligent corrective
>> action."
>> 
>> Typically a user of a web site does not have the ability to correct the
>> website itself.  Nevermind perform an "intelligent corrective action".
>> Which usability genius decided that it was a good idea to report errors to
>> the user that are meaningless to the typical user (typical user has zero
>> knowledge about mime types) and the user has no chance of fixing?
>> 
>> If a UA did report such errors with a web site, the typical user would take
>> the corrective action they usually take when errors are reported from a
>> website, and that is to try a different UA.
> 
> No.  Most people are only vaguely aware that there are other browsers
> than what they're using, and on your typical off-the-shelf Wintel or Mac
> box these days, quite likely IE is all there is.

This has to be the largest collective misconception in the history of the
web.

For one thing, IE5/Mac (that brave UA that Did the Right Thing and Took the
Consequences), achieved about 60% share (up from 15% or so for IE4/Mac) on
Macs through _downloads_alone_ in the first six months of availability.  It
wasn't until at least a year and a half later that IE5/Mac made it into
Apple's OS bundling (along with the latest NS4.x of course).

Perhaps Mac users are different (no, I know Mac users are different), but I
think it is greatly underestimated how many people download and run
"alternative" or "updated" browsers - if those browsers are worth the time
to download (see various CNET reviews of browsers etc. as reference).  It is
a well-known fact that the average Mac user has 3+ browsers installed and
uses 2+ of them regularly.


> In fact, what people will do is, if they care at all about the website
> content, to try to find a way to complain (appropriate) and if they
> don't care that much, they go elsewhere (appropriate).

Well, speaking from experience that was not the response we got with
IE4/Mac, which had scripting error alerts turned on by default.  Whenever
IE4/Mac reported a scripting error - users' first response was to blame the
browser - not the site, and then switch to another browser.  In fact they
thought the error was a problem in the application itself - indicating some
sort of instability.

So, in IE5/Mac, we turned off scripting error alerts by default and stuck
with the little "broken script" warning icon in the status bar.  All of a
sudden people thought the browser was "much more stable" on
scripting-intensive sites.

Bottom line: users associate error dialogs with the _application_ that
displays them, not the _content_ to which they may be related.

Just after the bottom line: I still think it is inappropriate for the TAG to
be making this kind of specific user interface recommendation.  After all it
is the "TAG", not the "UIAG".



onward....


On 5/30/02 6:22 AM, "Al Gilman" <asgilman@iamdigex.net> wrote:

> At 05:33 AM 2002-05-30, Steven Pemberton wrote:
> 
>> From: "Tim Bray" <tbray@textuality.com>
>>> Anyhow, for the moment I stand by the position that sniffing is always
>>> without exception bad when you're figuring out how to do top-level
>>> dispatch.  It opens horrible security holes and when breakage does
>>> occur, it focuses the blame away from where it belongs, namely people
>>> who screw up in configuring their webservers.
>> 
>> I think this is the major problem: it protects the guilty, and penalises the
>> innocent.
>> 
>> I have long wanted to be able to write web-based tutorials along these
>> lines:
>> 
>>    Here is an HTML document to illustrate this technique:
>>        http://www.cwi.nl/~steven/test/img-test.html
>>    and here is its source: http://www.cwi.nl/~steven/test/img-test.txt
>> 
>> but I can't, because in IE you get exactly the same results for the two
>> links (try it on the above). How can I get IE to do the right thing? I
>> can't! It *always* presents my file served as text/plain as if I had served
>> it as text/html. It prevents people from doing the right thing...
> 
> How committed are the consumers of your web-based tutorials?  Are they casual
> Web walk-ons, or have they enrolled in something?
> 
> Two out of the three browsers I happen to have installed do this the way you
> want.  The other two are readily available for free.  In some educational
> situations, asking the users to install some client or client extensions is
> de_rigeur.  You don't have to specify the browser, just the standards
> compliance profile that they meet, and you can inform your customers of know
> non-compliant configurations.  That is why I ask how persistent is your
> relationship with the learners.
> 
> If the dominant player does something wrong, this is a problem of enterprise
> proportions for the Web.

And this cuts both ways.  What is wrong?  Cutting off websites?  Or
sniffing?  Both?  It is a damned if you do and damned if you don't
situation.


> But still, it is not clear that the TAG list is the
> place to pursue a specific beef against a specific implementation.  If that is
> all we are dealing with, we don't need to wrap it in the trappings of TAG
> findings and architectural principles.

Agreed.  Thanks for pointing this out Al.


> In this case, control by user supervision alone is inadequate.  Users accept
> the unquestioned 'upgrade' to HTML processing in such proportions that this
> behavior is what the market will bear.  The system is showing not enough
> stiffness against violating _what the author said and expects to happen_.
> 
> We need both a stiffer control loop around what the servers serve as type
> information, and a more faithful response from user agents.  These are
> interdependent.  So we need a roadmap for a closed-loop transition practice
> that will migrate the operating point to where we would rather be.  This does
> sound like a 'delegate to QA' perhaps.

Agreed.  And stronger wording for content will help with this "control
loop".


Steven Pemberton wrote:

>> And if IE thinks your tar archive is an HTML file, well bad luck for you and
>> your users.
>> 
>> It would be *really really good* if IE offered an option to switch off
>> content switching, and even a dialogue, so that people could get an idea
>> that something was wrong:
>> 
>>    This document has been served as text/plain but looks like an HTML file.
>>    What do you want to do:
>>        [ ] View it as HTML
>>        [ ] View it as text
>> 
>>    [ ] Never ask me this question again.
>> 
>> Steven Pemberton

An excellent suggestion Steven - and for that matter a much better
suggestion than the error reporting that is recommended in the TAG document.
I nominate Steven for the UIAG.

I'll see what I can do.


Finally, I will note that the TAG document:

 http://www.w3.org/2001/tag/2002/0129-mime

has been updated with the statement: "The TAG notes that Tantek Çelik
expressed dissent about this finding."

I don't think that is a precisely correct statement.

I expressed dissent about two very specific things in the finding:

1. This statement is wrong/inflammatory/exaggerated etc.

"An example of incorrect and dangerous behavior is a user-agent that reads
some part of the body of a response and decides to treat it as HTML based on
its containing a <!DOCTYPE declaration or <title> tag, when it was served as
text/plain or some other non-HTML type."

2. I think it is inappropriate for the TAG to make specific user interface
recommendation such as the second half of this statement:

"Web software SHOULD NOT attempt to recover from such errors by guessing,
but SHOULD report the error to the user to allow intelligent corrective
action."

I do not have issues with the rest of the finding, and in fact, have
stated[1] that it is a good idea to avoid sniffing - which is what I believe
the finding is trying to state.

I have documented recommended alternative wording in this email with respect
to better examples (to address my point 1.), and still maintain that the
second clause (beginning with the words "but SHOULD report") in my point 2.
should be stricken.


Thanks,

Tantek


[1]
 http://lists.w3.org/Archives/Public/www-tag/2002May/0121.html
Received on Sunday, 9 June 2002 21:32:28 UTC
