Re: charset flap

> 1. WHY CHANGE draft...-05?
> 
> The primary observation was that draft-05 introduced an
> INCOMPATIBILITY with HTTP/1.0 in that it changed the *meaning* of a
> response in an incompatible way, and with a severe loss of
> functionality. In HTTP/1.0, in order to reflect current practice,
> untagged text <<content-type: text/html>> is interpreted as "charset
> is unspecified, recipient must guess". We added language to change the
> meaning of this, and this language was incompatible with 1.0:

Observation of what?  What loss of functionality?  What browsers were
included in this observation?  Which ones were ignored?

In HTTP/1.0, an untagged charset means ISO-8859-1 or one of its subsets,
period.  All other interpretations are either broken or local derivatives
(local meaning that the user has configured the system to change the
default to one understood within its own range of usage, outside the
scope of an Internet standard).  We have discussed this on and off the
list over a hundred times, and the result is always the same: the draft
is correct as it stands.

ISO-8859-1 is the default because that is how the CERN libwww (the basis
for most HTTP clients prior to Netscape), libwww-perl (the basis for most
HTTP clients using Perl), and the www-related Python modules all interpret
the absence of a charset.  Moreover, some clients like Mosaic 2.4 (and
earlier) treat an explicit charset value as a different type and refuse
to display the entity even if the charset is ISO-8859-1.

There did exist one client, Mosaic-i18n, which was hacked to guess the
charset from those it supported.  Although this worked in a locally
controlled environment (i.e., where everyone is using Mosaic-i18n or an
equivalent), it resulted in garbage being displayed on other browsers.
It was, in the end, the wrong solution to a common problem -- even the
developers agreed that the RIGHT solution was to change servers and
clients to support the charset parameter on media types, and to send the
charset parameter ONLY when the charset was not ISO-8859-1.  This
implementation decision (it was NOT a change to the protocol) allows
special charsets to be identified and used by clients that understand
them, and allows clients that only understand ISO-8859-1 to behave
correctly even if they are old and do not parse parameters.
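
For the record, that decision amounts to nothing more than the following
(a sketch in Python with a hypothetical helper name, not code taken from
any particular server):

    def content_type_for(subtype, charset=None):
        """Build a Content-Type value for a text entity (hypothetical helper)."""
        media_type = "text/" + subtype
        if charset is None or charset.lower() in ("iso-8859-1", "us-ascii"):
            # ISO-8859-1 (or a subset): rely on the default and send no
            # parameter, so old clients that don't parse parameters still work.
            return media_type
        # Anything else is identified explicitly for clients that understand it.
        return media_type + "; charset=" + charset

    print(content_type_for("html"))                 # text/html
    print(content_type_for("html", "ISO-2022-JP"))  # text/html; charset=ISO-2022-JP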

Under no circumstances has it ever been true that the default charset
for media types in HTTP has been anything other than ISO-8859-1.  Any
application that treats it differently is BROKEN.

>> The "charset" parameter is used with some media types to define the
>> character set (section 3.4) of the data. When no explicit charset
>> parameter is provided by the sender, media subtypes of the "text" type
>> are defined to have a default charset value of "ISO-8859-1" when
>> received via HTTP. Data in character sets other than "ISO-8859-1" or its
>> subsets MUST be labeled with an appropriate charset value.
>  
> This language is not only incompatible with HTTP/1.0, it is not in
> conformance with what we believe will be future directions for other
> Internet protocols; there is no reason to place ISO-8859-1 in this
> position in HTTP. 

Six years of historical usage and all available software source code
say differently.  This claim is bogus.

> So, these are sufficient reasons to consider a change to the -05
> specification.

How do they differ from the reasons discussed the last time the WG
decided this issue?  Why has this issue been reopened when all other
discussed-to-death issues remain closed?

> 2. COMPATIBILITY WITH HTTP/1.0
> 
> The issue concerns the labelling of the charset of text/ entity bodies
> in HTTP/1.1 messages. In HTTP/1.1 _response_ messages, it is possible,
> and will be recommended implementation advice, that for graceful
> deployment a server might respond differently to a HTTP/1.0 request
> and a HTTP/1.1 request.

Nonsense -- that is just plain wrong.  The HTTP version defines the
protocol capabilities of the two adjacent parties to the communication,
at the time they are communicating.  The HTTP version cannot be used to
vary the content delivered EXCEPT when that content is defined as being
hop-by-hop (e.g., transfer encodings, 1xx status responses, communication
options like persistence/keep-alive, etc.).  The HTTP version cannot be
used to alter the payload itself because there is no guarantee that all
eventual recipients of that content will be using that version of HTTP
(or even HTTP at all).

Furthermore, CORRECT HTTP/1.0 clients *will* support the charset
parameter because it is part of HTTP/1.0 -- the problem exists with
delivering new content to out-of-date or poorly implemented browsers --
so differentiating based on the protocol version will only create a
new problem for CORRECT clients.

> As you say, "there is nothing in HTTP that prevents a site, if it so
> desires, from tagging all text types with an appropriate charset
> parameter". However, HTTP/1.1 implementations must be prepared to deal
> with an explicit charset parameter.

Not applicable.

> In the case of labelling HTTP requests as opposed to responses, the
> version of the server may not be known.  However, the issue concerns
> only the charset label on an entity body of type "text" in requests,
> and generally only PUT and POST are sent with entity bodies in
> HTTP/1.1.  POST requests are generally not sent with a content-type of
> text (application/x-www-form-urlencoded being the most common) and PUT is
> generally only practiced between proprietary clients and their
> corresponding servers. So it was believed that there was not a
> compatibility issue with current practice in requiring that all entity
> bodies be labelled with their charset.

On requests, that is generally correct.  At the same time, there is no
compatibility issue with the *default* charset on all entities being
ISO-8859-1.
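
The recipient-side rule being defended here is equally simple; a minimal
sketch (hypothetical helper, illustration only):

    def effective_charset(content_type_value):
        """Charset a recipient should assume for an entity (hypothetical helper)."""
        parts = [p.strip() for p in content_type_value.split(";")]
        media_type = parts[0].lower()
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                return value.strip().strip('"')   # explicit label always wins
        if media_type.startswith("text/"):
            return "ISO-8859-1"   # the HTTP default for untagged text
        return None               # no charset default for non-text types

    print(effective_charset("text/html"))                        # ISO-8859-1
    print(effective_charset("text/plain; charset=ISO-2022-JP"))  # ISO-2022-JP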

> 3. HTTP/1.1 <-> HTTP/1.0 gateways
> 
> We discussed the issue of what a HTTP/1.1 proxy might do with an
> entity body that was received from an HTTP/1.0 server without a charset
> label. In general, it is deemed more reliable to not have "no label"
> have a special meaning that cannot be otherwise represented. Other
> Internet protocols use "charset=x-unknown" to represent the situation
> where the character set was otherwise unknown.

What other Internet protocols?  MIME doesn't.  HTTP doesn't.  FTP doesn't
have any such notion at all.  All current practice of HTTP, including
both specifications, will treat that as the experimental charset "x-unknown"
and refuse to display the entity.  Test it and see.
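
To spell it out: a charset-aware client checks the label against the
charsets it can render, so an explicit "x-unknown" fails that check
outright where an untagged entity would have been displayed.  A sketch
(hypothetical client model, assuming a typical mid-95 charset repertoire):

    SUPPORTED_CHARSETS = {"iso-8859-1", "us-ascii"}   # assumed client repertoire

    def can_display(charset_label):
        if charset_label is None:
            charset_label = "ISO-8859-1"   # untagged text: fall back to the default
        return charset_label.lower() in SUPPORTED_CHARSETS

    print(can_display(None))          # True  -- untagged text is displayed
    print(can_display("ISO-8859-1"))  # True
    print(can_display("x-unknown"))   # False -- unsupported charset, not displayed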

> This seemed like a reasonable practice to recommend to gateways.

That would be suicidal for any implementor -- I certainly won't
implement something that breaks all current practice.

> 4. Upgrading CGI & programs to HTTP/1.1
> 
> We discussed how current servers that were implementing HTTP/1.1 but
> not upgrading CGI programs might label their data. It seemed
> reasonable to assume that at a given site, if the CGI program did not
> itself supply a charset parameter for the content-type of the return
> value, the server might supply one itself based on the system default.

Not an issue -- CGI is part of the server implementation.
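
A server that wanted to do what point 4 describes could do it trivially
within its own implementation; a sketch (hypothetical helper, assuming
the server knows its configured system default):

    SYSTEM_DEFAULT_CHARSET = "ISO-8859-2"   # assumed site-wide configuration

    def label_cgi_content_type(value):
        """Add the system default charset to unlabeled text types (hypothetical)."""
        value = value.strip()
        if value.lower().startswith("text/") and "charset=" not in value.lower():
            return value + "; charset=" + SYSTEM_DEFAULT_CHARSET
        return value

    print(label_cgi_content_type("text/html"))
    # -> text/html; charset=ISO-8859-2
    print(label_cgi_content_type("text/plain; charset=KOI8-R"))
    # -> left as-is, the CGI program already supplied a label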

> 5. MUST vs. SHOULD
> 
> In the end, there was a choice:
> 
>  a) charset SHOULD be supplied with all responses
>     no label means "US-ASCII superset, you guess"
> 
>     (I think this would be equivalent to changing "ISO-8859-1" to
>     "US-ASCII" in the draft)

No, it wouldn't -- it would force servers to be incompatible with older
clients, for no useful reason (aside from political puffery), in order to
be unconditionally compliant.  In other words, I would have to go out and
recommend that everyone be only conditionally compliant with the
specification, because only a fool would obey such a restriction.

>  b) charset MUST be supplied with all responses
>     explicit "charset=x-unknown" if that's the case.
> 
> I believe choice (b) was acceptable to everyone in the room, including
> HTTP/1.1 client and server implementors. The two choices are
> practically the same except that choice (b) will promote the more
> frequent use of an explicit "charset=x-unknown" for content where that
> is the case.

And break all older (pre-mid-1995) clients that share a common cache with
any HTTP/1.1 clients.  Brilliant.

> Neither choice would seem to cause compatibility difficulties with
> HTTP/1.0 clients or servers given a few precautions in servers and
> version gateways.

Prove it.  I believe otherwise, and I have considerably more experience with
both the protocol and its implementations than anyone who might have
been in attendance at Montreal.  Demonstrate to the WG that you have
adequately tested HTTP applications within an environment reasonably
representative of, say, 6 months from now when we are trying to deploy
HTTP/1.1 systems.  Do it within the scope of the design goals explicitly
stated (by me) for HTTP/1.1, and I will rescind my objection if it does
not cause the systems to fail or significantly reduce performance; that is,
test it with a system of hierarchical caching, on a regional or
national scale, where you do not have control over all parties to
the communication.

BTW, as you well know, consensus at a WG meeting only means consensus
among those in attendance -- not WG consensus, rough or otherwise.
Furthermore, I will remind you that we agreed not to make controversial
changes in HTTP/1.1, since the community needs a proposed standard NOW,
not six months from now.  I don't see why this issue, which attempts
to solve a political problem by inventing untested and misinformed
technical "solutions", should be treated any differently from the
other real problems that have been postponed.

The charset issue is not a technical problem.  Any application written
according to the specification as it stands will interoperate with
any correctly-implemented HTTP/1.0 or HTTP/1.1 application.  Any server
that wishes to include an explicit charset value on all text types may
do so -- nothing prevents it aside from the fear of misinterpretation by
older clients, over which we have no control.

Making incompatible changes to the protocol to solve an implementation-
dependent non-problem is just plain stupid, particularly when we are
supposed to be FINISHED already.


 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92717-3425    fax:+1(714)824-4056
    http://www.ics.uci.edu/~fielding/
