charset flap

Roy,

The rough draft minutes didn't cover the full discussion.

1. WHY CHANGE draft...-05?

The primary observation was that draft-05 introduced an
INCOMPATIBILITY with HTTP/1.0 in that it changed the *meaning* of a
response in an incompatible way, and with a severe loss of
functionality. In HTTP/1.0, in order to reflect current practice,
untagged text <<content-type: text/html>> is interpreted as "charset
is unspecified, recipient must guess". Draft-05 added language that
changes this meaning and is therefore incompatible with HTTP/1.0:

> The "charset" parameter is used with some media types to define the
> character set (section 3.4) of the data. When no explicit charset
> parameter is provided by the sender, media subtypes of the "text" type
> are defined to have a default charset value of "ISO-8859-1" when
> received via HTTP. Data in character sets other than "ISO-8859-1" or its
> subsets MUST be labeled with an appropriate charset value.
 
This language is not only incompatible with HTTP/1.0, it is not in
conformance with what we believe will be future directions for other
Internet protocols; there is no reason to place ISO-8859-1 in this
position in HTTP. 

Furthermore, draft-05 offers no recommended way to state explicitly
what is the default situation in HTTP/1.0, namely that the charset is
not known.

So, these are sufficient reasons to consider a change to the -05
specification.
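
To make the incompatibility concrete, here is a minimal sketch of how a
recipient would interpret the same unlabelled response under the two
rules. The function names and the "http10"/"draft05" rule labels are
hypothetical, chosen only for illustration; they come from neither
specification.

```python
from email.message import Message


def charset_param(content_type):
    """Return the explicit charset parameter (lower-cased), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return value.lower() if isinstance(value, str) else None


def effective_charset(content_type, rule):
    """Charset a recipient assumes; None means "unspecified, must guess".

    rule is a hypothetical label: "http10" for HTTP/1.0 practice,
    "draft05" for the default introduced by the draft-05 text quoted above.
    """
    explicit = charset_param(content_type)
    if explicit is not None:
        return explicit
    if not content_type.strip().lower().startswith("text/"):
        return None  # the default discussed here applies only to text types
    if rule == "http10":
        return None  # charset unspecified, recipient must guess
    if rule == "draft05":
        return "iso-8859-1"  # draft-05 default for unlabelled text
    raise ValueError("unknown rule: " + rule)
```

The same header, "text/html" with no parameter, yields "must guess"
under the first rule and "ISO-8859-1" under the second, which is the
change in meaning described above.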

2. COMPATIBILITY WITH HTTP/1.0

The issue concerns the labelling of the charset of text/ entity bodies
in HTTP/1.1 messages. For HTTP/1.1 _response_ messages, a server can
respond differently to an HTTP/1.0 request and an HTTP/1.1 request,
and doing so will be recommended implementation advice for graceful
deployment.

As you say, "there is nothing in HTTP that prevents a site, if it so
desires, from tagging all text types with an appropriate charset
parameter". However, HTTP/1.1 implementations must be prepared to deal
with an explicit charset parameter.

In the case of labelling HTTP requests as opposed to responses, the
version of the server may not be known.  However, the issue concerns
only the charset label on an entity body of type "text" in requests,
and generally only PUT and POST are sent with entity bodies in
HTTP/1.1.  POST requests are generally not sent with a content-type of
text (application/x-www-form-urlencoded being most common) and PUT is
generally only practiced between proprietary clients and their
corresponding servers. So it was believed that there was not a
compatibility issue with current practice in requiring that all entity
bodies be labelled with their charset.

3. HTTP/1.1 <-> HTTP/1.0 gateways

We discussed the issue of what an HTTP/1.1 proxy might do with an
entity body that was received from an HTTP/1.0 server without a charset
label. In general, it is deemed more reliable not to give "no label"
a special meaning that cannot otherwise be represented. Other
Internet protocols use "charset=x-unknown" to represent the situation
where the character set is unknown.

This seemed like a reasonable practice to recommend to gateways.
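
A rough sketch of what such a gateway might do with the Content-Type
value it forwards, assuming it only ever adds a label and never rewrites
one that is already present. The function name is hypothetical.

```python
def label_for_gateway(content_type):
    """Add charset=x-unknown to an unlabelled text/* Content-Type value."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if not media_type.startswith("text/"):
        return content_type  # only text types are at issue here
    params = [p.strip() for p in content_type.split(";")[1:]]
    if any(p.lower().startswith("charset=") for p in params):
        return content_type  # the origin server already labelled it
    # The HTTP/1.0 origin said nothing, so say so explicitly.
    return content_type.rstrip().rstrip(";") + "; charset=x-unknown"
```

The point of the explicit label is that "unknown" then survives any
number of further hops, whereas a missing label would not.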

4. Upgrading CGI & programs to HTTP/1.1

We discussed how current servers that were implementing HTTP/1.1 but
not upgrading CGI programs might label their data. It seemed
reasonable to assume that at a given site, if the CGI program did not
itself supply a charset parameter for the content-type of the return
value, the server might supply one itself based on the system default.
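
As an illustration only, a server could patch the header lines returned
by an unmodified CGI program along these lines. The SITE_DEFAULT_CHARSET
value and the function name are assumptions made for the example; each
site would configure its own default.

```python
SITE_DEFAULT_CHARSET = "iso-8859-1"  # hypothetical per-site setting


def fix_cgi_headers(header_lines, default_charset=SITE_DEFAULT_CHARSET):
    """Return CGI header lines with a charset added to unlabelled text types."""
    fixed = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            value = value.strip()
            is_text = value.lower().startswith("text/")
            has_charset = "charset=" in value.lower()
            if is_text and not has_charset:
                # The CGI program did not say, so apply the site default.
                value = f"{value}; charset={default_charset}"
            line = f"{name}: {value}"
        fixed.append(line)
    return fixed
```

A charset supplied by the CGI program itself is left alone, so programs
that are later upgraded to label their own output keep control of it.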

5. MUST vs. SHOULD

In the end, there was a choice:

 a) charset SHOULD be supplied with all responses
    no label means "US-ASCII superset, you guess"

    (I think this would be equivalent to changing "ISO-8859-1" to
    "US-ASCII" in the draft)

 b) charset MUST be supplied with all responses
    explicit "charset=x-unknown" if that's the case.

I believe choice (b) was acceptable to everyone in the room, including
HTTP/1.1 client and server implementors. The two choices are
practically the same except that choice (b) will promote the more
frequent use of an explicit "charset=x-unknown" for content where that
is the case.

Neither choice would seem to cause compatibility difficulties with
HTTP/1.0 clients or servers given a few precautions in servers and
version gateways.

Received on Thursday, 27 June 1996 03:14:41 UTC