- From: Roy T. Fielding <fielding@avron.ICS.UCI.EDU>
- Date: Wed, 11 Jan 1995 01:12:35 -0800
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
I have several things to say regarding the character set issue
which reflect the reality of Web technology and how it applies to
the HTTP standards process.
First, regarding standards:
As Dan said, the "official" standard will always lag behind existing
practice -- that is by design. Although some of the drafts we produce
will include things that have not yet been implemented, they will not
be submitted for RFC consideration until we have at least two independent,
working implementations of everything that is contained in the final
draft. Note that this is also why we work on several versions of the
protocol at the same time -- good ideas that are not implemented get
shoved off to a later version.
Second, regarding character sets:
A character set defines the table of codes used to associate small
groups of bits within a document to their individual semantics
(in most cases, a common symbol to be manipulated and/or displayed).
It does not define the format of the overall document, nor does it
have any effect on the language(s) used within the document (other than
the incidental one that some languages cannot be represented using
the symbols defined by some character sets). Some interesting discussion
of languages and character set issues can be found in the Internet-Draft
<draft-ietf-mailext-lang-char-00.txt>.
Character set names for use in Internet protocols are registered with
IANA and listed in STD 2 (RFC 1700). This is what MIME uses, and what
HTTP will use. The current list includes:
US-ASCII
ISO-8859-1 ISO-8859-2 ISO-8859-3
ISO-8859-4 ISO-8859-5 ISO-8859-6
ISO-8859-7 ISO-8859-8 ISO-8859-9
Note that if your favorite character set is not listed in the above,
someone needs to get off their duff and have it registered by IANA.
There are many others that have been listed in related RFCs, e.g.
UNICODE-1-1 UNICODE-1-1-UTF-8 UNICODE-1-1-UTF-7
ISO-2022-JP ISO-2022-JP-2
ISO-2022-KR
There is also a separate list of character set names in STD 2 that is
not yet used by MIME. These are names approved for Internet documentation
(but not necessarily Internet protocols). The registry is at
<ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets>.
Note that none of the above is specific to HTTP, nor do we have any
intention of making it specific to HTTP.
Third, regarding media types:
A media type is an association between a (usually) large number of
bits and a document format. Most document formats have a default
character set which defines how the bits are grouped into meaningful
symbols. Some media types allow the character set to be defined
within the document itself. Some media types that do not have such
a capability (including those called text/* by MIME) are provided with
a charset="" parameter such that the same media type can be used with
character sets other than the default. The IANA registry is at:
<ftp://ftp.isi.edu/in-notes/iana/assignments/media-types>.
The character set is ALWAYS a feature of the media type, regardless
of whether or not the charset parameter is present. MIME defines the
default charset of all text/* types as US-ASCII. HTTP defines the default
charset of all text/* types as ISO-8859-1. In both cases, the default
can be overridden by including a charset="" parameter. Sending a
document containing bits intended to encode UTF-7 characters with the
header
Content-type: text/html
is ridiculous. If you want to send a UTF-7 encoded document, it must
be sent with (case-insensitive)
Content-type: text/html; charset="unicode-1-1-utf-7"
Like character sets, this is not an issue for discussion under HTTP.
HTTP simply uses what has already been defined for other Internet
protocols. Changing the default for text/* was a touchy issue, but
that reflects the reality of Web technology being better than SMTP
and has little effect on accessing older protocols through HTTP.
Changing the default to "implementation dependent" would make the
specification as broken as these clients.
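
As an illustration of the default-and-override rule for text/* types described above, here is a minimal Python sketch. The function name effective_charset and its protocol argument are hypothetical, invented for this example. Splitting on ";" is a simplification that assumes no semicolons appear inside quoted parameter values.

```python
def effective_charset(content_type, protocol="http"):
    """Return (media_type, charset) for a Content-type header value.

    Hypothetical helper: applies the defaults stated in the message
    (ISO-8859-1 for HTTP, US-ASCII for MIME) to text/* types only.
    """
    defaults = {"http": "iso-8859-1", "mime": "us-ascii"}

    # Simplification: assumes no ';' inside quoted parameter values.
    parts = [p.strip() for p in content_type.split(";")]
    media_type = parts[0].lower()

    charset = None
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Names are compared case-insensitively; quotes are optional.
            charset = value.strip().strip('"').lower() or None

    # No explicit charset: text/* falls back to the protocol default.
    if charset is None and media_type.startswith("text/"):
        charset = defaults[protocol]
    return media_type, charset


print(effective_charset("text/html"))
print(effective_charset('text/html; charset="unicode-1-1-utf-7"'))
```

Under these assumptions, the bare text/html header resolves to ISO-8859-1 over HTTP, which is why it cannot correctly label a UTF-7 document.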
Fourth, regarding applications that can't handle parameters on media types:
Fix them. Let's face it folks, we can't allow broken implementations
to limit the extensibility of the protocol. Browsers that cannot parse
parameters will never be able to handle character sets other than
ISO-8859-1. Nor will they be able to handle future versions of HTML
(which will be indicated by a level parameter). Thus, it makes perfect
sense for them to have to treat the response as application/octet-stream
(the default behavior) if the content-type is unknown.
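
To make the consequence concrete, the following Python sketch contrasts a client that cannot parse parameters with one that can. KNOWN_TYPES and both function names are hypothetical and stand in for whatever table of supported types a real browser keeps.

```python
# Hypothetical set of media types this client knows how to render.
KNOWN_TYPES = {"text/html", "text/plain", "image/gif"}

FALLBACK = "application/octet-stream"


def naive_type(content_type):
    """Client that cannot parse parameters: compares the whole value."""
    value = content_type.strip().lower()
    return value if value in KNOWN_TYPES else FALLBACK


def parameter_aware_type(content_type):
    """Client that separates the media type from its parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type if media_type in KNOWN_TYPES else FALLBACK


header = 'text/html; charset="unicode-1-1-utf-7"'
print(naive_type(header))
print(parameter_aware_type(header))
```

In this sketch the naive client does not recognize the parameterized value and falls back to application/octet-stream, while the parameter-aware client still sees text/html.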
Finally, regarding an Accept-charset or Accept-parameter header:
I believe it is a mistake to continue loading down the request syntax
with content negotiation information which is pointless 99.9999999%
of the time. On my system, a complete accept-charset listing would
add an additional 361 bytes to every request, and that's with a small
set of X fonts. The likelihood of me ever needing that information
is nil. To be useful, there would need to be a substantial set of
documents that are available in multiple character sets.
In the rare case where parallel sets of documents exist, it makes more
sense to provide a means for automated redirection via URCs and allow
the client to determine which is the "best" of available options.
A non-automated equivalent is the "click here for our Unicode version".
Sure, it's ugly, but nowhere near as ugly as 2 million clients trying
to ask ahead-of-time for every possible format and every possible
character set and every possible language accepted by the client for
the billion documents which are only available in a single
format/language/character set.
Having said that, I do expect that an Accept-charset header will appear
in HTTP/1.1. However, it will be used as all Accept-* headers should
be used -- only when the client wants to specifically restrict the
result to a set of options different from the default (as is the case for
in-line images today).
Accept-charset: UNICODE-1-1-UTF-8, iso-8859-*
would indicate that only UNICODE-1-1-UTF-8 and the iso-8859 set are
allowed as a response to this request.
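
A server-side check for such a header might look like the Python sketch below. The message does not define matching semantics, so two things here are assumptions: the trailing "*" is treated as a simple glob, and an absent header is treated as "no restriction". The function name charset_acceptable is hypothetical.

```python
from fnmatch import fnmatchcase


def charset_acceptable(accept_charset, charset):
    """Return True if `charset` satisfies an Accept-charset header value.

    Assumptions (not defined by the message): '*' is a glob wildcard,
    and a missing header (None) places no restriction on the response.
    """
    if accept_charset is None:
        return True

    # Comma-separated list of names or patterns, compared in lowercase.
    patterns = [p.strip().lower() for p in accept_charset.split(",") if p.strip()]
    name = charset.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


header = "UNICODE-1-1-UTF-8, iso-8859-*"
print(charset_acceptable(header, "ISO-8859-5"))
print(charset_acceptable(header, "ISO-2022-JP"))
```

With this reading, any ISO-8859 part and UNICODE-1-1-UTF-8 pass the check, and everything else is excluded only when the client chose to send the header.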
BTW, Accept-parameter is not useful; charset is the only parameter
shared by multiple media types. We could invent some new parameters,
but that only makes the problem worse. Also, a quality attribute is
only useful if we add it to the content-negotiation algorithm --
something I would like to avoid (like the plague that it has become).
......Roy Fielding ICS Grad Student, University of California, Irvine USA
<fielding@ics.uci.edu>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>
Received on Wednesday, 11 January 1995 01:16:42 UTC