charset Reality Check

I have several things to say regarding the character set issue
which reflect the reality of Web technology and how it applies to
the HTTP standards process.

First, regarding standards:

   As Dan said, the "official" standard will always lag behind existing
   practice -- that is by design.  Although some of the drafts we produce
   will include things that have not yet been implemented, they will not
   be submitted for RFC consideration until we have at least two independent,
   working implementations of everything that is contained in the final
   draft.  Note that this is also why we work on several versions of the
   protocol at the same time -- good ideas that are not implemented get
   shoved off to a later version.

Second, regarding character sets:

   A character set defines the table of codes used to associate small
   groups of bits within a document with their individual semantics
   (in most cases, a common symbol to be manipulated and/or displayed).
   It does not define the format of the overall document, nor does it
   have any effect on the language(s) used within the document (other than
   the incidental one that some languages cannot be represented using
   the symbols defined by some character sets).  Some interesting discussion
   of languages and character set issues can be found in the Internet-Draft
   <draft-ietf-mailext-lang-char-00.txt>.

   Character set names for use in Internet protocols are registered with
   IANA and listed in STD 2 (RFC 1700).  This is what MIME uses, and what
   HTTP will use.  The current list includes:

      US-ASCII
      ISO-8859-1  ISO-8859-2  ISO-8859-3
      ISO-8859-4  ISO-8859-5  ISO-8859-6
      ISO-8859-7  ISO-8859-8  ISO-8859-9

   Note that if your favorite character set is not listed above,
   someone needs to get off their duff and have it registered with IANA.

   There are many others that have been listed in related RFCs, e.g.

      UNICODE-1-1  UNICODE-1-1-UTF-8  UNICODE-1-1-UTF-7
      ISO-2022-JP  ISO-2022-JP-2
      ISO-2022-KR

   There is also a separate list of character set names in STD 2 that is
   not yet used by MIME.  These are names approved for Internet documentation
   (but not necessarily Internet protocols).  The registry is at
   <ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets>.

   Note that none of the above is specific to HTTP, nor do we have any
   intention of making it specific to HTTP.
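
   To illustrate the definition of a character set as a table of codes,
   here is a minimal Python sketch that decodes the same octet under
   three of the ISO-8859 tables.  The helper name symbol_name and the
   choice of octet 0xA3 are illustrative assumptions; the sketch relies
   only on the codecs that ship with Python.

```python
import unicodedata

# One octet, three code tables: the bits are identical, only the table differs.
RAW = bytes([0xA3])

def symbol_name(data: bytes, charset: str) -> str:
    """Decode a single-octet string and return the Unicode name of its symbol."""
    return unicodedata.name(data.decode(charset))

# Map each charset label to the symbol that the octet stands for under it.
names = {cs: symbol_name(RAW, cs)
         for cs in ("iso-8859-1", "iso-8859-2", "iso-8859-5")}

for charset, name in names.items():
    print(charset, "->", name)
```

   Under ISO-8859-1 the octet is a pound sign, under ISO-8859-2 it is a
   capital L with stroke, and under ISO-8859-5 it is a Cyrillic letter.
   The document format and the language are untouched; only the table
   used to read the bits changes.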

Third, regarding media types:

   A media type is an association between a (usually) large number of
   bits and a document format.  Most document formats have a default
   character set which defines how the bits are grouped into meaningful
   symbols.  Some media types allow the character set to be defined
   within the document itself.  Some media types which do not have such
   a capability (including those called text/* by MIME) are provided with
   a charset="" parameter such that the same media type can be used with
   character sets other than the default.  The IANA registry is at:
   <ftp://ftp.isi.edu/in-notes/iana/assignments/media-types>.

   The character set is ALWAYS a feature of the media type, regardless
   of whether or not the charset parameter is present.  MIME defines the
   default charset of all text/* types as US-ASCII.  HTTP defines the default
   charset of all text/* types as ISO-8859-1.  In both cases, the default
   can be overridden by including a charset="" parameter.  Sending a
   document containing bits intended to encode UTF-7 characters with the
   header

         Content-type: text/html

   is ridiculous.  If you want to send a UTF-7 encoded document, it must
   be sent with (case-insensitive)

         Content-type: text/html; charset="unicode-1-1-utf-7"

   Like character sets, this is not an issue for discussion under HTTP.
   HTTP simply uses what has already been defined for other Internet
   protocols.  Changing the default for text/* was a touchy issue, but
   that reflects the reality of Web technology being better than SMTP
   and has little effect on accessing older protocols through HTTP.
   Changing the default to "implementation dependent" would make the
   specification as broken as these clients.
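
   The defaulting rule described in this section can be sketched in
   Python as follows.  This is only an illustration: the function name
   effective_charset and the DEFAULTS table are hypothetical, and the
   parameter parsing is deliberately simplified (it does not handle a
   quoted value that itself contains a semicolon).

```python
# Default charset for text/* when no charset parameter is present,
# as stated in the text: US-ASCII for MIME, ISO-8859-1 for HTTP.
DEFAULTS = {"mime": "us-ascii", "http": "iso-8859-1"}

def effective_charset(content_type: str, protocol: str = "http"):
    """Return (media_type, charset) for a Content-type header value.

    charset is None for non-text types that carry no charset parameter.
    """
    media_type, _, rest = content_type.partition(";")
    media_type = media_type.strip().lower()

    # Collect name=value parameters; names and charset labels are
    # compared case-insensitively, so both are lowercased here.
    params = {}
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            params[name.strip().lower()] = value.strip().strip('"').lower()

    if "charset" in params:
        return media_type, params["charset"]   # explicit label overrides the default
    if media_type.startswith("text/"):
        return media_type, DEFAULTS[protocol]  # protocol-level default for text/*
    return media_type, None

print(effective_charset("text/html"))
print(effective_charset("text/html", protocol="mime"))
print(effective_charset('text/html; charset="unicode-1-1-utf-7"'))
```

   With this rule a bare text/html received over HTTP is always read
   as ISO-8859-1, so a UTF-7 document has to carry the explicit
   parameter to be interpreted correctly.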
     
Fourth, regarding applications that can't handle parameters on media types:

   Fix them.  Let's face it folks, we can't allow broken implementations
   to limit the extensibility of the protocol.  Browsers that cannot parse
   parameters will never be able to handle character sets other than
   ISO-8859-1.  Nor will they be able to handle future versions of HTML
   (which will be indicated by a level parameter).  Thus, it makes perfect
   sense for them to have to treat the response as application/octet-stream
   (the default behavior) if the content-type is unknown.
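
   One reading of this fallback, sketched in Python.  The names
   KNOWN_TYPES, KNOWN_CHARSETS and handling_type are hypothetical, the
   capability sets are made up for the example, and treating an
   unsupported charset the same way as an unknown type is an assumption
   of the sketch rather than something the text spells out.

```python
# Hypothetical capabilities of a small client.
KNOWN_TYPES = {"text/html", "text/plain", "image/gif"}
KNOWN_CHARSETS = {"iso-8859-1", "us-ascii"}

FALLBACK = "application/octet-stream"

def handling_type(content_type: str) -> str:
    """Return the media type the client should act on for this header value."""
    media_type, _, rest = content_type.partition(";")
    media_type = media_type.strip().lower()

    # Look for a charset parameter (simplified parsing, no escaped quotes).
    charset = None
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if sep and name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()

    if media_type not in KNOWN_TYPES:
        return FALLBACK      # unknown content type
    if charset is not None and charset not in KNOWN_CHARSETS:
        return FALLBACK      # known type, but a charset the client cannot read
    return media_type

print(handling_type("text/html"))
print(handling_type('text/html; charset="unicode-1-1-utf-7"'))
```

   The point of the fallback is that the client saves or hands off the
   bytes instead of displaying them under the wrong code table.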

Finally, regarding an Accept-charset or Accept-parameter header:

   I believe it is a mistake to continue loading down the request syntax
   with content negotiation information which is pointless 99.9999999%
   of the time.  On my system, a complete accept-charset listing would
   add an additional 361 bytes to every request, and that's with a small
   set of X fonts.  The likelihood of me ever needing that information
   is nil.  To be useful, there would need to be a substantial set of
   documents that are available in multiple character sets.

   In the rare case where parallel sets of documents exist, it makes more
   sense to provide a means for automated redirection via URCs and allow
   the client to determine which is the "best" of available options.
   A non-automated equivalent is the "click here for our Unicode version".
   Sure, it's ugly, but nowhere near as ugly as 2 million clients trying
   to ask ahead-of-time for every possible format and every possible
   character set and every possible language accepted by the client for
   the billion documents which are only available in a single
   format/language/character set.
   
   Having said that, I do expect that an Accept-charset header will appear
   in HTTP/1.1.  However, it will be used as all Accept-* headers should
   be used -- only when the client wants to specifically restrict the
   result to a set of options different from the default (as is the case for
   in-line images today).

     Accept-charset: UNICODE-1-1-UTF-8, iso-8859-*

   would indicate that only UNICODE-1-1-UTF-8 and the iso-8859 set are
   allowed as a response to this request.
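
   A server-side check for such a header might look like the Python
   sketch below.  The function name charset_acceptable is hypothetical,
   and two things are assumptions of the sketch: that the trailing "*"
   in the example behaves like a shell-style wildcard, and that an
   absent or empty header means no restriction.

```python
from fnmatch import fnmatchcase

def charset_acceptable(accept_charset: str, charset: str) -> bool:
    """Check a charset label against a comma-separated Accept-charset value."""
    # Charset labels are compared case-insensitively, so lowercase both sides.
    patterns = [p.strip().lower() for p in accept_charset.split(",") if p.strip()]
    if not patterns:
        return True  # assumption: no header value means no restriction
    label = charset.strip().lower()
    return any(fnmatchcase(label, pattern) for pattern in patterns)

accept = "UNICODE-1-1-UTF-8, iso-8859-*"
for candidate in ("ISO-8859-5", "unicode-1-1-utf-8", "ISO-2022-JP"):
    print(candidate, charset_acceptable(accept, candidate))
```

   Under these assumptions ISO-8859-5 and UNICODE-1-1-UTF-8 pass the
   check, while ISO-2022-JP is rejected for this request.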

   BTW, Accept-parameter is not useful; charset is the only parameter
   shared by multiple media types.  We could invent some new parameters,
   but that only makes the problem worse.  Also, a quality attribute is
   only useful if we add it to the content-negotiation algorithm --
   something I would like to avoid (like the plague that it has become).


......Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                                     <fielding@ics.uci.edu>
                     <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>

Received on Wednesday, 11 January 1995 01:16:42 UTC