- From: Roy T. Fielding <fielding@avron.ICS.UCI.EDU>
- Date: Wed, 11 Jan 1995 01:12:35 -0800
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
I have several things to say regarding the character set issue which reflect the reality of Web technology and how it applies to the HTTP standards process.

First, regarding standards:

As Dan said, the "official" standard will always lag behind existing practice -- that is by design. Although some of the drafts we produce will include things that have not yet been implemented, they will not be submitted for RFC consideration until we have at least two independent, working implementations of everything that is contained in the final draft. Note that this is also why we work on several versions of the protocol at the same time -- good ideas that are not implemented get shoved off to a later version.

Second, regarding character sets:

A character set defines the table of codes used to associate small groups of bits within a document to their individual semantics (in most cases, a common symbol to be manipulated and/or displayed). It does not define the format of the overall document, nor does it have any effect on the language(s) used within the document (other than the incidental one that some languages cannot be represented using the symbols defined by some character sets). Some interesting discussion of languages and character set issues can be found in the Internet-Draft <draft-ietf-mailext-lang-char-00.txt>.

Character set names for use in Internet protocols are registered with IANA and listed in STD 2 (RFC 1700). This is what MIME uses, and what HTTP will use. The current list includes:

    US-ASCII
    ISO-8859-1
    ISO-8859-2
    ISO-8859-3
    ISO-8859-4
    ISO-8859-5
    ISO-8859-6
    ISO-8859-7
    ISO-8859-8
    ISO-8859-9

Note that if your favorite character set is not listed in the above, someone needs to get off their duff and have it registered by IANA. There are many others that have been listed in related RFCs, e.g.
    UNICODE-1-1
    UNICODE-1-1-UTF-8
    UNICODE-1-1-UTF-7
    ISO-2022-JP
    ISO-2022-JP-2
    ISO-2022-KR

There is also a separate list of character set names in STD 2 that is not yet used by MIME. These are names approved for Internet documentation (but not necessarily Internet protocols). The registry is at <ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets>.

Note that none of the above is specific to HTTP, nor do we have any intention of making it specific to HTTP.

Third, regarding media types:

A media type is an association between a (usually) large number of bits and a document format. Most document formats have a default character set which defines how the bits are grouped into meaningful symbols. Some media types allow the character set to be defined within the document itself. Some media types which do not have such a capability (including those called text/* by MIME) are provided with a charset="" parameter such that the same media type can be used with character sets other than the default. The IANA registry is at: <ftp://ftp.isi.edu/in-notes/iana/assignments/media-types>.

The character set is ALWAYS a feature of the media type, regardless of whether or not the charset parameter is present. MIME defines the default charset of all text/* types as US-ASCII. HTTP defines the default charset of all text/* types as ISO-8859-1. In both cases, the default can be overridden by including a charset="" parameter.

Sending a document containing bits intended to encode UTF-7 characters with the header

    Content-type: text/html

is ridiculous. If you want to send a UTF-7 encoded document, it must be sent with (case-insensitive)

    Content-type: text/html; charset="unicode-1-1-utf-7"

Like character sets, this is not an issue for discussion under HTTP. HTTP simply uses what has already been defined for other Internet protocols.
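
To make the defaulting rule concrete, here is a minimal Python sketch of how a receiver might work out the effective charset from a Content-type value. The function name effective_charset and its protocol argument are hypothetical, chosen only for illustration. The parsing is deliberately simplified: it splits on ";" and does not handle a semicolon inside a quoted parameter value, so treat it as a sketch rather than a conforming parser.

```python
# Default charset for text/* types, as stated in the message:
# MIME uses US-ASCII, HTTP uses ISO-8859-1.
TEXT_DEFAULTS = {"http": "iso-8859-1", "mime": "us-ascii"}


def effective_charset(content_type, protocol="http"):
    """Return (media_type, charset) for a Content-type header value.

    Hypothetical helper: charset is None when no charset parameter is
    given and the media type is not text/*.
    """
    # Simplification: assumes no ';' appears inside a quoted value.
    parts = [p.strip() for p in content_type.split(";")]
    media_type = parts[0].lower()

    charset = None
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        # Parameter names and charset names are compared case-insensitively.
        if sep and name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower() or None

    # An explicit charset parameter overrides the text/* default.
    if charset is None and media_type.startswith("text/"):
        charset = TEXT_DEFAULTS[protocol]
    return media_type, charset
```

With this sketch, a bare text/html value falls back to the protocol default, while a value carrying charset="unicode-1-1-utf-7" reports that charset regardless of the default.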

Changing the default for text/* was a touchy issue, but that reflects the reality of Web technology being better than SMTP and has little effect on accessing older protocols through HTTP. Changing the default to "implementation dependent" would make the specification as broken as these clients.

Fourth, regarding applications that can't handle parameters on media types:

Fix them.

Let's face it folks, we can't allow broken implementations to limit the extensibility of the protocol. Browsers that cannot parse parameters will never be able to handle character sets other than ISO-8859-1. Nor will they be able to handle future versions of HTML (which will be indicated by a level parameter). Thus, it makes perfect sense for them to have to treat the response as application/octet-stream (the default behavior) if the content-type is unknown.

Finally, regarding an Accept-charset or Accept-parameter header:

I believe it is a mistake to continue loading down the request syntax with content negotiation information which is pointless 99.9999999% of the time. On my system, a complete accept-charset listing would add an additional 361 bytes to every request, and that's with a small set of X fonts. The likelihood of me ever needing that information is nil.

To be useful, there would need to be a substantial set of documents that are available in multiple character sets. In the rare case where parallel sets of documents exist, it makes more sense to provide a means for automated redirection via URCs and allow the client to determine which is the "best" of available options. A non-automated equivalent is the "click here for our Unicode version". Sure, it's ugly, but nowhere near as ugly as 2 million clients trying to ask ahead-of-time for every possible format and every possible character set and every possible language accepted by the client for the billion documents which are only available in a single format/language/character set.

Having said that, I do expect that an Accept-charset header will appear in HTTP/1.1. However, it will be used as all Accept-* headers should be used -- only when the client wants to specifically restrict the result to a set of options different than the default (as is the case for in-line images today).

    Accept-charset: UNICODE-1-1-UTF-8, iso-8859-*

would indicate that only UNICODE-1-1-UTF-8 and the iso-8859 set are allowed as a response to this request.

BTW, Accept-parameter is not useful; charset is the only parameter shared by multiple media types. We could invent some new parameters, but that only makes the problem worse. Also, a quality attribute is only useful if we add it to the content-negotiation algorithm -- something I would like to avoid (like the plague that it has become).

......Roy Fielding
      ICS Grad Student, University of California, Irvine USA
      <fielding@ics.uci.edu>
      <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>

Received on Wednesday, 11 January 1995 01:16:42 UTC