- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Mon, 8 Apr 1996 17:15:49 PDT
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
I have the issue 'CHARSET'. After the IAB character set workshop, it's clear that the entire situation with regard to character sets in IETF standards needs clarification, and that some of that work will happen. My goal here is to keep HTTP from getting tangled up in the mess, by avoiding controversial terminology or assertions. Section 3.4 defines the phrase "character set" and quotes a definition from (some) MIME document. On the other hand, the current document rarely uses the phrase "character set", instead using the more technical term "charset" in most places. To simplify things, I make the following proposal. If you don't like this proposal, let me know what you think we should do instead. I propose changing section 3.4 to be > 3.4 Character Sets and charset > HTTP uses the concept of a "charset" as defined in MIME[ref] to > designate a method by which a a sequence of logical characters might > be encoded as a sequence of octets and transmitted on the network, and > subsequently decoded into a sequence of characters by the recipient. > In MIME documents, the phrase "character set" is used informally to > mean the same thing as a "charset"; while for simple character > mappings such as US-ASCII this is reasonable, for more complex > methods, it is not. > IANA maintains a registry of charset names. This standard establishes > a profile of the IANA registry which consists, for each registered > charset whether it is intended for use in HTTP and a preferred-case > independent token for use in HTTP. > The following tokens are the initial preferred names to be used within > HTTP. This list includes those registered by RFC 1521 [7] -- the > US-ASCII [21] and ISO-8859 [22] character sets -- and other names > specifically recommended for use within MIME charset parameters. > charset = "US-ASCII" > | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3" > | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6" > | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9" > | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR" > | "UNICODE-1-1-UTF-8" [** These tokens don't seem to correspond to the tokens used by most of the multilingual servers in the world today!!! Why don't we fix them to match? EUC, Shift-JIS, Big5, GB... It was my impression that we might get IANA to assign a canonical and get out of this buisness. Why can't we do this? **] [** I note that there was a change suggested that 'the character set of an entity body should be labeled as the lowest common denominator of the character codes used within that body, with the exception that no label is preferred over the labels US-ASCII or ISO-8859-1', and object to this wording as confusing and unnecessary, and wonder why it was put in. **] Subsequent to this, there is a requirement to search for the phrase 'character set' and consider its replacement. In the revised 3.6.1 Canonicalization and Text Defaults, it borrows the phrase "character set" but then uses it inconsistently; as this was a hotly contested bit of wording, though, I think we can use 3.6.1 alone. In 10.2 "Accept-Charset", I suggest changing < The Accept-Charset request-header can be used to indicate what < character sets are acceptable for the response. This field allows < clients capable of understanding more comprehensive or special-purpose < character sets to signal that capability to a server which is capable < of representing documents in those character sets. to > The Accept-Charset request-header can be used to indicate which > values are acceptable as a charset parameter in any text media type > response. This request-header allows clients to indicate their > willingness to deal with text in alternate encodings." Move "The ISO-8859-1 character set can be assumed to be acceptable to all user agents" to the first sentence of the last paragraph, using "charset" instead of "character set". Change "Character set values are described in Section 3.4" to "Charset values are described in Section 3.4" and Scanning ahead for "character set", you will find one more instance in the description of "qc"; change the phrase "character set" to "charset". That's it.
Received on Monday, 8 April 1996 17:21:25 UTC