Issue: CHARSET, character sets and charset from Larry Masinter on 1996-04-09 (ietf-http-wg@w3.org from April to June 1996)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Mon, 8 Apr 1996 17:15:49 PDT
To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <96Apr8.171604pdt.2764@golden.parc.xerox.com>
I have the issue 'CHARSET'. After the IAB character set workshop, it's
clear that the entire situation with regard to character sets in IETF
standards needs clarification, and that some of that work will happen.
My goal here is to keep HTTP from getting tangled up in the mess, by
avoiding controversial terminology or assertions.

Section 3.4 defines the phrase "character set" and quotes a definition
from (some) MIME document. On the other hand, the current document
rarely uses the phrase "character set", instead using the more
technical term "charset" in most places.

To simplify things, I make the following proposal. If you don't like
this proposal, let me know what you think we should do instead.

I propose changing section 3.4 to be


> 3.4 Character Sets and charset

> HTTP uses the concept of a "charset" as defined in MIME[ref] to
> designate a method by which a a sequence of logical characters might
> be encoded as a sequence of octets and transmitted on the network, and
> subsequently decoded into a sequence of characters by the recipient.

> In MIME documents, the phrase "character set" is used informally to
> mean the same thing as a "charset"; while for simple character
> mappings such as US-ASCII this is reasonable, for more complex
> methods, it is not.

> IANA maintains a registry of charset names. This standard establishes
> a profile of the IANA registry which consists, for each registered
> charset whether it is intended for use in HTTP and a preferred-case
> independent token for use in HTTP.

> The following tokens are the initial preferred names to be used within
> HTTP. This list includes those registered by RFC 1521 [7] -- the
> US-ASCII [21] and ISO-8859 [22] character sets -- and other names
> specifically recommended for use within MIME charset parameters.

>   charset = "US-ASCII"
>	| "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
>	| "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
>	| "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"	
>       | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
>       | "UNICODE-1-1-UTF-8"


[** These tokens don't seem to correspond to the tokens used by most
of the multilingual servers in the world today!!! Why don't we fix
them to match? EUC, Shift-JIS, Big5, GB...  It was my impression that
we might get IANA to assign a canonical and get out of this buisness.
Why can't we do this? **]

[** I note that there was a change suggested that 'the character set
of an entity body should be labeled as the lowest common denominator
of the character codes used within that body, with the exception that
no label is preferred over the labels US-ASCII or ISO-8859-1', and
object to this wording as confusing and unnecessary, and wonder why it
was put in. **]

Subsequent to this, there is a requirement to search for the phrase
'character set' and consider its replacement.  In the revised 3.6.1
Canonicalization and Text Defaults, it borrows the phrase "character
set" but then uses it inconsistently;  as this was a hotly contested
bit of wording, though, I think we can use 3.6.1 alone.

In 10.2 "Accept-Charset", I suggest changing

< The Accept-Charset request-header can be used to indicate what
< character sets are acceptable for the response. This field allows
< clients capable of understanding more comprehensive or special-purpose
< character sets to signal that capability to a server which is capable
< of representing documents in those character sets.

to

> The Accept-Charset request-header can be used to indicate which
> values are acceptable as a charset parameter in any text media type
> response. This request-header allows clients to indicate their
> willingness to deal with text in alternate encodings." 

Move "The ISO-8859-1 character set can be assumed to be acceptable
to all user agents" to the first sentence of the last paragraph, using
"charset" instead of "character set".

Change "Character set values are described in Section 3.4" to
"Charset values are described in Section 3.4" and 

Scanning ahead for "character set", you will find one more instance in
the description of "qc"; change the phrase "character set" to
"charset".

That's it.
Received on Monday, 8 April 1996 17:21:25 UTC