Re: Unregistered charset values in HTTP 1.1, the ISO-8859-* values from Olle Jarnefors on 1996-07-10 (ietf-http-wg@w3.org from July to September 1996)

From: Olle Jarnefors <ojarnef@admin.kth.se>
Date: Wed, 10 Jul 96 10:51:53 +0200
To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, iesg@ietf.org
Cc: Olle Jarnefors <ojarnef@admin.kth.se>
Message-Id: <9607100851.AA08802@othello.admin.kth.se>
Larry Masinter <masinter@parc.xerox.com> wrote:

> > 3.4 Character Sets
> 
> > Although HTTP allows an arbitrary token to be used as a charset value,
> > any token that has a predefined value within the IANA Character Set
> > registry MUST represent the character set defined by that registry.
> 
> and Olle presented a reading of this that makes it seem like HTTP is
> permitting unregistered charset values. Frankly, I don't think anyone
> cares much what this says; it was the case that the MIME documents
> were in flux. Clearly, using non-standard charset values is
> non-standard.

I wouldn't go that far. Certainly for experimental
purposes or within a certain community a character set
that still hasn't been registered should be possible to
use. But in that case the use of a charset value starting
with "x-" (and therefore guaranteed to never be
registered by IANA) should be required by the HTTP 1.1
specification, as it always has been in MIME
specifications.

As an example, among users of the Sami languages (also
called Lappish) in Northern Norway, Sweden and Finland, a
new 8-bit character set, similar to ISO 8859-1 but
including also the letters used in Sami that 8859-1
lacks, was recently defined and it is now tried out in
email and on the WWW. "charset=x-saami" is used to
identify this character set, which is not yet registered
with IANA. Though the designers of this new character set
hopes it will eventually become part 16 of ISO 8859, I
don't think the HTTP specification should allow the use
of "charset=iso-8859-16" for this character set, or any
other, until "iso-8859-16" has been registered as a
charset value by IANA.

I suggest that the quote from the draft above is
amended by this text:

+ Unregistrered character sets may be used for
+ experimental purposes or according to the convention of
+ a certain user community, but a value starting with
+ "X-" or "x-" must then be used. Such values are
+ guaranteed never to be registered by IANA.

> However, most of the comments here do not change the HTTP
> specification, but only the recommendations for what should happen to
> the IANA registration. As such, I would assume that the charset
> registration issues could merged with the consideration of
> draft-freed-charset-reg-00.txt as BCP, while the HTTP/1.1 spec could
> be processed.

Yes, this should not be allowed to delay the processing
of HTTP 1.1 further.

> The intent was not to establish a monopoly on "preferred MIME names",
> and not to reinterpret any of the charset registrations, but just to
> pick the short names that are currently in use in HTTP.

The HTTP 1.1 draft talks about "preferred MIME names",
the charset registration draft requires "a single primary
name" if multiple names for the same character set are
registered. The current registry has one "Name:" field
for each character set, and zero or more "Alias:" fields.

I suppose that preferred MIME name = primary name = IANA-name.
In that case the changes section 19.8.1 in the HTTP draft
asks for are:
- register EUC-JP as the primary name for
  EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE
- make US-ASCII the primary name instead of ANSI_X3.4-1968
- make ISO-8859-1 the primary name instead of ISO_8859-1:1987
- make ISO-8859-2 the primary name instead of ISO_8859-2:1987
- make ISO-8859-3 the primary name instead of ISO_8859-3:1988
- make ISO-8859-4 the primary name instead of ISO_8859-4:1988
- make ISO-8859-5 the primary name instead of ISO_8859-5:1988
- make ISO-8859-6 the primary name instead of ISO_8859-6:1987
- make ISO-8859-7 the primary name instead of ISO_8859-7:1987
- make ISO-8859-8 the primary name instead of ISO_8859-8:1988
- make ISO-8859-9 the primary name instead of ISO_8859-9:1989

These changes would be improvements, in my opinion.

In the following cases nothing has to be done, because
these values are already the primary names of registered
character sets:
- ISO-2022-JP
- ISO-2022-JP-2
- ISO-2022-KR
- SHIFT_JIS
- GB231
- EUC-KR
- BIG5
- KOI-8R

/Olle

-- 
Olle Jarnefors, Royal Institute of Technology (KTH) <ojarnef@admin.kth.se>
Received on Wednesday, 10 July 1996 01:59:05 UTC