Unregistered charset values in HTTP 1.1, the ISO-8859-* values from Olle Jarnefors on 1996-07-08 (ietf-http-wg@w3.org from July to September 1996)

From: Olle Jarnefors <ojarnef@admin.kth.se>
Date: Mon, 8 Jul 96 19:46:05 +0200
To: iesg@ietf.org
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, Olle Jarnefors <ojarnef@admin.kth.se>
Message-Id: <9607081746.AA19709@mercutio.admin.kth.se>

Two more charset-related issues in the HTTP 1.1 draft:


1)  Unregistered charset values

In draft-ietf-http-v11-spec-05.txt it is said:

> 3.4 Character Sets

> Although HTTP allows an arbitrary token to be used as a charset value,
> any token that has a predefined value within the IANA Character Set
> registry MUST represent the character set defined by that registry.

My reading of this is that, in HTTP 1.1, any charset
value not registered with IANA may be used, by private
agreement, for a non-registered character set. In my
view this is an unnecessary deviation from the
long-established MIME rule that private charset values
must start with "x-". According to
draft-ietf-822ext-mime-imt-05.txt:

: No character set name other than those defined above may be
: used in Internet mail without the publication of a formal
: specification and its registration with IANA, or by private
: agreement, in which case the character set name must begin
: with "X-".
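
To illustrate the MIME rule quoted above, here is a minimal sketch in Python. The REGISTERED set is a hypothetical excerpt standing in for the IANA registry, and the function name and return labels are assumptions made for this example only; neither draft defines them.

```python
# Hypothetical excerpt of the IANA character set registry (assumption:
# the real registry holds more than 200 entries).
REGISTERED = {"us-ascii", "iso-8859-1", "iso-8859-2", "koi8-r"}


def classify_charset(value: str) -> str:
    """Classify a charset value under the MIME rule quoted above."""
    name = value.strip().lower()  # charset names are compared case-insensitively
    if name in REGISTERED:
        return "registered"
    if name.startswith("x-"):
        return "private"  # allowed by private agreement
    return "invalid"  # unregistered and lacking the "x-" prefix
```

Under the HTTP 1.1 draft wording, the last case would presumably be accepted as an arbitrary token, which is the deviation described above.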


2) Preferred MIME names for ISO-8859 character sets

The HTTP 1.1 draft doesn't explicitly recommend the use
of any particular subset of the more than 200 character
sets registered with IANA. I suppose that this decision
(which I don't believe is wise) is the result of earlier
WG discussions, which I unfortunately have missed.

However, the draft seems to indirectly give preference to
some of the more popular or well-defined character sets
in this section:

> 19.8.1 Charset Registry
> 
> The following names should be added to the IANA character set registry
> under the category "Preferred MIME name" and this section deleted.
> 
>        "US-ASCII"
>        | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
>        | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
>        | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
>        | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
>        | "SHIFT_JIS" | "EUC-KR" | "GB2312" | "BIG5" | "KOI8-R"
> 
> Please also add the following new alias as the "preferred MIME name":
> 
>        "EUC-JP" for "EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE"
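
A quick way to see which parts of ISO 8859 the quoted list covers is sketched below in Python. The set is copied from section 19.8.1 as quoted; the function name and the default upper bound of 10 parts are assumptions made for this example.

```python
# Names proposed as "preferred MIME names" in section 19.8.1 as quoted,
# including the new alias "EUC-JP".
PREFERRED = {
    "US-ASCII",
    "ISO-8859-1", "ISO-8859-2", "ISO-8859-3",
    "ISO-8859-4", "ISO-8859-5", "ISO-8859-6",
    "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
    "ISO-2022-JP", "ISO-2022-JP-2", "ISO-2022-KR",
    "SHIFT_JIS", "EUC-KR", "GB2312", "BIG5", "KOI8-R",
    "EUC-JP",
}


def missing_iso_8859_parts(names, highest_part=10):
    """Return the ISO 8859 part numbers (1..highest_part) absent from names."""
    present = {n.upper() for n in names}
    return [p for p in range(1, highest_part + 1)
            if f"ISO-8859-{p}" not in present]
```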

There are some problematic cases in this list:

a) "ISO-8859-10" is left out, though this part of ISO 8859
   was adopted as early as 1992.

b) It's unclear what charset registration the preferred
   MIME name "GB2312" shall designate: the ISO-registered
   character set GB_2312-80 (MIBenum: 57) or the only
   incompletely described GB2312 (MIBenum: 2025), which
   of course already has the proposed preferred MIME name
   as its principal name. It is also possible that these
   two registrations actually refer to exactly the same
   character set, and should be merged in the IANA registry.

c) The meanings of the two charset values ISO_8859-6:1987
   (ISO-8859-6) and ISO_8859-8:1988 (ISO-8859-8) are not
   clear. The ISO standards are silent on the question of
   in which order the right-to-left characters of these
   standards should be coded in a string.

   RFC 1556 explains that three different interpretations
   are possible: visual order, explicit logical order,
   and implicit logical order. It proposes that the
   values "ISO-8859-6" and "ISO-8859-8" shall be used for
   text coded in visual order, that "ISO-8859-6-E" and
   "ISO-8859-8-E" shall be used for text coded in
   explicit order, and "ISO-8859-6-I" and "ISO-8859-8-I"
   for text coded in implicit order. The MIME draft 
   draft-ietf-822ext-mime-imt-05.txt follows RFC 1556
   regarding the meanings of "ISO-8859-6" and
   "ISO-8859-8".

   Current practice is unfortunately more complex, as is
   clear from a recent message to the Hebrew-oriented
   mailing list ILAN-H, enclosed at the end of this
   message.

   For Arabic, visual order and explicit logical order
   are seldom used. Normally, implicit logical order is
   used, and the charset label then is "ISO-8859-6", not
   "ISO-8859-6-I".

   In Hebrew, on the other hand, "ISO-8859-8" usually has
   the meaning of visual order, while the more popular
   implicit order (as defined by the Unicode bidi
   algorithm) is indicated by the "ISO-8859-8-I" charset
   value.

   It should also be noted that the current draft for
   internationalization of HTML, draft-ietf-html-i18n-04.txt,
   only specifies implicit order for bi-directional text.

   I would recommend that current practice rather than
   RFC theory is followed. If the indirect way of
   favouring certain character sets for HTTP - to let
   IANA assign "preferred MIME names" only for a few of
   all the registered charset values - is followed,
   "ISO-8859-6" should be retained as such a name, but
   "ISO-8859-8" should be replaced by "ISO-8859-8-I".
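
The difference between the RFC 1556 labels and current practice, as described in item c), can be summarized in a small Python sketch. The table and function names are assumptions made for this example; the directionality values are taken only from the text above.

```python
# Directionality assigned to each charset value by RFC 1556,
# as summarized in item c).
RFC_1556 = {
    "iso-8859-6": "visual",
    "iso-8859-6-e": "explicit",
    "iso-8859-6-i": "implicit",
    "iso-8859-8": "visual",
    "iso-8859-8-e": "explicit",
    "iso-8859-8-i": "implicit",
}

# Current practice as described above: Arabic text labelled "iso-8859-6"
# is normally in implicit order; the Hebrew labels match RFC 1556.
CURRENT_PRACTICE = {**RFC_1556, "iso-8859-6": "implicit"}


def directionality(charset, table=CURRENT_PRACTICE):
    """Return the directionality for a charset value, or None if not listed."""
    return table.get(charset.strip().lower())
```

Only the "iso-8859-6" entry differs between the two tables, which is why the recommendation keeps "ISO-8859-6" but replaces "ISO-8859-8" with "ISO-8859-8-I" as the preferred name for the commonly used implicit order.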

/Olle

-- 
Olle Jarnefors, Royal Institute of Technology (KTH) <ojarnef@admin.kth.se>

Included message from the ILAN-H mailing list:

Date: Fri, 5 Jul 1996 21:04:43 -0500 (CDT)
From: Alexandre Khalil <iskandar@ee.tamu.edu>
Reply-To: Alexandre Khalil <iskandar@ee.tamu.edu>
To: ILAN-H  Discussion in and about Hebrew in the network
 <ILAN-H@taunivm.tau.ac.il>
Cc: Arabic script mailing list <reader@leb.net>,
 "ITISALAT: IT IS Arabic Language And Technology."
 <itisalat@listserv.georgetown.edu>
Subject: Re: RFC 1556
In-Reply-To: <ILAN-H%96070415340381@VM.TAU.AC.IL>
Message-Id: <Pine.GSO.3.93.960705204813.3148r-100000@ee.tamu.edu>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-ilan-h@VM.TAU.AC.IL

On Thu, 4 Jul 1996, Hank Nussbacher wrote:

>On Thu, 4 Jul 1996 14:21:52 +0200 you said:

>>Uri Bruck wrote:

[...]

>>> Second, at least for ISO-8859-6, visual directionality is rarely, if ever,
>>> used, and ISO-8859-6 is taken to mean Arabic in implicit directionality.

>>I assume you are talking about the use of "iso-8859-6" as
>>a charset parameter in e-mail according to MIME here.
>>Does also a de-facto standard for directionality in the
>>encoding of Hebrew text with charset value "iso-8859-8"
>>exist? Is it the same as for "iso-8859-6"?

>>If this is so, I would suggest that the revised RFC 1556
>>should document the de facto standards, and introduce new
>>values "iso-8859-6v" and "iso-8859-8v" for visual
>>directionality.

[...]

>Incidentally, iso-8859-8 == iso-8859-8-v so no need for a visual
>charset.  After knocking this subject around for a few more weeks,
>would anyone like to volunteer to amend RFC1556?  I am a bit busy
>and will only do it/get to it if no one else steps forward (probably
>during August).

  In summary

        iso-8859-8 == iso-8859-8-v          as Hank said
and
        iso-8859-6 == iso-8859-6-i          as Uri pointed out


  Also, visual and explicit encoding for iso-8859-6 have practically not
been in use, and with the appearance of better multi-script software such
as AccentSoft's and Alis's, which support UTF-8 and its explicit directional
tagging, these along with iso-8859-e might never see significant usage.

  Shouldn't we promote Unicode in its various avatars or is there still a
need to finetune 8 bit encodings?

alex
Received on Monday, 8 July 1996 10:52:07 UTC