- From: Olle Jarnefors <ojarnef@admin.kth.se>
- Date: Mon, 8 Jul 96 19:46:05 +0200
- To: iesg@ietf.org
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, Olle Jarnefors <ojarnef@admin.kth.se>
Two more charset-related issues in the HTTP 1.1 draft:

1) Unregistered charset values

draft-ietf-http-v11-spec-05.txt says:

> 3.4 Character Sets

> Although HTTP allows an arbitrary token to be used as a charset value,
> any token that has a predefined value within the IANA Character Set
> registry MUST represent the character set defined by that registry.

My reading of this is that, in HTTP 1.1, any charset value not
registered with IANA can be used, by private agreement, for a
non-registered character set.

In my view this is an unnecessary deviation from the long-established
MIME rule that private charset values must start with "x-". According
to draft-ietf-822ext-mime-imt-05.txt:

: No character set name other than those defined above may be
: used in Internet mail without the publication of a formal
: specification and its registration with IANA, or by private
: agreement, in which case the character set name must begin
: with "X-".

2) Preferred MIME names for ISO-8859 character sets

The HTTP 1.1 draft doesn't explicitly recommend the use of any
particular subset of the more than 200 character sets registered
with IANA. I suppose that this decision (which I don't believe is
wise) is the result of earlier WG discussions, which I unfortunately
have missed.

However, the draft seems to indirectly give preference to some of
the more popular or well-defined character sets in this section:

> 19.8.1 Charset Registry
>
> The following names should be added to the IANA character set registry
> under the category "Preferred MIME name" and this section deleted.
>
>      "US-ASCII"
>      | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
>      | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
>      | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
>      | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
>      | "SHIFT_JIS" | "EUC-KR" | "GB2312" | "BIG5" | "KOI8-R"
>
> Please also add the following new alias as the "preferred MIME name":
>
>      "EUC-JP" for "EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE"

There are some problematic cases in this list:

a) "ISO-8859-10" is left out, though this part of ISO 8859 was
   adopted as early as 1992.

b) It's unclear which charset registration the preferred MIME name
   "GB2312" shall designate: the ISO-registered character set
   GB_2312-80 (MIBenum: 57) or the only incompletely described
   GB2312 (MIBenum: 2025), which of course already has the proposed
   preferred MIME name as its principal name. It is also possible
   that these two registrations actually refer to exactly the same
   character set, and should be merged in the IANA registry.

c) The meanings of the two charset values ISO_8859-6:1987
   (ISO-8859-6) and ISO_8859-8:1988 (ISO-8859-8) are not clear.

   The ISO standards are silent on the order in which the
   right-to-left characters of these standards should be coded in
   a string. RFC 1556 explains that three different interpretations
   are possible: visual order, explicit logical order, and implicit
   logical order. It proposes that the values "ISO-8859-6" and
   "ISO-8859-8" shall be used for text coded in visual order, that
   "ISO-8859-6-E" and "ISO-8859-8-E" shall be used for text coded in
   explicit order, and "ISO-8859-6-I" and "ISO-8859-8-I" for text
   coded in implicit order.

   The MIME draft draft-ietf-822ext-mime-imt-05.txt follows RFC 1556
   regarding the meanings of "ISO-8859-6" and "ISO-8859-8".

   Current practice is more complex, unfortunately, as is clear
   from a recent message to the Hebrew-oriented mailing list ILAN-H,
   enclosed at the end of this message. For Arabic, visual order and
   explicit logical order are seldom used. Normally, implicit logical
   order is used, and the charset label then is "ISO-8859-6", not
   "ISO-8859-6-I". In Hebrew, on the other hand, "ISO-8859-8" usually
   has the meaning of visual order, while the more popular implicit
   order (as defined by the Unicode bidi algorithm) is indicated by
   the "ISO-8859-8-I" charset value.

   It should also be noted that the current draft for
   internationalization of HTML, draft-ietf-html-i18n-04.txt, only
   specifies implicit order for bi-directional text.

   I would recommend that current practice rather than RFC theory
   is followed. If the indirect way of favouring certain character
   sets for HTTP - to let IANA assign "preferred MIME names" only
   for a few of all the registered charset values - is followed,
   "ISO-8859-6" should be retained as such a name, but "ISO-8859-8"
   should be replaced by "ISO-8859-8-I".

/Olle

--
Olle Jarnefors, Royal Institute of Technology (KTH) <ojarnef@admin.kth.se>

Included message from the ILAN-H mailing list:

Date: Fri, 5 Jul 1996 21:04:43 -0500 (CDT)
From: Alexandre Khalil <iskandar@ee.tamu.edu>
Reply-To: Alexandre Khalil <iskandar@ee.tamu.edu>
To: ILAN-H Discussion in and about Hebrew in the network <ILAN-H@taunivm.tau.ac.il>
Cc: Arabic script mailing list <reader@leb.net>,
    "ITISALAT: IT IS Arabic Language And Technology." <itisalat@listserv.georgetown.edu>
Subject: Re: RFC 1556
In-Reply-To: <ILAN-H%96070415340381@VM.TAU.AC.IL>
Message-Id: <Pine.GSO.3.93.960705204813.3148r-100000@ee.tamu.edu>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-ilan-h@VM.TAU.AC.IL

On Thu, 4 Jul 1996, Hank Nussbacher wrote:

>On Thu, 4 Jul 1996 14:21:52 +0200 you said:
>>Uri Bruck wrote:
[...]
>>> Second, at least for ISO-8859-6, visual directionality is rarely, if ever,
>>> used, and ISO-8859-6 is taken to mean Arabic in implicit directionality.

>>I assume you are talking about the use of "iso-8859-6" as
>>a charset parameter in e-mail according to MIME here.
>>Does also a de-facto standard for directionality in the
>>encoding of Hebrew text with charset value "iso-8859-8"
>>exist? Is it the same as for "iso-8859-6"?

>>If this is so, I would suggest that the revised RFC 1556
>>should document the de facto standards, and introduce new
>>values "iso-8859-6v" and "iso-8859-8v" for visual
>>directionality.
[...]
>Incidentally, iso-8859-8 == iso-8859-8-v so no need for a visual
>charset. After knocking this subject around for a few more weeks,
>would anyone like to volunteer to amend RFC1556? I am a bit busy
>and will only do it/get to it if no one else steps forward (probably
>during August).

In summary

    iso-8859-8 == iso-8859-8-v as Hank said
and
    iso-8859-6 == iso-8859-6-i as Uri pointed out

Also, visual and explicit encoding for iso-8859-6 have practically
not been in use and with the appearance of better multi-script
software such AccentSoft's and Alis that supports UTF/8 and its
explicit directional tagging, these along with iso-8859-e might never
see significant usage.

Shouldn't we promote Unicode in its various avatars or is there still
a need to finetune 8 bit encodings?

alex
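
As a rough illustration of the two issues raised in this message, the
following Python sketch contrasts the RFC 1556 reading of the
bidirectional charset labels with the current practice reported above,
and checks the MIME "X-" rule for private charset names. The table and
function names (RFC_1556, CURRENT_PRACTICE, directionality,
is_private_label) are hypothetical and chosen only for this example;
they do not come from any RFC, draft, or library. Labels are compared
case-insensitively, as MIME charset names are.

```python
# Illustrative sketch only: names below are hypothetical, not from any spec.

# Directionality assigned to each label by RFC 1556, as summarized above.
RFC_1556 = {
    "iso-8859-6": "visual",
    "iso-8859-6-e": "explicit",
    "iso-8859-6-i": "implicit",
    "iso-8859-8": "visual",
    "iso-8859-8-e": "explicit",
    "iso-8859-8-i": "implicit",
}

# Current practice as reported in the message: the bare Arabic label
# "iso-8859-6" is taken to mean implicit order; the Hebrew labels keep
# their RFC 1556 meanings.
CURRENT_PRACTICE = {**RFC_1556, "iso-8859-6": "implicit"}


def directionality(label, table=CURRENT_PRACTICE):
    """Return the directionality for a charset label, or None if the
    label is not one of the bidirectional labels in the table."""
    return table.get(label.strip().lower())


def is_private_label(label):
    """True if the label follows the MIME rule quoted above that
    privately agreed charset names begin with "X-"."""
    return label.strip().lower().startswith("x-")
```

Under these assumptions, directionality("ISO-8859-6") and
directionality("ISO-8859-6", RFC_1556) give different answers, which is
exactly the ambiguity described in point c).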
Received on Monday, 8 July 1996 10:52:07 UTC