- From: Michael Sokolov <msokolov@ivan.Harhan.ORG>
- Date: Sat, 05 Apr 2003 16:27:07 -0800 (PST)
- To: ietf-charsets@iana.org
Charset gurus,

There is a long-standing problem with one IANA-registered Latin/Cyrillic charset known as ISO-IR-111 or ECMA-Cyrillic. This charset is the European variant of the popular Soviet standard KOI-8.

KOI-8 is a popular 8-bit Latin/Cyrillic charset whose low half is US-ASCII or ISO_646.irv:1983 (depending on your preference of the dollar sign or the international currency symbol at code point 44 octal), whereas the high half has Russian letters at code points 300 through 377 octal in the so-called KOI correspondence order (an order such that if bit 7 of KOI-8 Russian text is stripped a case-reversed English transliteration is produced).

The general term "KOI-8" (not registered with IANA) means the above, but says nothing about code points 200 through 277 octal. Systems based on ISO standards generally interpret octets 200-237 octal as C1 high controls per ISO 6429. Octets 240-277 octal are supposed to be graphic (GR) characters in the ISO world, but the general term "KOI-8" leaves them undefined.

The charset registered in the ISO International Register under No. 111 (ECMA-Cyrillic) was ECMA's version of KOI-8 with Russian letter IO and Belorussian, Ukrainian, Serbocroatian, and Macedonian characters assigned to code points 240-277 octal which are left undefined by the general term "KOI-8".

A scanned image of the official (paper) registration document defining ISO-IR-111 can be found in:

http://www.itscj.ipsj.or.jp/ISO-IR/111.pdf

Examination of the above document reveals that the charset registered in ISO-IR under No. 111 is indeed as described above, a KOI-8 variant with the Russian letters in the KOI correspondence order.

The problem is that the current IANA character-sets document lists RFC 1345 as the primary reference for this charset, and the description of this charset in RFC 1345 is seriously in error.
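
The KOI correspondence order mentioned above can be illustrated with a minimal Python sketch. It rests on an assumption: it uses Python's `koi8_r` codec as a stand-in, because KOI8-R places the Russian letters at the same code points 300-377 octal as the other KOI-8 variants, and no ISO-IR-111 codec is assumed to be available. The helper name `strip_high_bit` is hypothetical.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every octet, and decode as ASCII."""
    # Assumption: koi8_r stands in for KOI-8; only Russian letters are passed in.
    koi_bytes = text.encode("koi8_r")
    return bytes(b & 0x7F for b in koi_bytes).decode("ascii")


# Lowercase Russian comes out as uppercase Latin and vice versa,
# which is the "case-reversed transliteration" effect.
print(strip_high_bit("Привет"))  # pRIWET
print(strip_high_bit("москва"))  # MOSKWA
```
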
RFC 1345 lists the upper characters of ISO-IR-111 in a completely wrong order, effectively defining a totally different charset (a mix between ISO_8859-5:1988 and windows-1251 no less!).

The only Internet document describing charset ISO-IR-111 is the erroneous RFC 1345, and the IANA character-sets document, while acknowledging the ISO-IR registry as the original source, still lists RFC 1345 as reference with no warning about it being in error. It is therefore certain that of the people implementing Internet charset handling software few have had any reason to look at the ISO-IR registration document, and most have instead logically assumed that RFC 1345 had the correct definition of ISO-IR-111. As a result it is certain that a great quantity of Internet software in use today interprets charset names "ECMA-Cyrillic" and "ISO-IR-111" as meaning the mix of ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the charset registered in ISO-IR under No. 111.

This situation creates a problem for people wishing to use the charset registered in ISO-IR under No. 111 on the Internet. While ISO_8859-5:1988 is the current international standard (the current Russian Federation GOST standard is similar) and places the Russian letters in their native alphabetic order, the older KOI-8 standard is still popular in many environments. The people's love of KOI-8 no matter what the current standards say is the reason why most of the Internet today uses KOI8-R charset (RFC 1489) for Russian text.

However, KOI8-R has a feature making it unsuitable for some environments. Specifically, KOI8-R defines code points 200-237 octal as graphic characters, and such use of these code points cannot be correctly handled by terminal equipment (e.g. DEC VT300 terminal series) and text processing software (e.g. the terminal drivers and text editors in some versions of UNIX) designed for the ISO world in which these code points are ISO 6429 control characters.
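
The clash over code points 200-237 octal can be checked with a short Python sketch. It is only an illustration under stated assumptions: Python's `iso8859_5` codec stands in for a charset from "the ISO world" (it decodes these octets to the C1 control characters), and the helper name `control_octets` is hypothetical.

```python
import unicodedata

C1_RANGE = range(0o200, 0o240)  # octets 200-237 octal


def control_octets(codec: str) -> list[int]:
    """Return the octets in C1_RANGE that the codec decodes to control characters."""
    return [
        b
        for b in C1_RANGE
        # Unicode general category "Cc" marks control characters.
        if unicodedata.category(bytes([b]).decode(codec)) == "Cc"
    ]


print(len(control_octets("iso8859_5")))  # 32: every octet is a C1 control
print(len(control_octets("koi8_r")))     # 0: KOI8-R makes them all graphic
```
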
People using such equipment and software and wishing to use KOI-8 must use a version of KOI-8 other than KOI8-R. Such people naturally want a charset with code points 0-177 octal matching US-ASCII or ISO_646.irv:1983, code points 200-237 octal being C1 controls of ISO 6429, and 300-377 octal being Russian letters in KOI correspondence order.

What should be at code points 240-277 octal? In practice people who just want KOI-8 don't really care, but since it usually feels better to assign a rarely used code point to something rather than leave it completely undefined, since a handy assignment of these code points exists in ISO-IR-111, and since those extra characters may come in useful to some people, ISO-IR-111 is naturally the KOI-8 variant of choice for the people in circumstances described above.

This is the motivation behind the desire to use ISO-IR-111 instead of ISO_8859-5:1988 or KOI8-R. However, in applications involving Internet protocols the problem arises of how to label the use of this charset given the current confused status of its IANA registration.

To mend this problem I ask IANA to take the following corrective actions:

1. Amend the character-sets document to not list RFC 1345 as a reference for charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only reference and add a note indicating that RFC 1345 is in error.

2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111. The reason for doing so is that so many people have been misled for so long into thinking that ECMA-Cyrillic aka ISO-IR-111 is the mix of ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the KOI-8 variant designed by ECMA and defined in the ISO-IR registration document that the people wishing to use the latter charset naturally want a different name for it.
I believe that it is best for the name to explicitly contain "KOI8" or "KOI-8" in it, and KOI8-E (for ECMA, European, or extended) is the name used by Roman Czyborra in his superb Cyrillic Charset Soup page:

http://czyborra.com/charsets/cyrillic.html

Thanks for reading and TIA for acting,
MS
Received on Saturday, 5 April 2003 19:31:26 UTC