- From: Paul Hoffman / IMC <phoffman@imc.org>
- Date: Tue, 06 Jan 2004 09:44:10 -0800
- To: msokolov@ivan.Harhan.ORG (Michael Sokolov), ietf-charsets@iana.org
Wearing my charset reviewer hat: After reading the discussion, I see that this is probably a real issue. However, it is hard to judge whether there would be agreement here without a concrete proposal. Could you (Michael) take the current registration from the IANA site, modify it to say what you think is more accurate, and post the proposed new registration *and* a list of differences from old to new on the mailing list? If there is general agreement with your specific new registration, I'll submit it to IANA. --Paul Hoffman At 4:27 PM -0800 4/5/03, Michael Sokolov wrote: >Charset gurus, > >There is a long-standing problem with one IANA-registered Latin/Cyrillic >charset known as ISO-IR-111 or ECMA-Cyrillic. This charset is the European >variant of the popular Soviet standard KOI-8. KOI-8 is a popular 8-bit Latin/ >Cyrillic charset whose low half is US-ASCII or ISO_646.irv:1983 (depending on >your preference of the dollar sign or the international currency >symbol at code >point 44 octal), whereas the high half has Russian letters at code points 300 >through 377 octal in the so-called KOI correspondence order (an >order such that >if bit 7 of KOI-8 Russian text is stripped a case-reversed English >transliteration is produced). The general term "KOI-8" (not registered with >IANA) means the above, but says nothing about code points 200 through 277 >octal. Systems based on ISO standards generally interpret octets 200-237 octal >as C1 high controls per ISO 6429. Octets 240-277 octal are supposed to be >graphic (GR) characters in the ISO world, but the general term "KOI-8" leaves >them undefined. > >The charset registered in the ISO International Register under No. 111 >(ECMA-Cyrillic) was ECMA's version of KOI-8 with Russian letter IO and >Belorussian, Ukrainian, Serbocroatian, and Macedonian characters assigned to >code points 240-277 octal which are left undefined by the general >term "KOI-8". >A scanned image of the official (paper) registration document defining >ISO-IR-111 can be found in: > >http://www.itscj.ipsj.or.jp/ISO-IR/111.pdf > >Examination of the above document reveals that the charset >registered in ISO-IR >under No. 111 is indeed as described above, a KOI-8 variant with the Russian >letters in the KOI correspondence order. The problem is that the current IANA >character-sets document lists RFC 1345 as the primary reference for this >charset, and the description of this charset in RFC 1345 is >seriously in error. >RFC 1345 lists the upper characters of ISO-IR-111 in a completely wrong order, >effectively defining a totally different charset (a mix between >ISO_8859-5:1988 >and windows-1251 no less!). > >Since the only Internet document describing charset ISO-IR-111 is >the erroneous >RFC 1345 and since while acknowledging the ISO-IR registry as the original >source the IANA character-sets document still lists RFC 1345 as reference with >no warning about it being in error, it is certain that of the people >implementing Internet charset handling software few have had any >reason to look >at the ISO-IR registration document and most have instead logically assumed >that RFC 1345 had the correct definition of ISO-IR-111. As a result it is >certain that a great quantity of Internet software in use today interprets >charset names "ECMA-Cyrillic" and "ISO-IR-111" as meaning the mix of >ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the charset >registered in ISO-IR under No. 111. > >This situation creates a problems for people wishing to use the charset >registered in ISO-IR under No. 111 on the Internet. While ISO_8859-5:1988 is >the current international standard (the current Russian Federation GOST >standard is similar) and places the Russian letters in their native alphabetic >order, the older KOI-8 standard is still popular in many environments. The >people's love of KOI-8 no matter what the current standards say is the reason >why most of the Internet today uses KOI8-R charset (RFC 1489) for >Russian text. > >However, KOI8-R has a feature making it unsuitable for some environments. >Specifically, KOI8-R defines code points 200-237 octal as graphic characters, >and such use of these code points cannot be correctly handled by terminal >equipment (e.g. DEC VT300 terminal series) and text processing software (e.g. >the terminal drivers and text editors in some versions of UNIX) designed for >the ISO world in which these code points are ISO 6429 control characters. >People using such equipment and software and wishing to use KOI-8 must use a >version of KOI-8 other than KOI8-R. Such people naturally want a charset with >code points 0-177 octal matching US-ASCII or ISO_646.irv:1983, code points >200-237 octal being C1 controls of ISO 6429, and 300-377 octal being Russian >letters in KOI correspondence order. > >What should be at code points 240-277 octal? In practice people who just want >KOI-8 don't really care, but since it usually feels better to assign a rarely >used code point to something rather than leave it completely >undefined, since a >handy assignment of these code points exists in ISO-IR-111, and since those >extra characters may come useful to some people, ISO-IR-111 is naturally the >KOI-8 variant of choice for the people in circumstances described above. > >This is the motivation behind the desire to use ISO-IR-111 instead of >ISO_8859-5:1988 or KOI8-R. However, in applications involving Internet >protocols the problem arises of how to label the use of this charset given the >current confused status of its IANA registration. To mend this problem I ask >IANA to take the following corrective actions: > >1. Amend the character-sets document to not list RFC 1345 as a reference for >charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only >reference and add a note indicating that RFC 1345 is in error. > >2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111. The >reason for doing so is that so many people have been misled for so long into >thinking that ECMA-Cyrillic aka ISO-IR-111 is the mix of ISO_8859-5:1988 and >windows-1251 defined in RFC 1345 rather than the KOI-8 variant >designed by ECMA >and defined in the ISO-IR registration document that the people wishing to use >the latter charset naturally want a different name for it. I believe >that it is >best for the name to explicitly contain "KOI8" or "KOI-8" in it, and KOI8-E >(for ECMA, European, or extended) is the name used by Roman Czyborra in his >superb Cyrillic Charset Soup page: > >http://czyborra.com/charsets/cyrillic.html > >Thanks for reading and TIA for acting, > >MS --Paul Hoffman, Director --Internet Mail Consortium
Received on Tuesday, 6 January 2004 12:44:20 UTC