Re: ECMA-cyrillic alias iso-ir-111 sore from Paul Hoffman / IMC on 2004-01-06 (ietf-charsets@w3.org from January to March 2004)

From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Tue, 06 Jan 2004 09:44:10 -0800
To: msokolov@ivan.Harhan.ORG (Michael Sokolov), ietf-charsets@iana.org
Message-id: <p0602046bbc20a3ef6103@[63.202.92.157]>
Wearing my charset reviewer hat:

After reading the discussion, I see that this is probably a real 
issue. However, it is hard to judge whether there would be agreement 
here without a concrete proposal. Could you (Michael) take the 
current registration from the IANA site, modify it to say what you 
think is more accurate, and post the proposed new registration *and* 
a list of differences from old to new on the mailing list? If there 
is general agreement with your specific new registration, I'll submit 
it to IANA.

--Paul Hoffman

At 4:27 PM -0800 4/5/03, Michael Sokolov wrote:
>Charset gurus,
>
>There is a long-standing problem with one IANA-registered Latin/Cyrillic
>charset known as ISO-IR-111 or ECMA-Cyrillic. This charset is the European
>variant of the popular Soviet standard KOI-8. KOI-8 is a popular 8-bit Latin/
>Cyrillic charset whose low half is US-ASCII or ISO_646.irv:1983 (depending on
>your preference of the dollar sign or the international currency symbol at code
>point 44 octal), whereas the high half has Russian letters at code points 300
>through 377 octal in the so-called KOI correspondence order (an order such that
>if bit 7 of KOI-8 Russian text is stripped a case-reversed English
>transliteration is produced). The general term "KOI-8" (not registered with
>IANA) means the above, but says nothing about code points 200 through 277
>octal. Systems based on ISO standards generally interpret octets 200-237 octal
>as C1 high controls per ISO 6429. Octets 240-277 octal are supposed to be
>graphic (GR) characters in the ISO world, but the general term "KOI-8" leaves
>them undefined.
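
The correspondence property described above can be checked with a minimal sketch. It assumes Python's built-in koi8_r codec as a stand-in, since KOI8-R places the Russian letters at 300-377 octal in the same KOI order; the helper name strip_high_bit is hypothetical and the sample words are illustrative only.

```python
def strip_high_bit(text: str) -> str:
    """Encode Russian text as KOI8-R, clear the top bit of every octet,
    and read the result back as ASCII."""
    koi8_octets = text.encode("koi8_r")
    # Clearing the top bit maps 300-377 octal onto 100-177 octal.
    stripped = bytes(octet & 0o177 for octet in koi8_octets)
    return stripped.decode("ascii")


if __name__ == "__main__":
    for word in ("Привет", "Москва"):
        print(word, "->", strip_high_bit(word))
```

Uppercase Cyrillic comes out as lowercase Latin and vice versa, which is the "case-reversed" part of the description.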
>
>The charset registered in the ISO International Register under No. 111
>(ECMA-Cyrillic) was ECMA's version of KOI-8 with Russian letter IO and
>Belorussian, Ukrainian, Serbocroatian, and Macedonian characters assigned to
>code points 240-277 octal, which are left undefined by the general term "KOI-8".
>A scanned image of the official (paper) registration document defining
>ISO-IR-111 can be found in:
>
>http://www.itscj.ipsj.or.jp/ISO-IR/111.pdf
>
>Examination of the above document reveals that the charset registered in ISO-IR
>under No. 111 is indeed as described above, a KOI-8 variant with the Russian
>letters in the KOI correspondence order. The problem is that the current IANA
>character-sets document lists RFC 1345 as the primary reference for this
>charset, and the description of this charset in RFC 1345 is seriously in error.
>RFC 1345 lists the upper characters of ISO-IR-111 in a completely wrong order,
>effectively defining a totally different charset (a mix between ISO_8859-5:1988
>and windows-1251 no less!).
>
>Since the only Internet document describing charset ISO-IR-111 is the erroneous
>RFC 1345 and since while acknowledging the ISO-IR registry as the original
>source the IANA character-sets document still lists RFC 1345 as reference with
>no warning about it being in error, it is certain that of the people
>implementing Internet charset handling software few have had any reason to look
>at the ISO-IR registration document and most have instead logically assumed
>that RFC 1345 had the correct definition of ISO-IR-111. As a result it is
>certain that a great quantity of Internet software in use today interprets
>charset names "ECMA-Cyrillic" and "ISO-IR-111" as meaning the mix of
>ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the charset
>registered in ISO-IR under No. 111.
>
>This situation creates a problem for people wishing to use the charset
>registered in ISO-IR under No. 111 on the Internet. While ISO_8859-5:1988 is
>the current international standard (the current Russian Federation GOST
>standard is similar) and places the Russian letters in their native alphabetic
>order, the older KOI-8 standard is still popular in many environments. The
>people's love of KOI-8 no matter what the current standards say is the reason
>why most of the Internet today uses KOI8-R charset (RFC 1489) for Russian text.
>
>However, KOI8-R has a feature making it unsuitable for some environments.
>Specifically, KOI8-R defines code points 200-237 octal as graphic characters,
>and such use of these code points cannot be correctly handled by terminal
>equipment (e.g. DEC VT300 terminal series) and text processing software (e.g.
>the terminal drivers and text editors in some versions of UNIX) designed for
>the ISO world in which these code points are ISO 6429 control characters.
>People using such equipment and software and wishing to use KOI-8 must use a
>version of KOI-8 other than KOI8-R. Such people naturally want a charset with
>code points 0-177 octal matching US-ASCII or ISO_646.irv:1983, code points
>200-237 octal being C1 controls of ISO 6429, and 300-377 octal being Russian
>letters in KOI correspondence order.
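
The conflict over code points 200-237 octal can be illustrated with a short sketch. It assumes Python's koi8_r codec for the KOI8-R reading and uses latin_1 purely as a stand-in for the ISO-style reading, because latin_1 maps those octets to the C1 control code points; the region helper and its labels are hypothetical names that summarize the layout requested above.

```python
import unicodedata


def region(octet: int) -> str:
    """Name the part of the desired 8-bit layout an octet falls in."""
    if octet <= 0o177:
        return "US-ASCII or ISO_646.irv:1983"
    if octet <= 0o237:
        return "C1 controls (ISO 6429)"
    if octet <= 0o277:
        return "extra graphic characters"
    return "Russian letters in KOI order"


# The 32 octets that KOI8-R and the ISO world disagree about.
contested = bytes(range(0o200, 0o240))

# KOI8-R reading: every octet decodes to a printable (non-control) character.
koi8r_is_graphic = all(
    unicodedata.category(ch) != "Cc" for ch in contested.decode("koi8_r")
)

# ISO-style reading (latin_1 stand-in): every octet is a control character.
iso_is_control = all(
    unicodedata.category(ch) == "Cc" for ch in contested.decode("latin_1")
)

if __name__ == "__main__":
    print(region(0o220), koi8r_is_graphic, iso_is_control)
```

A terminal or editor built on the second reading would treat KOI8-R's graphic characters in this range as control codes, which is the incompatibility described above.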
>
>What should be at code points 240-277 octal? In practice people who just want
>KOI-8 don't really care, but since it usually feels better to assign a rarely
>used code point to something rather than leave it completely undefined, since a
>handy assignment of these code points exists in ISO-IR-111, and since those
>extra characters may come in useful to some people, ISO-IR-111 is naturally the
>KOI-8 variant of choice for the people in circumstances described above.
>
>This is the motivation behind the desire to use ISO-IR-111 instead of
>ISO_8859-5:1988 or KOI8-R. However, in applications involving Internet
>protocols the problem arises of how to label the use of this charset given the
>current confused status of its IANA registration. To mend this problem I ask
>IANA to take the following corrective actions:
>
>1. Amend the character-sets document to not list RFC 1345 as a reference for
>charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only
>reference and add a note indicating that RFC 1345 is in error.
>
>2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111. The
>reason for doing so is that so many people have been misled for so long into
>thinking that ECMA-Cyrillic aka ISO-IR-111 is the mix of ISO_8859-5:1988 and
>windows-1251 defined in RFC 1345 rather than the KOI-8 variant designed by ECMA
>and defined in the ISO-IR registration document that the people wishing to use
>the latter charset naturally want a different name for it. I believe that it is
>best for the name to explicitly contain "KOI8" or "KOI-8" in it, and KOI8-E
>(for ECMA, European, or extended) is the name used by Roman Czyborra in his
>superb Cyrillic Charset Soup page:
>
>http://czyborra.com/charsets/cyrillic.html
>
>Thanks for reading and TIA for acting,
>
>MS


--Paul Hoffman, Director
--Internet Mail Consortium
Received on Tuesday, 6 January 2004 12:44:20 UTC