Re: ECMA-cyrillic alias iso-ir-111 sore from Martin Duerst on 2003-04-07 (ietf-charsets@w3.org from April to June 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 07 Apr 2003 13:26:26 -0400
To: msokolov@ivan.Harhan.ORG (Michael Sokolov), ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20030407131141.026fcd50@localhost>
Hello Michael,

Some more comments below. The most important comment:

It would help everybody if you resent your mail with the proposals
at the beginning of the mail, and the justification afterwards
(i.e. US English order :-).


At 16:27 03/04/05 -0800, Michael Sokolov wrote:
>Charset gurus,
>
>There is a long-standing problem with one IANA-registered Latin/Cyrillic
>charset known as ISO-IR-111 or ECMA-Cyrillic. This charset is the European
>variant of the popular Soviet standard KOI-8. KOI-8 is a popular 8-bit Latin/
>Cyrillic charset whose low half is US-ASCII or ISO_646.irv:1983 (depending on
>your preference of the dollar sign or the international currency symbol at 
>code
>point 44 octal), whereas the high half has Russian letters at code points 300
>through 377 octal in the so-called KOI correspondence order (an order such 
>that
>if bit 7 of KOI-8 Russian text is stripped a case-reversed English
>transliteration is produced). The general term "KOI-8" (not registered with
>IANA) means the above, but says nothing about code points 200 through 277
>octal. Systems based on ISO standards generally interpret octets 200-237 octal
>as C1 high controls per ISO 6429. Octets 240-277 octal are supposed to be
>graphic (GR) characters in the ISO world, but the general term "KOI-8" leaves
>them undefined.
>
>The charset registered in the ISO International Register under No. 111
>(ECMA-Cyrillic) was ECMA's version of KOI-8 with Russian letter IO and
>Belorussian, Ukrainian, Serbocroatian, and Macedonian characters assigned to
>code points 240-277 octal which are left undefined by the general term 
>"KOI-8".
>A scanned image of the official (paper) registration document defining
>ISO-IR-111 can be found in:
>
>http://www.itscj.ipsj.or.jp/ISO-IR/111.pdf

Please note that this clearly says "Right hand part of the Cyrillic
Alphabet". While this is really strange (the Cyrillic alphabet doesn't
have hands), it intends to say that it defines only the right part
(i.e. hex 0x80-0xFF) of some actual encoding.


>Examination of the above document reveals that the charset registered in 
>ISO-IR
>under No. 111 is indeed as described above, a KOI-8 variant with the Russian
>letters in the KOI correspondence order. The problem is that the current IANA
>character-sets document lists RFC 1345 as the primary reference for this
>charset, and the description of this charset in RFC 1345 is seriously in 
>error.
>RFC 1345 lists the upper characters of ISO-IR-111 in a completely wrong order,
>effectively defining a totally different charset (a mix between 
>ISO_8859-5:1988
>and windows-1251 no less!).

RFC 1345 contains many other cases where only part of an actual encoding
is identified. It is unclear what these registrations (with labels mostly
of the form ISO-IR-foo) are actually standing for. There is a widespread
opinion that they as such should not be in the registry at all.


>Since the only Internet document describing charset ISO-IR-111 is the 
>erroneous
>RFC 1345 and since while acknowledging the ISO-IR registry as the original
>source the IANA character-sets document still lists RFC 1345 as reference with
>no warning about it being in error, it is certain that of the people
>implementing Internet charset handling software few have had any reason to 
>look
>at the ISO-IR registration document and most have instead logically assumed
>that RFC 1345 had the correct definition of ISO-IR-111. As a result it is
>certain that a great quantity of Internet software in use today interprets
>charset names "ECMA-Cyrillic" and "ISO-IR-111" as meaning the mix of
>ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the charset
>registered in ISO-IR under No. 111.

It is difficult to assert 'great quantity'. What would be helpful is to
have at least one example each of:
- Software implementing ISO-IR-111 according to the official document
- Software implementing ISO-IR-111 according to RFC 1345

Do you know such examples?


>This situation creates a problems for people wishing to use the charset
>registered in ISO-IR under No. 111 on the Internet. While ISO_8859-5:1988 is
>the current international standard (the current Russian Federation GOST
>standard is similar) and places the Russian letters in their native alphabetic
>order, the older KOI-8 standard is still popular in many environments. The
>people's love of KOI-8 no matter what the current standards say is the reason
>why most of the Internet today uses KOI8-R charset (RFC 1489) for Russian 
>text.
>
>However, KOI8-R has a feature making it unsuitable for some environments.
>Specifically, KOI8-R defines code points 200-237 octal as graphic characters,
>and such use of these code points cannot be correctly handled by terminal
>equipment (e.g. DEC VT300 terminal series) and text processing software (e.g.
>the terminal drivers and text editors in some versions of UNIX) designed for
>the ISO world in which these code points are ISO 6429 control characters.
>People using such equipment and software and wishing to use KOI-8 must use a
>version of KOI-8 other than KOI8-R. Such people naturally want a charset with
>code points 0-177 octal matching US-ASCII or ISO_646.irv:1983, code points
>200-237 octal being C1 controls of ISO 6429, and 300-377 octal being Russian
>letters in KOI correspondence order.
>
>What should be at code points 240-277 octal? In practice people who just want
>KOI-8 don't really care, but since it usually feels better to assign a rarely
>used code point to something rather than leave it completely undefined, 
>since a
>handy assignment of these code points exists in ISO-IR-111, and since those
>extra characters may come useful to some people, ISO-IR-111 is naturally the
>KOI-8 variant of choice for the people in circumstances described above.
>
>This is the motivation behind the desire to use ISO-IR-111 instead of
>ISO_8859-5:1988 or KOI8-R. However, in applications involving Internet
>protocols the problem arises of how to label the use of this charset given the
>current confused status of its IANA registration. To mend this problem I ask
>IANA to take the following corrective actions:
>
>1. Amend the character-sets document to not list RFC 1345 as a reference for
>charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only
>reference and add a note indicating that RFC 1345 is in error.
>
>2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111. The
>reason for doing so is that so many people have been misled for so long into
>thinking that ECMA-Cyrillic aka ISO-IR-111 is the mix of ISO_8859-5:1988 and
>windows-1251 defined in RFC 1345 rather than the KOI-8 variant designed by 
>ECMA
>and defined in the ISO-IR registration document that the people wishing to use
>the latter charset naturally want a different name for it.

But just defining another alias doesn't solve the problem of differing
implementations. If we want to clear up things completely, a new registration
would be much better.



>I believe that it is
>best for the name to explicitly contain "KOI8" or "KOI-8" in it, and KOI8-E
>(for ECMA, European, or extended) is the name used by Roman Czyborra in his
>superb Cyrillic Charset Soup page:
>
>http://czyborra.com/charsets/cyrillic.html
>
>Thanks for reading and TIA for acting,
>
>MS


Regards,   Martin.
Received on Monday, 7 April 2003 14:09:59 UTC