- From: Michael Sokolov <msokolov@ivan.Harhan.ORG>
- Date: Mon, 07 Apr 2003 23:10:13 -0700 (PDT)
- To: ietf-charsets@iana.org
Martin Duerst <duerst@w3.org> wrote:

> It would help everybody if you resent your mail with the proposals
> at the beginning of the mail, and the justification afterwards
> (i.e. US English order :-).

OK, here are my two proposals again:

1. Amend the character-sets document to not list RFC 1345 as a reference for charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only reference and add a note indicating that RFC 1345 is in error.

2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111.

The reasons are in my original post and more below.

> Please note that this clearly says "Right hand part of the Cyrillic
> Alphabet". While this is really strange (the Cyrillic alphabet doesn't
> have hands), it intends to say that it defines only the right part
> (i.e. hex 0x80-0xFF) of some actual encoding.

Code points 0x00-0x7F (or 0-177 octal) coincide with US-ASCII. The ISO 2022 model defines ALL charsets by halves.

> RFC 1345 contains many other cases where only part of an actual encoding
> is identified.

I think you've missed my point. The discrepancy I'm talking about is not whether the low US-ASCII half is spelled out or silently implied. It's the meat, the right Cyrillic part, that is listed completely incorrectly in RFC 1345. The actual charset registered with ISO-IR under No. 111 has lowercase Cyrillic letters in ranges 240-257 and 300-337 octal and uppercase ones in 260-277 and 340-377 octal; RFC 1345 lists them the other way around. The actual charset has Russian letters in KOI correspondence order; RFC 1345 lists them in alphabetical order. The actual charset has the Balkan DJE and GJE before the Russian IO; RFC 1345 lists them the other way around. This is the problem I'm talking about.

I don't see any problem with the "part of an actual encoding" issue: it makes absolutely no difference whether the low US-ASCII half is spelled out ad nauseam for every 8-bit charset or simply referenced once as the default.
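
To make the case-region discrepancy concrete, here is a minimal Python sketch that uses only the octal ranges quoted above. The names `LOWER_RANGES`, `UPPER_RANGES` and `case_region` are made up for illustration and come from no library. It labels regions only: it does not claim that every code inside a region is a letter, and it does not model the letter-ordering differences.

```python
# Octal ranges as quoted in the message; all names here are illustrative.
LOWER_RANGES = ((0o240, 0o257), (0o300, 0o337))
UPPER_RANGES = ((0o260, 0o277), (0o340, 0o377))


def case_region(byte, rfc1345=False):
    """Label the region an 8-bit value falls in.

    With rfc1345=True the two case regions are swapped, which is how the
    message says RFC 1345 lists them. This labels regions only; it does
    not assert that every code in a region is a letter.
    """
    if not 0 <= byte <= 0o377:
        raise ValueError("not an 8-bit value")
    if byte <= 0o177:
        return "ascii"  # low half coincides with US-ASCII
    if rfc1345:
        lower, upper = UPPER_RANGES, LOWER_RANGES
    else:
        lower, upper = LOWER_RANGES, UPPER_RANGES
    if any(lo <= byte <= hi for lo, hi in lower):
        return "lower"
    if any(lo <= byte <= hi for lo, hi in upper):
        return "upper"
    return "other"  # 0o200-0o237, not discussed in the message


if __name__ == "__main__":
    for b in (0o101, 0o241, 0o300, 0o340):
        print(oct(b), case_region(b), case_region(b, rfc1345=True))
```

Any byte in the right half that one reading labels lowercase is labeled uppercase by the other, which is why text encoded under one definition comes out case-inverted (and otherwise scrambled) when decoded under the other.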
> It is unclear what these registrations (with labels mostly
> of the form ISO-IR-foo) are actually standing for.

No, except for the completely busted 111 I'd say the rest are perfectly fine and clear.

(Actually there is one more black eye in the Cyrillic charset arena, but there it was the [inter]national standards bodies themselves that goofed, not the Internet folks. The charset registered with ISO-IR under No. 153 is labeled GOST_19768-74 but is actually GOST_19768-87. 19768-74 was the original KOI-8 standard. But here I'm not blaming IANA or Keld Simonsen or IETF or whoever, as it was either GOST or ISO clerks that goofed here: the 153 registration document says GOST 19768-74 on it, even though it clearly defines -87 and not -74. Moan.)

> It is difficult to assert 'great quantity'.

OK, the "great quantity" was a logical guess on my part. But I just did a tiny bit of actual research:

> What would be helpful is to
> have at least one example each of:
> - Software implementing ISO-IR-111 according to the official document

GNU recode 3.5.

> - Software implementing ISO-IR-111 according to RFC 1345

GNU recode 3.4.

> But just defining another alias doesn't solve the problem of differing
> implementations.

Well, if the new alias is published simultaneously with the note in the official character-sets document explaining what the correct charset really is, all implementations knowing the new alias would necessarily be new ones that implement the charset correctly. Old implementations would not recognize the new name at all. (If someone takes the trouble of adding the new name to old software, s/he will necessarily notice the correction in the charset definition and hopefully not produce a program that interprets the new name as meaning the bogus RFC 1345 definition.)

But perhaps an even more important reason for registering the name KOI8-E as an alias for ECMA-Cyrillic is that it's much more descriptive.
Assume for the moment that in a given system the recognition of charsets is left up to the human user, as with a user manually looking at Content-Type: headers in a non-MIME mailer. (Or the software implements the charset incorrectly per RFC 1345 and is forced into manual mode by using an alias it doesn't recognize.) When seen by a human user familiar with charset basics but not with the full ugly story, the name "ECMA-Cyrillic" produces a kneejerk reaction "what's that?", while the kneejerk reaction to "KOI8-E" would be "ahh, it's another KOI-8 variant". See the difference? I would much rather get the latter reaction. With that reaction the silly mistake of RFC 1345 would probably never have happened in the first place; it certainly resulted from the former reaction.

> If we want to clear up things completely, a new registration
> would be much better.

It would be fine with me, but what about IANA? The charset registration procedure does not invent new charsets; it merely catalogs ones invented by others. So however it's registered with IANA, the actual charset (the normative reference) is the ISO-IR document. We have an IANA registration for this charset. A troubled one, but existing nonetheless. How can you have two independent IANA registrations for one actual charset (one normative reference)? Or actually you can, and it's called an alias. That's what I was getting at.

MS
Received on Tuesday, 8 April 2003 02:14:50 UTC