Re: character encoding ISO 2022 from Mark Reichert on 2002-03-13 (www-zig@w3.org from March 2002)

From: Mark Reichert <markr@sirs.com>
Date: Wed, 13 Mar 2002 08:33:37 -0500
To: "Z39.50 LISTSERV" <www-zig@w3.org>
Cc: "Pieter Van Lierop" <pvanlierop@geac.fr>
Message-ID: <012d01c1ca93$ae5cb8c0$c8aca2cd@MANPROMARKR>

ISO 2022 is an escaping mechanism for switching between character sets.
Most internationally defined "standard" character sets pre-dating ISO
10646/Unicode adhere to the rules ISO 2022 specifies for defining a
character set.  ISO 2022 is the default character encoding scheme for Z39.50
InternationalStrings (ASN.1 string types encoded in BER), in the absence of
character set negotiation.  ISO 2022 allows for use of any registered
character set and for privately defined character sets.  Character sets may
contain graphic characters or control characters.  Graphic character sets
may contain up to 94 or 96 characters or 94^n or 96^n characters where n is
the number of bytes per character.  Control character sets can contain 32
characters.  In the IANA charset world, ISO 2022 character sets are referred
to as ISO IR n where n is the registration number assigned to the character
set by ECMA, maintainer of the registry, though IANA couples the named
graphic character set with other numbered graphic and control character
sets.  In addition to a registry number, character sets also have one or
more bytes associated with them that identify the character set in an ISO
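2022 escape sequence.  An ISO 2022 escape sequence can be specified by the
regular expression [1B][20-2F]*[30-7E], where each bracketed value is a byte
or byte range in hexadecimal: ESCAPE, zero or more intermediate bytes, and
one final byte.  ISO 2022 also defines semantics for SHIFT IN (0F) and SHIFT
OUT (0E).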
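
As a rough illustration of the escape sequence pattern described above, here
is a minimal Python sketch that scans a byte string for it.  The names
ESCAPE_SEQUENCE and find_escape_sequences and the sample bytes are my own,
chosen for illustration; nothing here comes from the standard's text.

```python
import re

# ESC (1B), zero or more intermediate bytes (20-2F), one final byte (30-7E).
ESCAPE_SEQUENCE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def find_escape_sequences(data: bytes) -> list[bytes]:
    """Return every ISO 2022 escape sequence found in data, in order."""
    return ESCAPE_SEQUENCE.findall(data)


# Illustrative input: plain text with two escape sequences embedded in it.
sample = b"abc\x1b(Bdef\x1b$)Cghi"
for seq in find_escape_sequences(sample):
    print(seq.hex(" "))
```

This only locates the sequences.  Deciding what each one means (which
character set it loads into which register) requires the registry.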

ISO 2022 already defines a rich OID syntax for fully specifying character
set usage and encodings (all character sets in use, which registers are
initially being used, etc.).  Z39.50 chose to ignore these OIDs in its
character set negotiation definition, though having the information
structured as in Z39.50 character set negotiation is a little more
user-friendly than assigning all the correct semantics to the individual
components of ISO 2022 OIDs.  Assigning OIDs to all combinations of
character sets in ISO 2022 would generate a very, very long, if not
infinite, list of OIDs.

There are currently approximately 23 registered control character sets, 109
registered single-byte graphic character sets, and 18 multi-byte graphic
character sets (all double-byte).  The registry can be viewed at
http://www.itscj.ipsj.or.jp/ISO-IR/.  (Most "MARC" syntaxes also have
character sets based on ISO 2022.  USMARC/MARC21 is encoded using a mix of
two control character sets, single-byte graphic character sets, and the
triple-byte CJK/EACC graphic character set.  Most character sets are
privately defined by the MARC standard.  The MARC21 standard veers somewhat
from standard ISO 2022 in some of its escape sequence usage.  To be fair,
MARC21 references ANSI X3.41 not ISO 2022.)

What is typically referred to as US-ASCII is officially made up of two
character sets, one 32-character control set and one 94-character graphic
set.  The two remaining characters are SPACE (20) and DELETE (7F), which
stay in place whenever a 94-character graphic set is used in ISO 2022's GL
(graphic left register).
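
The arithmetic behind that breakdown can be checked with a short Python
sketch.  The constant names are my own labels for the byte ranges described
above, not names taken from any standard.

```python
# The 7-bit code table split the way US-ASCII is described above.
C0_CONTROLS = range(0x00, 0x20)  # 32-character control set
SPACE = 0x20                     # fixed position, not part of the graphic set
G0_GRAPHICS = range(0x21, 0x7F)  # 94-character graphic set (21-7E)
DELETE = 0x7F                    # fixed position, not part of the graphic set

total = len(C0_CONTROLS) + len(G0_GRAPHICS) + 2  # + SPACE and DELETE
print(len(C0_CONTROLS), len(G0_GRAPHICS), total)
```

The four pieces together account for all 128 positions of a 7-bit code.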

ISO 2022 also allows for registration of "coding systems" that don't adhere
to the "rules" of ISO 2022.  Some of these character sets can be switched to
with no standard way to return to ISO 2022; others have a standard return.
There are approximately 21 character sets of this type, many of them various
encodings of Unicode/ISO 10646.

The "rules" of ISO 2022 are way too complicated to explain briefly.  A free,
identical version of ISO 2022, published as ECMA-35, can be downloaded at
ftp://ftp.ecma.ch/ecma-st/Ecma-035.pdf.  Essentially ISO 2022 defines six
registers, four for graphic character sets (G0-G3), two for control
character sets (C0-C1).  In 8-bit encoding up to four registers can be used
simultaneously.  The four registers correspond to the byte ranges 00-1F,
20-7F, 80-9F, A0-FF.  These are named CL (control left), GL (graphic left),
CR (control right), GR (graphic right), respectively.  So in 8-bit encoding
two control character sets and two graphic character sets can be used
simultaneously.  ISO 2022 defines many escape sequences, the most important
of which load character sets into registers and shift between registers.
For 7-bit encoding, escape sequences are defined for shifting registers
accordingly so character encodings fall into the two byte ranges 00-1F,
20-7F.
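
The 8-bit byte ranges given above translate directly into a lookup.  This is
a minimal Python sketch; the function name code_area is hypothetical, and the
range boundaries are simply the ones listed in the paragraph above.

```python
def code_area(byte: int) -> str:
    """Map an 8-bit byte value to CL, GL, CR, or GR."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError(f"not an 8-bit value: {byte!r}")
    if byte <= 0x1F:
        return "CL"  # control left, 00-1F
    if byte <= 0x7F:
        return "GL"  # graphic left, 20-7F
    if byte <= 0x9F:
        return "CR"  # control right, 80-9F
    return "GR"      # graphic right, A0-FF


for b in (0x1B, 0x41, 0x85, 0xE9):
    print(f"{b:02X} -> {code_area(b)}")
```

Which character the byte actually represents still depends on which
character set is currently loaded into the corresponding register.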

In Z39.50 version 3 implementors are using GeneralString semantics for
InternationalString, in the absence of character set negotiation.  As such,
one is essentially starting in ASCII (ISO IR 1 in C0/CL, ISO IR 6 (or 2
depending on which year's ASN.1 and BER standards one is reading) in G0/GL).
GeneralString allows for escape sequences and use of all control and graphic
character sets.  Whether or not switching to other coding systems (like
UTF-8) is allowable in GeneralStrings depends on one's reading of ASN.1 and
BER, but really the standard says: All G (graphic) and all C (control) sets
+ SPACE + DELETE.  There is no mention of other coding systems.  Switching
character sets/loading other character sets is possible at any time using
the appropriate escape sequences.

In Z39.50 version 2, VisibleString semantics are in place for
InternationalString.  VisibleString allows for ISO IR 6 (or 2 depending on
which year's ASN.1 and BER standards one is reading) in G0/GL, and SPACE.
No control characters or escape sequences are allowed.
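
A version 2 implementation could check an InternationalString against those
VisibleString constraints along these lines.  This is only a sketch under
the assumption stated above (the 94 graphic characters at 21-7E in G0/GL,
plus SPACE); is_visible_string is a hypothetical helper name, not part of
any Z39.50 toolkit.

```python
def is_visible_string(data: bytes) -> bool:
    """True if every byte is SPACE (20) or a G0 graphic character (21-7E).

    Control characters (including ESCAPE), DELETE, and all 8-bit bytes
    are rejected.
    """
    return all(0x20 <= b <= 0x7E for b in data)


print(is_visible_string(b"Z39.50 search term"))  # printable 7-bit text only
print(is_visible_string(b"abc\x1b(Bdef"))        # contains an escape sequence
print(is_visible_string(b"caf\xe9"))             # contains an 8-bit byte
```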



----- Original Message -----
From: "Pieter Van Lierop" <pvanlierop@geac.fr>
To: "zig" <www-zig@w3.org>
Sent: Wednesday, March 13, 2002 5:28 AM
Subject: character encoding ISO 2022


> Please forgive my ignorance but what is ISO 2022 exactly?
>
> The choice in the character set negotiation is between:
> ISO2022
> ISO10646
> Private
>
> ISO2022, as I understand it, is an encapsulation of all classic 7-bits and
> 8-bits character sets.
> How many applications use ISO2022?
> How do I say I send Ascii, or Latin-1?
>
> Wouldn't it be better, instead of ISO 2022, to make a list (extendable) of
> character sets used? We could give them OID's.
> I think we need the following:
>
> ASCII
> Extended ASCII
> ANSI
> ALA
> Latin-1
> Extended-Latin
> ...
>
> probably a few more
Received on Wednesday, 13 March 2002 08:33:55 UTC