- From: Mark Reichert <markr@sirs.com>
- Date: Wed, 13 Mar 2002 08:33:37 -0500
- To: "Z39.50 LISTSERV" <www-zig@w3.org>
- Cc: "Pieter Van Lierop" <pvanlierop@geac.fr>
ISO 2022 is an escaping mechanism for switching between character sets. Most internationally defined "standard" character sets pre-dating ISO 10646/Unicode adhere to the rules ISO 2022 specifies for defining a character set. ISO 2022 is the default character encoding scheme for Z39.50 InternationalStrings (ASN.1 string types encoded in BER), in the absence of character set negotiation.

ISO 2022 allows for use of any registered character set and for privately defined character sets. Character sets may contain graphic characters or control characters. Graphic character sets may contain up to 94 or 96 characters, or 94^n or 96^n characters, where n is the number of bytes per character. Control character sets can contain 32 characters.

In the IANA charset world, ISO 2022 character sets are referred to as ISO IR n, where n is the registration number assigned to the character set by ECMA, maintainer of the registry, though IANA couples the named graphic character set with other numbered graphic and control character sets. In addition to a registry number, character sets also have one or more bytes associated with them that identify the character set in an ISO 2022 escape sequence. An ISO 2022 escape sequence can be specified by the regular expression [1B][20-2F]*[30-7E] (byte values in hex: ESC, any number of intermediate bytes, then one final byte). ISO 2022 also defines semantics for SHIFT IN (0F) and SHIFT OUT (0E).

ISO 2022 already defines a rich OID syntax for fully specifying character set usage and encodings (all character sets in use, which registers are initially being used, etc.). Z39.50 chose to ignore these OIDs in its character set negotiation definition, though having the information structured as in Z39.50 character set negotiation is a little more user-friendly than assigning all the correct semantics to the individual components of ISO 2022 OIDs. Assigning OIDs to all combinations of character sets in ISO 2022 would generate a very, very long, if not infinite, list of OIDs.
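To make the escape sequence pattern and the set sizes above a little more concrete, here is a rough Python sketch. The function names are my own and purely illustrative; nothing here comes from Z39.50 or any particular toolkit. It only locates escape sequences by their byte shape; it does not interpret which character set they designate.

```python
import re

# ESC (1B), zero or more intermediate bytes (20-2F), then one final byte (30-7E).
ESCAPE_SEQUENCE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def find_escape_sequences(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, raw bytes) for every escape sequence found in data."""
    return [(m.start(), m.group()) for m in ESCAPE_SEQUENCE.finditer(data)]


def graphic_set_size(positions: int, bytes_per_char: int) -> int:
    """Capacity of a graphic set: 94^n or 96^n for n bytes per character."""
    if positions not in (94, 96):
        raise ValueError("graphic sets are built on 94 or 96 positions")
    if bytes_per_char < 1:
        raise ValueError("bytes_per_char must be at least 1")
    return positions ** bytes_per_char


if __name__ == "__main__":
    # ESC ( B designates ISO IR 6 (ASCII graphics) into G0.
    sample = b"abc\x1b(Bdef"
    for offset, seq in find_escape_sequences(sample):
        print(offset, seq.hex(" "))
    print(graphic_set_size(94, 2))
```

A double-byte 94-character set therefore has room for 94 * 94 = 8836 characters.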
There are currently approximately 23 registered control character sets, 109 registered single-byte graphic character sets, and 18 multi-byte graphic character sets (all double-byte). The registry can be viewed at http://www.itscj.ipsj.or.jp/ISO-IR/.

(Most "MARC" syntaxes also have character sets based on ISO 2022. USMARC/MARC21 is encoded using a mix of two control character sets, single-byte graphic character sets, and the triple-byte CJK/EACC graphic character set. Most of these character sets are privately defined by the MARC standard. The MARC21 standard veers somewhat from standard ISO 2022 in some of its escape sequence usage. To be fair, MARC21 references ANSI X3.41, not ISO 2022.)

What is typically referred to as US-ASCII is officially made up of two character sets, one 32-character control set and one 94-character graphic set. The two remaining characters are SPACE (20) and DELETE (7F), which are in place whenever a 94-character graphic set is used in ISO 2022's GL (graphic left register).

ISO 2022 also allows for registration of "coding systems" that don't adhere to the "rules" of ISO 2022. Some of these can be switched to with no standard way to return to ISO 2022; others have a standard return. There are approximately 21 coding systems of this type, many of them various encodings of Unicode/ISO 10646.

The "rules" of ISO 2022 are way too complicated to explain briefly. A free, identical version of ISO 2022 (published as ECMA-35) can be downloaded at ftp://ftp.ecma.ch/ecma-st/Ecma-035.pdf.

Essentially ISO 2022 defines six registers: four for graphic character sets (G0-G3) and two for control character sets (C0-C1). In 8-bit encoding up to four registers can be used simultaneously. The four registers in use correspond to the byte ranges 00-1F, 20-7F, 80-9F, A0-FF. These are named CL (control left), GL (graphic left), CR (control right), GR (graphic right), respectively. So in 8-bit encoding two control character sets and two graphic character sets can be used simultaneously.
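The four byte ranges just described can be sketched in a few lines of Python. This is only an illustration of the 8-bit layout; the function name code_area is my own, and the sketch says nothing about which character set happens to be loaded into each area.

```python
def code_area(byte: int) -> str:
    """Name the 8-bit code area a byte value falls into.

    00-1F -> CL (control left),  20-7F -> GL (graphic left),
    80-9F -> CR (control right), A0-FF -> GR (graphic right).
    """
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0-255)")
    if byte <= 0x1F:
        return "CL"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "CR"
    return "GR"


if __name__ == "__main__":
    for value in (0x1B, 0x20, 0x41, 0x7F, 0x85, 0xE9):
        print(f"{value:02X} -> {code_area(value)}")
```

Note that 20 and 7F fall in GL by byte range even though, with a 94-character set loaded, they are SPACE and DELETE rather than members of the set.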
ISO 2022 defines many escape sequences, the most important of which are loading character sets into registers, shifting registers, etc. For 7-bit encoding, escape sequences are defined for shifting registers accordingly so character encodings fall into the two byte ranges 00-1F, 20-7F.

In Z39.50 version 3 implementors are using GeneralString semantics for InternationalString, in the absence of character set negotiation. As such, one is essentially starting in ASCII (ISO IR 1 in C0/CL, ISO IR 6 (or 2 depending on which year's ASN.1 and BER standards one is reading) in G0/GL). GeneralString allows for escape sequences and use of all control and graphic character sets. Whether or not switching to other coding systems (like UTF-8) is allowable in GeneralStrings depends on one's reading of ASN.1 and BER, but really the standard says: All G (graphic) and all C (control) sets + SPACE + DELETE. There is no mention of other coding systems. Switching character sets/loading other character sets is possible at any time using the appropriate escape sequences.

In Z39.50 version 2, VisibleString semantics are in place for InternationalString. VisibleString allows for ISO IR 6 (or 2 depending on which year's ASN.1 and BER standards one is reading) in G0/GL, and SPACE. No control characters or escape sequences are allowed.

----- Original Message -----
From: "Pieter Van Lierop" <pvanlierop@geac.fr>
To: "zig" <www-zig@w3.org>
Sent: Wednesday, March 13, 2002 5:28 AM
Subject: character encoding ISO 2022

> Please forgive my ignorance but what is ISO 2022 exactly?
>
> The choice in the character set negotiation is between:
> ISO2022
> ISO10646
> Private
>
> ISO2022, as I understand it, is an encapsulation of all classic 7-bits and
> 8-bits character sets.
> How many applications use ISO2022?
> How do I say I send Ascii, or Latin-1?
>
> Wouldn't it be better, instead of ISO 2022, to make a list (extendable) of
> character sets used? We could give them OID's.
> I think we need the following:
>
> ASCII
> Extended ASCII
> ANSI
> ALA
> Latin-1
> Extended-Latin
> ...
>
> probably a few more
Received on Wednesday, 13 March 2002 08:33:55 UTC