C3 System for character set conversion available for alpha testing

Trans-European Research and Education      ANNOUNCEMENT         Prototype Ap45
Networking Association (TERENA)
                                           1994-12-15
Coded Character Set Conversion
Task-Force (C3-TF)




ALPHA TEST RELEASE OF THE C3 SYSTEM FOR CODED CHARACTER SET CONVERSION

  
TERENA (formerly RARE) has supported the development of better
tools for conversion between the continuously growing number of
coded character sets in use in academic computer networks in
Europe. The intention is to produce a general and flexible
system for Coded Character set Conversion, called

                  >>>  The C3 System  <<<

This is the announcement of the alpha test release of software
(for Unix) and tables for the C3 System for *limited*
distribution amongst interested implementors, system
administrators and users.

  +-------------------------------------------------------+
  ! Notice that this is a pre-release of software under   !
  ! development, which has not yet been thoroughly tested !
  ! and is not intended for production use.               !
  +-------------------------------------------------------+

The package consists of:

>  ANSI C code for a software library implementing parts of
   the C3 API (see below).
   
>  ANSI C code for a program "ccconv", which can be used either
   as a character stream conversion filter or as a file conversion
   program.

>  Binary files for this software, compiled for SunOS 4.3.x

>  Approximation table (see below)

>  Definition tables for the following coded character sets:

      ASCII                                  ANSI X3.4
      Swedish general 7-bit character set    SS 63 61 27
      Swedish 7-bit character set for names  SS 63 61 27
      Norwegian 7-bit character set          NS 4551
      UK 7-bit character set                 BS 4730
      Croatian/Slovene 7-bit character set   JUS I.Bl. 002
      Latin-1 8-bit character set            ISO 8859-1
      Latin-2 8-bit character set            ISO 8859-2
      Latin-Cyrillic 8-bit character set     ISO 8859-5
      Original IBM PC character set          IBM CP437
      International IBM PC character set     IBM CP850
      Macintosh Extended Roman character set
      UCS in 2-octet form at level 1         ISO 10646

>  Documentation files:

      Introduction to the C3 System (8 pages)
      Directions for the installation of the C3 System (2 pages)
      How to use the "ccconv" file conversion utility (4 pages)
      How to use the C3 library of C functions (18 pages)
      Explanation of identifiers and names used in C3 (2 pages)
      Specification of the C3 API for conversion functions (37 pages)

The software is developed with the GNU gcc compiler, but any C
compiler allowing "const" and ANSI C function prototypes should work.

The latest C3 distribution and other C3 information are
available on the World Wide Web through

   <URL:http://www.nada.kth.se/i18n/c3/>

or by anonymous FTP to ftp.nada.kth.se, directory
"pub/i18n/c3", i.e.

    <URL:ftp://ftp.nada.kth.se/pub/i18n/c3/>

Email addresses:

<c3-questions@nada.kth.se>       Questions, comments, bug reports, etc.
<c3-info-request@nada.kth.se>    Subscription to info-about-C3 list
<c3-request@nada.kth.se>         Subscription to discussion-about-C3 list
<c3@nada.kth.se>                 Contribution to discussion-about-C3 list

Features list:

+  Full _generality_: conversion can be done in any direction 
   between any pair of the coded character sets included in the 
   system.

+  _Approximate conversion_ when exact conversion is impossible:
   There is no arbitrary identification of different characters
   in the source and the target character sets. If the target
   character set lacks a source character, the best possible
   replacement character or string is used.

+  Can handle not only simple 7-bit and 8-bit coded character
   sets, but also _advanced character sets_ such as the 16-bit
   ISO 10646 character set (on implementation level 1) and
   stateful character sets like ISO 6937/T.61. Incomplete
   character sets, character sets lacking control characters,
   indeterministic character sets, and ambiguous character sets
   are also supported.

+  _Easy to use_ for the unsophisticated user (by means of 
   carefully chosen defaults).

+  _Flexible_ and fully configurable for the sophisticated 
   user/system administrator/application developer.

+  _Conversion parameters_ control the exact conversions 
   performed: different needs or restrictions in different 
   situations are easily handled by means of
   -  the three conversion types (one-to-one, legible, 
      reversible)
   -  separate specification of the conversion of line breaks
   -  the factor system (for varying cultural expectations 
      affecting preferable approximate conversions).

+  _Easy to customize_: The conversion tables use a format 
   optimized for human readability which only uses the subset of 
   82 graphic characters available in all coded character sets. 
   ISO 10646 hexadecimal values are used to refer to characters.
   Different full sets of conversion tables can be used in 
   parallel.

+  _Simple to extend_: To add a new coded character set, only 
   provide a definition table for it and approximate conversions 
   for any character in it that isn't included in any already 
   defined coded character set.

+  _Scalable_: To fully define the N(N-1) possible conversion 
   paths between N different coded character sets, only N+1 
   conversion tables are needed. How conversion is to be done
   is defined by means of ISO 10646 as a common interface, but
   the actual conversion is a direct transformation from
   source character set to target character set, not involving
   a 10646 representation as an intermediate step. Temporary
   files are not needed.
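
The scalability point above can be illustrated with a minimal
sketch in Python (not the C3 library itself, which is ANSI C).
It assumes that a definition table maps each code value of a
coded character set to an ISO 10646 code point. The table
contents are heavily truncated, and the names DEFINITIONS,
build_direct_table and convert are hypothetical. The approximation
table (the "+1") is left out; it would supply replacements for
the characters that the composition step cannot match.

```python
from itertools import permutations

# Hypothetical, heavily truncated definition tables:
# code value in the coded character set -> ISO 10646 code point.
DEFINITIONS = {
    "ISO 8859-1": {0x41: 0x0041, 0xC5: 0x00C5, 0xE9: 0x00E9},
    "IBM CP437": {0x41: 0x0041, 0x8F: 0x00C5, 0x82: 0x00E9},
    "IBM CP850": {0x41: 0x0041, 0x8F: 0x00C5, 0x82: 0x00E9},
}


def build_direct_table(source, target):
    """Compose source->UCS and UCS->target into one direct table."""
    ucs_to_target = {ucs: code for code, ucs in DEFINITIONS[target].items()}
    return {
        code: ucs_to_target[ucs]
        for code, ucs in DEFINITIONS[source].items()
        if ucs in ucs_to_target  # characters missing here need fall-back
    }


def convert(data, table):
    """Convert octets directly; no UCS values appear at this stage."""
    return bytes(table[octet] for octet in data)


# N character sets give N(N-1) ordered (source, target) pairs.
PATHS = list(permutations(DEFINITIONS, 2))
LATIN1_TO_CP437 = build_direct_table("ISO 8859-1", "IBM CP437")
```

With three definition tables the sketch covers all six conversion
paths, and ISO 10646 is used only while the direct table is being
built, not while the data is converted.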

What's unique in the C3 System?  

The approximation table is the most innovative element in the
C3 approach to character set conversion. It specifies for each
character in any of the character sets for which definition
tables are given, how it is to be represented approximately
(by fall-back) in the target character set, if the character is
_not_ included in that character set. Several alternative
representations are specified for some characters, to take
advantage of the different character repertoires of different
target character sets.
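
As a rough illustration of how alternative representations could
be selected, the Python sketch below tries each alternative in
order and keeps the first one that the target repertoire can
express. The entries in APPROXIMATIONS, the "?" used as last
resort and the function name approximate are assumptions made
for this example; they are not taken from the C3 tables.

```python
# Hypothetical approximation entries: character -> alternatives,
# listed from most to least preferred.
APPROXIMATIONS = {
    "\u00D8": ["\u00D6", "O"],  # O WITH STROKE: O WITH DIAERESIS, else O
    "\u00C5": ["Aa", "A"],      # A WITH RING ABOVE
}

# Simplified target repertoires (graphic characters only).
ASCII = {chr(code) for code in range(0x20, 0x7F)}
LATIN2_SUBSET = ASCII | {"\u00D6", "\u00C4"}


def approximate(ch, repertoire):
    """Return ch itself, or the first alternative the target can express."""
    if ch in repertoire:
        return ch
    for alternative in APPROXIMATIONS.get(ch, ()):
        if all(c in repertoire for c in alternative):
            return alternative
    return "?"  # assumed last resort when no alternative fits
```

Here approximate("\u00D8", LATIN2_SUBSET) picks the first
alternative, while the same call with ASCII falls through to
the plain letter "O".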

The conversion tables use only the invariant part of ASCII. To
indicate other characters, the hexadecimal form of the coded
representations in UCS is used. No information specific to a
certain coded character set is included in the approximation
table.
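
The restriction to invariant characters can be checked
mechanically. The Python sketch below assumes that the invariant
part is the set of 82 graphic characters left when the twelve
nationally variable positions of ISO 646 are removed from ASCII.
The sample line "00C5 Aa" and the helper names are invented for
illustration and do not show the actual C3 table syntax.

```python
import string

# Letters, digits and the 20 punctuation marks shared by the
# national 7-bit variants: 52 + 10 + 20 = 82 graphic characters.
INVARIANT = set(string.ascii_letters + string.digits + "!\"%&'()*+,-./:;<=>?_")


def uses_only_invariant(line):
    """True if the line contains only invariant characters and spaces."""
    return all(c in INVARIANT or c == " " for c in line)


def ucs_hex(ch):
    """Four-digit hexadecimal UCS value of a character, e.g. 00C5."""
    return format(ord(ch), "04X")
```

A table line written this way survives transfer through any of
the 7-bit national character sets listed above, since none of
its characters occupies a nationally variable position.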

The approximation table defines three types of conversion which
the user can choose from: Type 1 converts one source character
to one target character (best for tables and fields with length
restrictions). Type 2 converts characters to a more
understandable approximate representation, which may consist
of one or a few target characters (best for prose). Type 3 is
a reversible one-character-to-many-characters conversion, which
is based on the mnemonics defined by RFC 1345.
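
The difference between the three types can be sketched in
Python as follows. The FALLBACKS entries are made up for this
example. The type 3 column uses two-character RFC 1345 mnemonics
preceded by "&", and that prefix convention is an assumption
about the output form, not something stated in this announcement.
A truly reversible scheme would also have to escape a literal
"&" in the source text, which the sketch omits.

```python
# Hypothetical entries: character -> (type 1, type 2, type 3).
FALLBACKS = {
    "\u00C5": ("A", "Aa", "&AA"),  # A WITH RING ABOVE
    "\u00D8": ("O", "Oe", "&O/"),  # O WITH STROKE
    "\u00E9": ("e", "e", "&e'"),   # E WITH ACUTE
}

ASCII = {chr(code) for code in range(0x20, 0x7F)}


def convert(text, conv_type, repertoire):
    """Replace characters missing from the repertoire per conversion type."""
    out = []
    for ch in text:
        if ch in repertoire:
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch][conv_type - 1])
        else:
            out.append("?")  # assumed placeholder for unknown characters
    return "".join(out)
```

Type 1 keeps the length of the text unchanged, which matters
for fixed-width fields, while types 2 and 3 may lengthen it.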

The C3 Task Force within TERENA consists of:

   Borka Jerman-Blazic <jerman-blazic@ijs.si>
   Olle Jarnefors      <ojarnef@admin.kth.se>
   Peter Svanberg      <psv@nada.kth.se>
   Keld Simonsen       <keld@dkuug.dk>

Received on Thursday, 15 December 1994 15:02:46 UTC