Registration of new charset BOCU-1

(This is a proposal for a registration; I am using the template from RFC 2978.)

Charset name: BOCU-1

Charset aliases: (none, except for the implicit csBOCU-1)

Suitability for use in MIME text: Yes

Published specifications:
     Specification of BOCU-1 with sample code for conversion to/from Unicode:
     http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html

     Description of the general "BOCU" algorithm,
     with a link to the BOCU-1 specification:
     http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html

     A converter implementation that is conformant to this specification is
     available in ICU (http://oss.software.ibm.com/icu/), an open-source library.
     The BOCU-1 converter C source code is in icu/source/common/ucnvbocu.c:
     http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/ucnvbocu.c

     CCS & CES: The BOCU-1 charset is a combination of the
     Unicode and ISO 10646 Coded Character Set (CCS) with
     the Character Encoding Scheme (CES) specified in
     the above document. It covers exactly the
     UTF-16-reachable subset of ISO 10646.

ISO 10646 equivalency table:
     Algorithmic, see published specification and sample code.

Additional information:
     BOCU-1 is an encoding (CES/TES) of Unicode/ISO 10646
     for the storage and exchange of text data.
     It is stateful and provides a good byte/code point ratio while
     being directly usable in SMTP emails, database fields and other contexts.

     BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU.
     It is useful for short strings and maintains code point order.

     BOCU-1 does not encode most ASCII characters with US-ASCII byte values.

     There is a Unicode signature byte sequence defined
     (FB EE 28, see specification).

     BOCU-1 is suitable for
     - databases: maintains Unicode code point order
     - emails: directly suitable for MIME text
     - CVS and similar: deterministic and resets at CR and LF

     BOCU-1 is not suitable for
     - efficient internal processing (convert to UTF-8/16/32)
     - contexts where encoding declarations _in_ documents _must_ be ASCII-readable

Person & email address to contact for further information:
     Markus W. Scherer
     IBM Globalization Center of Competency
     5600 Cottle Road
     Mail Stop: 50-2/B11
     San Jose, CA 95193
     USA

     markus.scherer@jtcsv.com
     markus.scherer@us.ibm.com

Intended usage: LIMITED USE

----
Suggested MIBenum value: 1020
     (first available in Unicode/ISO 10646 range; like SCSU [which is 1011])


Thank you for your consideration,

markus

Received on Tuesday, 9 July 2002 13:37:35 UTC