NET-TEXT: an extension of UTF-2 (long) from Luc Rooijakkers on 1993-07-17 (ietf-charsets@w3.org from July to September 1993)

From: Luc Rooijakkers <lwj@cs.kun.nl>
Date: Sat, 17 Jul 1993 22:02:13 +0200
To: ietf-charsets@INNOSOFT.COM
Cc: lwj@cs.kun.nl
Message-id: <199307172002.AA17215@zeus.cs.kun.nl>
		Suggestions for a NET-TEXT encoding,
		a compatible replacement for NET-ASCII

Introduction

During the Amsterdam IETF, a BOF was held on the topic of character
sets. There seemed to be consensus among the participants that going with 
a UTF-2-like encoding of ISO 10646 would be preferable, but there were
also some points raised:

	* UTF-2 character encodings rapidly grow with the character
	  code, which is not an issue for European use but might
	  be for Asian use or for groups or planes to be defined by ISO,
	  depending on their placement in the 10646 coding space.

	* 10646 does not include all characters that are in widespread use.

	* The CJK unification is not endorsed by everyone.

Keith Moore raised the question whether we could extend ISO 10646, perhaps
making use of groups, planes or zones reserved for private use.
John Klensin then pointed out that ISO has a long history of retracting
such reservations, at which point the Internet would then have a severe
problem. Thus, extending 10646 does not seem the way to go.

There is a different route, however. The UTF-2 encoding, even when
extended to 32 bits like X/Open has proposed, has unused octet
sequences, and it is possible to make use of this in a way that is
compatible with other UTF-2 systems. The remainder of this message
enumerates the available coding options for UTF-2 extensions and
suggests some possible uses of the available coding space.

The UTF-2 encoding

First I introduce the UTF-2 encoding and my understanding of the
proposed X/Open extension to 32 bits. Since I do not have definitive
references on the latter, I may be wrong in minor details, but this
should not affect the basic principles of the method. See the references
in my earlier posting to find out more about the history and motivation
of UTF-2.

The extended UTF-2 encoding is essentially a way of coding 32-bit codes into
variable length octet sequences. In practice, the 32-bit codes represent
characters from ISO 10646. I use the following definitions (inspired by
the rune.c file from the Plan 9 text editor, Sam):

	T0 = 0xxxxxxx
	Tx = 10xxxxxx
	T1 = 110xxxxx
	T2 = 1110xxxx
	T3 = 11110xxx
	T4 = 111110xx
	T5 = 111111xx

Octet sequences representing a single 32-bit code consist of one of the
Tn codes, where n is 0 to 5, followed by n Tx codes (one may think of
"x" as "extension").

The correspondence between 32-bit codes and octet sequences is as
follows:

 T0:	00000000 00000000 00000000 0bbbbbbb

   <->	0bbbbbbb


 T1:	00000000 00000000 00000bbb bbbbbbbb
	
   <->	110bbbbb 10bbbbbb
	

 T2: 	00000000 00000000 bbbbbbbb bbbbbbbb

   <-> 	1110bbbb 10bbbbbb 10bbbbbb


 T3:	00000000 000bbbbb bbbbbbbb bbbbbbbb

   <->	11110bbb 10bbbbbb 10bbbbbb 10bbbbbb


 T4:	000000bb bbbbbbbb bbbbbbbb bbbbbbbb

   <->	111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb


 T5:	bbbbbbbb bbbbbbbb bbbbbbbb bbbbbbbb

   <->	111111bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

If a 32-bit code can be represented by multiple octet sequences,
the shortest one is chosen. This is actually a Plan 9 requirement,
and X/Open may have relaxed it. This is not fatal, however.
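
The table above can be sketched in Python as follows. This is only an
illustration of the extended encoding as described in this message
(including the 111111xx form of T5, which is my understanding and not a
confirmed X/Open definition); the names encode, LEAD_BITS and LEAD_MARK
are hypothetical.

```python
# Payload bits in the lead octet of a sequence with n Tx octets (n = 0..5),
# and the fixed marker bits of that lead octet, per the T0..T5 table.
LEAD_BITS = [7, 5, 4, 3, 2, 2]
LEAD_MARK = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode(code: int) -> bytes:
    """Encode a 32-bit code using the shortest sequence that can hold it."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code must fit in 32 bits")
    # Pick the smallest n whose capacity (lead bits + 6 bits per Tx) suffices.
    for n in range(6):
        if code < 1 << (LEAD_BITS[n] + 6 * n):
            break
    octets = []
    for _ in range(n):
        octets.append(0x80 | (code & 0x3F))  # Tx octet: 10xxxxxx
        code >>= 6
    octets.append(LEAD_MARK[n] | code)       # Tn octet with remaining bits
    return bytes(reversed(octets))


print(encode(0x41).hex())    # 41
print(encode(0x20AC).hex())  # e282ac
```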

Free coding space

At first sight, it would appear that all possible octet sequences are
taken. This is not the case, however. For a given 32-bit code,
it is required that the shortest sequence be used. This frees up coding
space in the next longer sequence: it cannot have all zeroes in the bit
positions not covered by the next shorter sequence.

Thus, assuming we want to keep the length properties, the free coding space is

 T1':	1100000x 10xxxxxx

 T2':	11100000 100xxxxx 10xxxxxx

 T3':	11110000 1000xxxx 10xxxxxx 10xxxxxx

 T4':	11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx

 T5':	11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
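
To make the connection explicit, the sketch below decodes a single
sequence and rejects exactly those sequences that are not the shortest
form, i.e. the T1'..T5' patterns listed above. It assumes the same
T0..T5 layout as this message; decode_strict and lead_length are
hypothetical names, not taken from any existing implementation.

```python
LEAD_BITS = [7, 5, 4, 3, 2, 2]
LEAD_MASK = [0x80, 0xE0, 0xF0, 0xF8, 0xFC, 0xFC]
LEAD_MARK = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def lead_length(lead: int):
    """Number of Tx octets expected after this lead octet (None for a Tx)."""
    for n in range(6):
        if (lead & LEAD_MASK[n]) == LEAD_MARK[n]:
            return n
    return None


def decode_strict(seq: bytes) -> int:
    """Decode one complete sequence, refusing non-shortest forms."""
    n = lead_length(seq[0])
    if n is None or len(seq) != n + 1:
        raise ValueError("malformed sequence")
    code = seq[0] & ((1 << LEAD_BITS[n]) - 1)
    for octet in seq[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("expected a Tx octet")
        code = (code << 6) | (octet & 0x3F)
    # A code that would have fit in the next shorter sequence lies in the
    # free coding space (T1'..T5').
    if n > 0 and code < 1 << (LEAD_BITS[n - 1] + 6 * (n - 1)):
        raise ValueError("non-shortest form: in the free coding space")
    return code
```

A decoder without the final check would accept, for example, the octets
C1 BF as the code 7F.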
	
It is however not a good idea to use these exact sequences. The reason
is that a UTF-2 implementation may not check that the leading bits are
in fact non-zero (and indeed the Plan 9 implementation does not do this),
which would cause our extended codes to be mistakenly interpreted as 
valid 32-bit codes. However, a robust implementation will check that
the extension octets are in fact Tx octets. We can exploit this by
using a different number of Tx octets, which a UTF-2 implementation should
diagnose as a bad octet sequence. If we use a larger number, this may
not be diagnosed until the next sequence is decoded and this is
undesirable. Thus, we restrict ourselves to a smaller number of Tx
octets. This also has the desirable property that an extension sequence
should result in a single "bad" code for UTF-2 implementations.
The newly available sequences are then

 T1'0:	1100000x

 T2'1:	11100000 100xxxxx

 T3'1:	11110000 1000xxxx
 T3'2:	11110000 1000xxxx 10xxxxxx

 T4'1:	11111000 10000xxx
 T4'2:	11111000 10000xxx 10xxxxxx
 T4'3:	11111000 10000xxx 10xxxxxx 10xxxxxx

 T5'1:	11111100 100000xx
 T5'2:	11111100 100000xx 10xxxxxx
 T5'3:	11111100 100000xx 10xxxxxx 10xxxxxx
 T5'4:	11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx

provided that we use them in such a way that the length can be
determined from the sequence itself. Thus, we have freed

	2^1 +
	2^5 +
	2^4 + 2^10 +
	2^3 + 2^9 + 2^15 +
	2^2 + 2^8 + 2^14 + 2^20
     =
	2 +
	32 + 	
	16 + 1024 +
	8 + 512 + 32768 +
	4 + 256 + 16384 + 1048576
     =
	1099582

coding sequences (a little more than 2^20).
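
The total can be checked mechanically by counting the x bits in each of
the shortened patterns listed above. The dictionary below simply
restates those patterns; the variable names are illustrative.

```python
# The shortened sequences Tn'k from this message, written as bit patterns.
PATTERNS = {
    "T1'0": "1100000x",
    "T2'1": "11100000 100xxxxx",
    "T3'1": "11110000 1000xxxx",
    "T3'2": "11110000 1000xxxx 10xxxxxx",
    "T4'1": "11111000 10000xxx",
    "T4'2": "11111000 10000xxx 10xxxxxx",
    "T4'3": "11111000 10000xxx 10xxxxxx 10xxxxxx",
    "T5'1": "11111100 100000xx",
    "T5'2": "11111100 100000xx 10xxxxxx",
    "T5'3": "11111100 100000xx 10xxxxxx 10xxxxxx",
    "T5'4": "11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx",
}

# Each free x bit doubles the number of sequences a pattern can express.
counts = {name: 2 ** pattern.count("x") for name, pattern in PATTERNS.items()}
total = sum(counts.values())

print(total)            # 1099582
print(total - 2 ** 20)  # 51006
```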

Although it is possible to use non-Tx extension octets as well, this
destroys some of the nice properties of the UTF-2 encoding. In particular,
it complicates the algorithms for skipping code sequences and recognizing
the start of such sequences. Even the present modification destroys the
property that the length can be determined from the first byte, but
it is possible to choose the encoding in such a way that incomplete
octet sequences can be distinguished from complete sequences without
referring to octets that are not part of the sequence.
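
As a rough sketch of how a receiver might tell these cases apart, the
Python function below classifies one lead octet together with the Tx
octets that directly follow it. Note the assumption: the run is taken to
be already delimited (for instance by the next non-Tx octet), which is
weaker than the self-delimiting property discussed above, since this
message does not fix a length rule. The names classify and expected_tx
are hypothetical.

```python
# Lead octets whose payload bits are all zero, for n = 2..5 Tx octets.
ZERO_LEAD = {2: 0xE0, 3: 0xF0, 4: 0xF8, 5: 0xFC}


def expected_tx(lead: int):
    """Tx octets plain UTF-2 expects after this lead (None for a Tx octet)."""
    if lead < 0x80:
        return 0
    if lead < 0xC0:
        return None
    if lead < 0xE0:
        return 1
    if lead < 0xF0:
        return 2
    if lead < 0xF8:
        return 3
    if lead < 0xFC:
        return 4
    return 5


def classify(seq: bytes) -> str:
    """Return 'utf-2', 'extension' or 'malformed' for a non-empty run."""
    n = expected_tx(seq[0])
    tail = seq[1:]
    if n is None or any((o & 0xC0) != 0x80 for o in tail):
        return "malformed"
    k = len(tail)
    if k == n:
        return "utf-2"          # includes non-shortest forms; not checked here
    if k > n:
        return "malformed"
    if n == 1:
        ok = seq[0] in (0xC0, 0xC1)                      # T1'0
    else:
        # Tn'k with 1 <= k < n: zero lead payload, and the first Tx octet
        # carries only 7 - n free bits (100xxxxx, 1000xxxx, ...).
        ok = (k >= 1 and seq[0] == ZERO_LEAD[n]
              and (tail[0] & 0x3F) < (1 << (7 - n)))
    return "extension" if ok else "malformed"
```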

Use of the new coding space

There are various ways to use this coding space, each of them
compensating for some disadvantages of the "plain" UTF-2 10646 encoding.

For example, one way to use the new coding space is to allow reference to
every ECMA-registered character set, by coding the character set
reference together with the character. The T4'3 and T5'3 sequences are
ideal for 94^1 or 96^1 character sets, since they have just enough
bits. Often used ECMA sets could be given shorter sequences, by recoding
the character set reference. The T5'4 sequence is not able to code all
foreseeable 94^2 or 96^2 character sets, since it has only 6 bits
available for the character set reference. It should be more than enough,
however, for the currently registered ones (I suspect there are far fewer
than 64 94^2 or 96^2 sets registered). Note that this technique does not
use up all of the T4'3 and T5'3 coding space; character set references
always have codes greater than 30 hexadecimal (3/0 in the ISO notation).
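
To illustrate the bit budget, here is one possible way of packing a
7-bit character set reference and a 7-bit character into the 14 free
bits of a T5'3 sequence. The bit layout (reference in the high 7 bits,
character in the low 7 bits) is purely an assumption for this sketch;
no layout is fixed by this proposal, and pack_t5_3 and unpack_t5_3 are
hypothetical names. The value 0x41 in the usage line is an arbitrary
example reference.

```python
T5_LEAD = 0xFC  # 11111100: T5 lead octet with zero payload bits


def pack_t5_3(set_ref: int, char: int) -> bytes:
    """Pack two 7-bit fields into 11111100 100000xx 10xxxxxx 10xxxxxx."""
    if not (0 <= set_ref < 0x80 and 0 <= char < 0x80):
        raise ValueError("both fields must fit in 7 bits")
    v = (set_ref << 7) | char          # 14-bit value (assumed layout)
    return bytes([
        T5_LEAD,
        0x80 | (v >> 12),              # top 2 bits
        0x80 | ((v >> 6) & 0x3F),      # middle 6 bits
        0x80 | (v & 0x3F),             # low 6 bits
    ])


def unpack_t5_3(seq: bytes):
    """Inverse of pack_t5_3; returns (set_ref, char)."""
    if len(seq) != 4 or seq[0] != T5_LEAD or (seq[1] & 0xFC) != 0x80:
        raise ValueError("not a T5'3 sequence")
    if any((o & 0xC0) != 0x80 for o in seq[2:]):
        raise ValueError("expected Tx octets")
    v = ((seq[1] & 0x03) << 12) | ((seq[2] & 0x3F) << 6) | (seq[3] & 0x3F)
    return v >> 7, v & 0x7F


print(pack_t5_3(0x41, 0x69).hex())  # fc8283a9
```

With the same reasoning, T5'4 offers 20 bits, of which 14 are needed
for a two-octet character, leaving the 6 bits mentioned above for the
character set reference.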

Of course, characters that are part of the 10646 BMP should be coded
using the "normal" UTF-2 sequences, to avoid requiring enormous mapping
tables in each implementation. This technique does provide an escape
hatch from the CJK unification, however, for people who deem it
necessary.

The new coding space can also be used to efficiently code any future
extensions to ISO 10646, by compact plane and/or group encoding
(e.g. using sign extension and/or skipping zero bits). Since we
do not yet know what planes or groups these are going to be,
we cannot specify this mapping now. If we fix the length determination,
however, current implementations will behave gracefully when such
extensions are made. We could require implementations to make
the translation table-driven, so that future extensions are easy.
There are not that many bits available for this purpose, assuming
we are not going to do transformations on row/column values.

Finally, note that there are 2 one-octet codes available. Does
somebody know an often-used character that is not in ISO 10646?

There is one desirable property that is lost by these techniques,
which is that a character has only a single representation. It is
preserved for the 10646 subset, however, and any extension
would have the same effect if extension characters are incorporated in
10646 at some future date.

Implementation

As with all Internet standards, we can only mandate "on the wire"
behaviour. However, it is useful to reflect somewhat on
implementation aspects of this scheme.

Any implementation purporting to eventually support full ISO 10646
must use more than 32 bits to represent characters, since we have
introduced about 2^20 extra characters. Thus, 33 bits should be
enough for most practical uses. How these bits are used is not
particularly important, but one could represent extension codes
by storing the first octet in the high order bits, together with some
indication of the total length, while all the x bits are stored in
the low order bits.
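
One such representation could look like the Python sketch below. The
field positions (a flag in bit 32, the first octet in bits 23-30, the
total length in bits 20-22 and up to 20 x bits in bits 0-19) are an
assumption chosen for illustration only, as are the names pack_ext and
unpack_ext.

```python
EXT_FLAG = 1 << 32  # set for extension codes; clear for plain 32-bit codes


def pack_ext(first_octet: int, length: int, xbits: int) -> int:
    """Store an extension sequence in a 33-bit internal code."""
    if not (0 <= first_octet <= 0xFF and 1 <= length <= 5
            and 0 <= xbits < 1 << 20):
        raise ValueError("field out of range")
    return EXT_FLAG | (first_octet << 23) | (length << 20) | xbits


def unpack_ext(value: int):
    """Return (first_octet, length, xbits) for an extension code."""
    if not value & EXT_FLAG:
        raise ValueError("plain 32-bit code, not an extension code")
    return (value >> 23) & 0xFF, (value >> 20) & 0x7, value & 0xFFFFF


code = pack_ext(0xFC, 5, 0xFFFFF)
print(code.bit_length())  # 33
print(unpack_ext(code))   # (252, 5, 1048575)
```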

This presumes a stateless encoding, however. Translation to ISO 2022
should be easy, since ECMA has registered references for the ISO 10646
set. Translation from ISO 2022 is more difficult, since ISO 10646
characters must be represented as such; this may require translation
tables in some cases. If such tables are not available, translation
of the ASCII subset should be trivial and provides good fallback
behaviour.

For some applications, it would actually be preferable to keep the byte
stream form, since this guarantees information preservation and is
not *that* inefficient; the overhead is at most 50% over any other
stateless encoding. One such application that comes to mind is mail
forwarding.

Applications

This encoding should be suitable for most Internet protocols that
currently use NET-ASCII, and it should be usable with the DNS as is,
although there are some complications with regard to case transformation
(the DNS is supposed to match domain names case-insensitively). This could
easily be cured, however, by restricting the character set that may be
used in domain names.

Of course, most protocols would need some form of negotiation to make
sure that both ends understand NET-TEXT as opposed to NET-ASCII. 
For most implementations this would be a trivial addition, however;
it mostly requires 8-bit transparency, which is in general not difficult
to achieve.

--
Luc Rooijakkers                                 Internet: lwj@cs.kun.nl
SPC Group, the Netherlands                      UUCP: uunet!cs.kun.nl!lwj

Received on Saturday, 17 July 1993 13:02:33 UTC