- From: Luc Rooijakkers <lwj@cs.kun.nl>
- Date: Sat, 17 Jul 1993 22:02:13 +0200
- To: ietf-charsets@INNOSOFT.COM
- Cc: lwj@cs.kun.nl
Suggestions for a NET-TEXT encoding, a compatible replacement for NET-ASCII

Introduction

During the Amsterdam IETF, a BOF was held on the topic of character
sets. There seemed to be consensus among the participants that going
with a UTF-2-like encoding of ISO 10646 would be preferable, but some
points were also raised:

* UTF-2 character encodings grow rapidly with the character code. This
  is not an issue for European use, but it might be for Asian use or
  for groups or planes yet to be defined by ISO, depending on their
  placement in the 10646 coding space.

* 10646 does not include all characters that are in widespread use.

* The CJK unification is not endorsed by everyone.

Keith Moore raised the question whether we could extend ISO 10646,
perhaps making use of groups, planes or zones reserved for private
use. John Klensin then pointed out that ISO has a long history of
retracting such reservations, at which point the Internet would have a
severe problem. Thus, extending 10646 does not seem the way to go.

There is a different route, however. The UTF-2 encoding, even when
extended to 32 bits as X/Open has proposed, has unused octet
sequences, and it is possible to make use of this in a way that is
compatible with other UTF-2 systems. The remainder of this message
enumerates the available coding options for UTF-2 extensions and
suggests some possible uses of the available coding space.

The UTF-2 encoding

First I introduce the UTF-2 encoding and my understanding of the
proposed X/Open extension to 32 bits. Since I do not have definitive
references on the latter, I may be wrong in minor details, but this
should not affect the basic principles of the method. See the
references in my earlier posting for more about the history and
motivation of UTF-2.

The extended UTF-2 encoding is essentially a way of coding 32-bit
codes into variable-length octet sequences. In practice, the 32-bit
codes represent characters from ISO 10646.
I use the following definitions (inspired by the rune.c file from the
Plan 9 text editor, Sam):

        T0 = 0xxxxxxx
        Tx = 10xxxxxx
        T1 = 110xxxxx
        T2 = 1110xxxx
        T3 = 11110xxx
        T4 = 111110xx
        T5 = 111111xx

Octet sequences representing a single 32-bit code consist of one of
the Tn codes, where n is 0 to 5, followed by n Tx codes (one may think
of "x" as "extension"). The correspondence between 32-bit codes and
octet sequences is as follows:

T0: 00000000 00000000 00000000 0bbbbbbb
        <-> 0bbbbbbb
T1: 00000000 00000000 00000bbb bbbbbbbb
        <-> 110bbbbb 10bbbbbb
T2: 00000000 00000000 bbbbbbbb bbbbbbbb
        <-> 1110bbbb 10bbbbbb 10bbbbbb
T3: 00000000 000bbbbb bbbbbbbb bbbbbbbb
        <-> 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
T4: 000000bb bbbbbbbb bbbbbbbb bbbbbbbb
        <-> 111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
T5: bbbbbbbb bbbbbbbb bbbbbbbb bbbbbbbb
        <-> 111111bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

If a 32-bit code can be represented by multiple octet sequences, the
shortest one is chosen. This is actually a Plan 9 requirement, and
X/Open may have relaxed it. This is not fatal, however.

Free coding space

At first sight, it would appear that all possible octet sequences are
taken. This is not the case, however. For a given 32-bit code, it is
required that the shortest sequence be used. This frees up coding
space in the next longer sequence: it cannot have all zeroes in the
bit positions not covered by the next shorter sequence. Thus, assuming
we want to keep the length properties, the free coding space is

T1': 1100000x 10xxxxxx
T2': 11100000 100xxxxx 10xxxxxx
T3': 11110000 1000xxxx 10xxxxxx 10xxxxxx
T4': 11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
T5': 11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

It is, however, not a good idea to use these exact sequences.
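The correspondence table and the shortest-sequence rule can be checked
mechanically. A minimal encoder sketch follows; the function name
utf2_encode is mine, not part of any proposal, and the code assumes
the table exactly as given above:

```python
def utf2_encode(code):
    """Encode a 32-bit code as an extended UTF-2 octet sequence
    (T0..T5), always choosing the shortest representation."""
    if code < 0x80:                # T0: 7 payload bits
        return bytes([code])
    if code < 0x800:               # T1: 11 bits
        lead, n = 0xC0, 1
    elif code < 0x10000:           # T2: 16 bits
        lead, n = 0xE0, 2
    elif code < 0x200000:          # T3: 21 bits
        lead, n = 0xF0, 3
    elif code < 0x4000000:         # T4: 26 bits
        lead, n = 0xF8, 4
    else:                          # T5: 32 bits
        lead, n = 0xFC, 5
    # Lead octet carries the top bits; each Tx octet carries 6 bits.
    octets = [lead | (code >> (6 * n))]
    for i in range(n - 1, -1, -1):
        octets.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(octets)
```

For example, utf2_encode(0x7FF) yields the two-octet T1 sequence
DF BF, and utf2_encode(0xFFFFFFFF) the six-octet T5 sequence
FF BF BF BF BF BF.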
The reason is that a UTF-2 implementation may not check that the
leading bits are in fact non-zero (and indeed the Plan 9
implementation does not do this), which would cause our extended codes
to be mistakenly interpreted as valid 32-bit codes. However, a robust
implementation will check that the extension octets are in fact Tx
octets. We can exploit this by using a different number of Tx octets,
which UTF-2 implementations should diagnose as a bad octet sequence.
If we use a larger number, this may not be diagnosed until the next
sequence is decoded, which is undesirable. Thus, we restrict ourselves
to a smaller number of Tx octets. This also has the desirable property
that an extension sequence should result in a single "bad" code for
UTF-2 implementations. The newly available sequences are then

T1'0: 1100000x
T2'1: 11100000 100xxxxx
T3'1: 11110000 1000xxxx
T3'2: 11110000 1000xxxx 10xxxxxx
T4'1: 11111000 10000xxx
T4'2: 11111000 10000xxx 10xxxxxx
T4'3: 11111000 10000xxx 10xxxxxx 10xxxxxx
T5'1: 11111100 100000xx
T5'2: 11111100 100000xx 10xxxxxx
T5'3: 11111100 100000xx 10xxxxxx 10xxxxxx
T5'4: 11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx

provided that we use them in such a way that the length can be
determined from the sequence itself. Thus, we have freed

2^1 + 2^5 + 2^4 + 2^10 + 2^3 + 2^9 + 2^15 + 2^2 + 2^8 + 2^14 + 2^20
  = 2 + 32 + 16 + 1024 + 8 + 512 + 32768 + 4 + 256 + 16384 + 1048576
  = 1099582

coding sequences (a little more than 2^20). Although it is possible to
use non-Tx extension octets as well, this destroys some of the nice
properties of the UTF-2 encoding. In particular, it complicates the
algorithms for skipping code sequences and recognizing the start of
such sequences.
Even the present modification destroys the property that the length
can be determined from the first octet, but it is possible to choose
the encoding in such a way that incomplete octet sequences can be
distinguished from complete sequences without referring to octets that
are not part of the sequence.

Use of the new coding space

There are various ways to use this coding space, each of them
compensating for some disadvantage of the "plain" UTF-2 10646
encoding.

For example, one way to use the new coding space is to allow reference
to every ECMA-registered character set, by coding the character set
reference together with the character. The T4'3 and T5'3 sequences are
ideal for 94^1 or 96^1 character sets, since they have just enough
bits. Often-used ECMA sets could be given shorter sequences, by
recoding the character set reference. The T5'4 sequence is not able to
code all foreseeable 94^2 or 96^2 character sets, since it has only 6
bits available for the character set reference. It should be more than
enough, however, for the currently registered ones (I suspect there
are far fewer than 64 94^2 or 96^2 sets registered). Note that this
technique does not use up all of the T4'3 and T5'3 coding space;
character set references always have codes greater than 30 hexadecimal
(3/0 in the ISO notation).

Of course, characters that are part of the 10646 BMP should be coded
using the "normal" UTF-2 sequences, to avoid requiring enormous
mapping tables in each implementation. This technique does provide an
escape hatch from the CJK unification, however, for people who deem it
necessary.

The new coding space can also be used to efficiently code any future
extensions to ISO 10646, by compact plane and/or group encoding (e.g.
using sign extension and/or skipping zero bits). Since we do not yet
know what planes or groups these are going to be, we cannot specify
this mapping now.
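As an illustration of the character-set idea, one hypothetical bit
layout for a T4'3 sequence might place an 8-bit set reference above a
7-bit character code in the 15 x bits. Both the field widths and the
function name pack_t4_3 are my invention, not part of the proposal:

```python
def pack_t4_3(set_ref, char):
    """Pack a character-set reference and a character into a T4'3
    sequence: 11111000 10000xxx 10xxxxxx 10xxxxxx (15 x bits).
    The 8-bit/7-bit field split is a hypothetical layout."""
    assert 0 <= set_ref < 0x100 and 0 <= char < 0x80
    bits = (set_ref << 7) | char                  # 15 payload bits
    return bytes([0xF8,                           # 11111000
                  0x80 | (bits >> 12),            # 10000xxx (top 3 bits)
                  0x80 | ((bits >> 6) & 0x3F),    # 10xxxxxx
                  0x80 | (bits & 0x3F)])          # 10xxxxxx
```

Since the top three payload bits can be at most 7, the second octet
always stays in the 10000xxx range required for T4'3.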
If we fix the length determination, however, current implementations
will behave gracefully when such extensions are made. We could require
implementations to make the translation table-driven, so that future
extensions are easy. There are not that many bits available for this
purpose, assuming we are not going to do transformations on row/column
values.

Finally, note that there are 2 one-octet codes available. Does
somebody know an often-used character that is not in ISO 10646?

There is one desirable property that is lost by these techniques,
which is that a character has only a single representation. It is
preserved for the 10646 subset, however, and any extension would have
the same effect if extension characters are incorporated in 10646 at
some future date.

Implementation

As with all Internet standards, we can only mandate "on the wire"
behaviour. However, it is useful to reflect somewhat on implementation
aspects of this scheme.

Any implementation purporting to eventually support full ISO 10646
must use more than 32 bits to represent characters, since we have
introduced about 2^20 extra characters. Thus, 33 bits should be enough
for most practical uses. How these bits are used is not particularly
important, but one could represent extension codes by storing the
first octet in the high-order bits, together with some indication of
the total length, while all the x bits are stored in the low-order
bits. This presumes a stateless encoding, however.

Translation to ISO 2022 should be easy, since ECMA has registered
references for the ISO 10646 set. Translation from ISO 2022 is more
difficult, since ISO 10646 characters must be represented as such;
this may require translation tables in some cases. If such tables are
not available, translation of the ASCII subset should be trivial and
provides good fallback behaviour.
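The "first octet in the high-order bits" idea can be sketched as
follows. The concrete layout (lead octet at bit 40, length at bit 32,
x bits below) is a hypothetical choice of mine, meant only to show
that extension codes never collide with plain 32-bit 10646 codes:

```python
def internal_form(octets):
    """Map an extension octet sequence to a single integer wider than
    32 bits: lead octet in the high-order bits, sequence length next,
    and the 6-bit payloads of the Tx octets packed in the low bits.
    The concrete bit positions are a hypothetical layout."""
    lead, n = octets[0], len(octets)
    payload = 0
    for o in octets[1:]:
        assert o & 0xC0 == 0x80      # every continuation must be a Tx octet
        payload = (payload << 6) | (o & 0x3F)
    return (lead << 40) | (n << 32) | payload
```

Because the lead octet is at least 0xC0 and sits above bit 32, every
result exceeds 2^32, so extension codes are trivially distinguishable
from ordinary 10646 codes stored as themselves.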
For some applications, it would actually be preferable to keep the
byte stream form, since this guarantees information preservation and
is not *that* inefficient; the overhead is at most 50% over any other
stateless encoding. One such application that comes to mind is mail
forwarding.

Applications

This encoding should be suitable for most Internet protocols that
currently use NET-ASCII, and it should be usable with the DNS as is,
although there are some complications with regard to case
transformation (the DNS is supposed to match domain names
case-insensitively). This could easily be cured, however, by
restricting the character set that may be used in domain names.

Of course, most protocols would need some form of negotiation to make
sure that both ends understand NET-TEXT as opposed to NET-ASCII. For
most implementations this would be a trivial addition, however; it
mostly requires 8-bit transparency, which is in general not difficult
to achieve.

--
Luc Rooijakkers                         Internet: lwj@cs.kun.nl
SPC Group, the Netherlands              UUCP: uunet!cs.kun.nl!lwj
Received on Saturday, 17 July 1993 13:02:33 UTC