- From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
- Date: Mon, 19 Jul 1993 22:44:33 +0900 (JST)
- To: Guido.van.Rossum@cwi.nl
- Cc: ietf-charsets@INNOSOFT.COM
> You've been referring to this encoding in the working group also.  Do
> you have a description of how it works?  All I need is enough
> information to be able to implement it --

OK.

> no arguments about why it
> is better than UTF-2 are necessary (I can figure that out for myself).

But, to me, it is easier to figure out IUTF than to add the argument. :-)

> A pointer to an ftp site or WWW or Gopher server would be fine too.

It's not so lengthy even with the argument. So, it is attached at the
end of this mail.

BTW, I now think that, if we are to use almost raw UTF2 as an interim
encoding without enough consideration for the many languages with
non-European characters, we should not use the two octet UTF2 sequences
beginning with T1. That is, represent all non-ASCII characters with the
three octet form of UTF2. Then, the two octet sequences are reserved
for future international assignment.

Isn't it fair?

                                                Masataka Ohta

IUTF (Internationalized UTF) is an interchange form for ICODE
compatible to UTF2 (UCS Transformation Format 2).

UTF2 is an ASCII compatible variable length multi octet interchange
form for ISO 10646 proposed by X/Open.

UTF2 is designed considering

    1) compatibility to UNIX file system
    2) compatibility to existing programs
    3) easy conversion between UTF2 and ISO 10646
    4) that code length can be determined by the first octet
    5) that code length is short
    6) finite resynchronizability

In UTF2, an octet is classified as

    C0: 0~32,127
    A : 33~126
    Tx: 128~191
    T1: 192~223
    T2: 224~239
    T3: 240~247
    T4: 248~251
    T5: 252~253
    Ty: 254~255 (unused)

Then, the following combinations of octets

    Octet Sequence          code of ISO 10646
    C0                      0~32,127
    A                       33~126
    T1 Tx                   128~2047
    T2 Tx Tx                2048~2^16-1
    T3 Tx Tx Tx             2^16~2^21-1
    T4 Tx Tx Tx Tx          2^21~2^26-1
    T5 Tx Tx Tx Tx Tx       2^26~2^31-1

are used to represent characters in ISO 10646. Resynchronization of
character boundaries is possible by scanning at most 6 characters.
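The table above gives the octet classes and the code ranges, but not the bit layout inside each octet. What follows is a minimal Python sketch of an encoder built from that table. It assumes that each Tx octet carries six bits of the code and the lead octet carries the remaining high bits; that assumption is mine, not something stated in this mail. The names `utf2_encode`, `utf2_length` and `_FORMS` are made up for illustration.

```python
# (largest code, number of octets, lowest value of the lead octet class)
# The ranges and lead values come from the UTF2 table; the bit layout
# (6 bits per Tx octet, the rest in the lead octet) is an assumption.
_FORMS = [
    (0x7F,       1, 0),    # C0 / A
    (0x7FF,      2, 192),  # T1 Tx
    (0xFFFF,     3, 224),  # T2 Tx Tx
    (0x1FFFFF,   4, 240),  # T3 Tx Tx Tx
    (0x3FFFFFF,  5, 248),  # T4 Tx Tx Tx Tx
    (0x7FFFFFFF, 6, 252),  # T5 Tx Tx Tx Tx Tx
]


def utf2_encode(code: int) -> bytes:
    """Encode one ISO 10646 code (0 ~ 2^31-1) as a UTF2 octet sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code out of range for UTF2")
    # Pick the shortest form whose range contains the code.
    for limit, length, lead in _FORMS:
        if code <= limit:
            break
    if length == 1:
        return bytes([code])
    tail = []
    for _ in range(length - 1):
        tail.append(128 | (code & 0x3F))  # a Tx octet: 128~191
        code >>= 6
    return bytes([lead | code] + tail[::-1])


def utf2_length(first_octet: int) -> int:
    """Return the sequence length implied by the first octet alone."""
    if first_octet < 128:
        return 1                      # C0 or A
    if first_octet < 192:
        raise ValueError("Tx octet cannot start a sequence")
    for bound, length in ((224, 2), (240, 3), (248, 4), (252, 5), (254, 6)):
        if first_octet < bound:
            return length
    raise ValueError("Ty octets (254~255) are unused")
```

For example, `utf2_encode(2048)` yields a three octet sequence beginning with a T2 octet, and `utf2_length` applied to that first octet returns 3, which is design point 4) above.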
Note that, with UTF2, all the characters of major European
languages can be represented in two octets and all the existing
characters of ISO 10646 can be represented in three octets.

So, IUTF is designed considering

    0) compatibility to UTF2
    1) compatibility to UNIX file system
    2) compatibility to existing programs as interchange code
    3) fast conversion between IUTF and ISO 10646
    4) that code length can be determined without looking ahead
       extra octets
    5) that code length is short
    6) finite resynchronizability

that is, IUTF is upward compatible with UTF2 both in its format and
its design policy.

Note that 2) is a rather meaningless condition, as processing code
(ICODE, not IUTF, in this case) is used in existing programs, which
is also the processing model of multibyte/wide characters of ANSI C
and X/Open.

In IUTF, an octet is classified as

    C0: 0~32,127
    A : 33~126
    A': 33~46,48~126
    C1: 128~159
    Tx: 128~191
    T1: 192~223
    T2: 224~239 (=S2+S3+S4+S6+S7)
    S2: 224~229
    S3: 230~235
    S4: 236~237
    S6: 238
    S7: 239
    U1: 240~255

Then, the following combinations of octets

    Octet Sequence          code of ISO 10646
    C0                      0~32,127
    A                       33~126
    T1 Tx                   128~2047
    T2 Tx Tx                2048~65535

are used to represent characters as in UTF2. Thus, IUTF is compatible
to UTF2.

Then, the following combinations of octets are available to
represent extra characters.

    Octet Sequence              number of code points represented
    T1 A'                       2976
    T2 A'                       1488
    U1 A'                       1488
    U1 Tx                       1024
    T1 T2                       512
    T1 U1                       512
    U1 T2                       256
    S2 Tx A'                    35712
    S3 Tx A' Tx                 >2^21
    S4 Tx A' Tx Tx              >2^25
    S6 Tx A' Tx Tx Tx Tx        >2^36
    S7 Tx A' Tx Tx Tx Tx Tx     >2^42

Thus, all the characters in 21 bit ICODE can be represented in the
four octet form by a sequence beginning with S3. Resynchronization of
character boundaries is possible by scanning at most 8 characters.

IUTF thus has an extra 8256 (= 2976 + 1488 + 1488 + 1024 + 512 + 512
+ 256) two octet representations and 35712 three octet
representations, which can be used as shorthand notations for
characters such as frequently used non-European characters.
The actual assignment is not yet determined. Hash tables could be
used for the fast translation from ICODE to IUTF for such shorthand
notations.
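The counts quoted for the extra IUTF sequences can be checked mechanically: each count is the product of the sizes of the octet classes in the sequence. A small Python sketch follows, with the class ranges copied from the IUTF classification in this mail. The names `CLASSES`, `count` and `TWO_OCTET` are illustrative only.

```python
from math import prod

# Octet classes of IUTF, copied from the classification in the mail.
CLASSES = {
    "A'": [*range(33, 47), *range(48, 127)],  # 33~46, 48~126 (no 47)
    "Tx": range(128, 192),
    "T1": range(192, 224),
    "T2": range(224, 240),
    "S2": range(224, 230),
    "S3": range(230, 236),
    "S4": range(236, 238),
    "S6": range(238, 239),
    "S7": range(239, 240),
    "U1": range(240, 256),
}


def count(sequence: str) -> int:
    """Number of distinct octet sequences matching e.g. "S2 Tx A'"."""
    return prod(len(CLASSES[name]) for name in sequence.split())


# The seven extra two octet forms listed in the mail.
TWO_OCTET = ["T1 A'", "T2 A'", "U1 A'", "U1 Tx", "T1 T2", "T1 U1", "U1 T2"]
two_octet_total = sum(count(s) for s in TWO_OCTET)

if __name__ == "__main__":
    for seq in TWO_OCTET + ["S2 Tx A'", "S3 Tx A' Tx"]:
        print(f"{seq:<14} {count(seq)}")
    print("two octet total:", two_octet_total)
```

The sum over `TWO_OCTET` reproduces the 8256 figure, and `count("S3 Tx A' Tx")` exceeds 2^21, which is what lets every 21 bit ICODE value fit in the four octet form.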
Received on Monday, 19 July 1993 06:48:45 UTC