RE: Thoughts about characters transmission from Masataka Ohta on 1993-07-19 (ietf-charsets@w3.org from July to September 1993)

From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
Date: Mon, 19 Jul 1993 22:44:33 +0900 (JST)
To: Guido.van.Rossum@cwi.nl
Cc: ietf-charsets@INNOSOFT.COM
Message-id: <9307191344.AA05382@necom830.cc.titech.ac.jp>
> You've been referring to this encoding in the working group also.  Do
> you have a description of how it works?  All I need is enough
> information to be able to implement it --
 
OK.
 
> no arguments about why it
> is better than UTF-2 are necessary (I can figure that out for myself).
 
But, to me, it is easier to figure out IUTF than to add the argument. :-)
 
> A pointer to an ftp site or WWW or Gopher server would be fine too.
 
It's not so lengthy even with the argument. So, it is attached at the
end of this mail.
 
BTW, I now think that, if we are to use almost raw UTF2 as an interim encoding
without enough consideration for the many languages with non-European characters,
we should not use the two octet UTF2 sequences beginning with T1. That is,
represent all non-ASCII characters with the three octet form of UTF2. Then,
the two octet sequences are reserved for future international
assignment. Isn't it fair?
 
						Masataka Ohta
 
     IUTF (Internationalized UTF) is an interchange form for
ICODE that is compatible with UTF2 (UCS Transformation Format 2).
 
     UTF2 is an ASCII compatible, variable length, multi octet
interchange form for ISO 10646 proposed by X/Open.
 
     UTF2 is designed considering
 
1)   compatibility with the UNIX file system
 
2)   compatibility with existing programs
 
3)   easy conversion between UTF2 and ISO 10646
 
4)   that code length can be determined by the first octet
 
5)   that code length is short
 
6)   finite resynchronizability
 
     In UTF2, an octet is classified as
 
        C0:0~32,127
        A :33~126
        Tx:128~191
        T1:192~223
        T2:224~239
        T3:240~247
        T4:248~251
        T5:252~253
        Ty:254~255(unused)
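
As a rough illustration, the classification above could be written as a
lookup like the following. This is only a sketch: the names UTF2_CLASSES
and utf2_class are hypothetical, and the ranges are copied from the list.

```python
# Sketch of the UTF2 octet classes listed above.
# UTF2_CLASSES and utf2_class are hypothetical names, not part of any spec.
UTF2_CLASSES = [
    ("C0", range(0, 33)),     # 0~32
    ("A",  range(33, 127)),   # 33~126
    ("C0", range(127, 128)),  # 127
    ("Tx", range(128, 192)),  # trailing octets
    ("T1", range(192, 224)),  # starts a two octet sequence
    ("T2", range(224, 240)),  # starts a three octet sequence
    ("T3", range(240, 248)),
    ("T4", range(248, 252)),
    ("T5", range(252, 254)),
    ("Ty", range(254, 256)),  # unused
]


def utf2_class(octet):
    """Return the class name of a single octet value (0~255)."""
    if not 0 <= octet <= 255:
        raise ValueError("not an octet: %r" % (octet,))
    for name, octets in UTF2_CLASSES:
        if octet in octets:
            return name
```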
 
 
     Then, the following combinations of octets
 
        Octet Sequence     code of ISO 10646
        C0                 0~32,127
        A                  33~126
        T1 Tx              128~2047
        T2 Tx Tx           2048~2^16-1
        T3 Tx Tx Tx        2^16~2^21-1
        T4 Tx Tx Tx Tx     2^21~2^26-1
        T5 Tx Tx Tx Tx Tx  2^26~2^31-1
 
are used to represent characters in ISO 10646.  Resynchronization
of character boundaries is possible by scanning at
most 6 octets.
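
The table and the resynchronization property can be sketched as below.
This is a minimal sketch and it rests on an assumption that this text
does not spell out: that the bits are laid out as in X/Open's FSS-UTF,
with the first octet carrying the high bits and each Tx octet carrying
six bits. The names utf2_encode, utf2_length and resync are hypothetical.

```python
# Assumed layout: (largest code, first-octet marker, number of Tx octets).
FORMS = [
    (0x7F,       0x00, 0),  # C0 / A
    (0x7FF,      0xC0, 1),  # T1 Tx
    (0xFFFF,     0xE0, 2),  # T2 Tx Tx
    (0x1FFFFF,   0xF0, 3),  # T3 Tx Tx Tx
    (0x3FFFFFF,  0xF8, 4),  # T4 Tx Tx Tx Tx
    (0x7FFFFFFF, 0xFC, 5),  # T5 Tx Tx Tx Tx Tx
]


def utf2_encode(code):
    """Encode one ISO 10646 code using the shortest form in the table."""
    if code < 0:
        raise ValueError("negative code")
    for limit, marker, ntx in FORMS:
        if code <= limit:
            first = marker | (code >> (6 * ntx))
            tail = [0x80 | ((code >> (6 * i)) & 0x3F)
                    for i in range(ntx - 1, -1, -1)]
            return bytes([first] + tail)
    raise ValueError("code exceeds 2^31-1")


def utf2_length(first):
    """Sequence length, determined by the first octet alone."""
    if first < 0x80:
        return 1
    if 0xC0 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF7:
        return 4
    if 0xF8 <= first <= 0xFB:
        return 5
    if 0xFC <= first <= 0xFD:
        return 6
    raise ValueError("Tx or Ty octet cannot start a sequence")


def resync(data, pos):
    """Index of the first character boundary at or after pos.

    Only Tx octets (128~191) are skipped, and a well-formed sequence
    has at most five of them, so at most 6 octets are examined.
    """
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos
```

For example, starting in the middle of a three octet sequence,
resync(utf2_encode(0x3042) + b"A", 1) skips the two remaining Tx octets
and stops at the "A".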
 
     Note that, with UTF2, all the characters of major European
languages can be represented in two octets and all the
existing characters of ISO 10646 can be represented in three
octets.
 
     So, IUTF is designed considering
 
0)   compatibility with UTF2
 
1)   compatibility with the UNIX file system
 
2)   compatibility with existing programs as interchange code
 
3)   fast conversion between IUTF and ISO 10646
 
4)   that code length can be  determined  without  looking
     ahead extra octets
 
5)   that code length is short
 
6)   finite resynchronizability
 
that is, IUTF is upward compatible with UTF2 both in its format
and in its design policy.  Note that 2) is a rather meaningless
condition, as a processing code (ICODE, not IUTF, in this case)
is used inside existing programs, which is also the processing
model of multibyte/wide characters in ANSI C and X/Open.
 
     In IUTF, an octet is classified as
 
        C0:0~32,127
        A :33~126
        A':33~46,48~126
        C1:128~159
        Tx:128~191
        T1:192~223
        T2:224~239(=S2+S3+S4+S6+S7)
        S2:224~229
        S3:230~235
        S4:236~237
        S6:238
        S7:239
        U1:240~255
 
Then, the following combinations of octets
 
        Octet Sequence     code of ISO 10646
        C0                 0~32,127
        A                  33~126
        T1 Tx              128~2047
        T2 Tx Tx           2048~65535
 
are used to represent characters in  UTF2.   Thus,  IUTF  is
compatible  to  UTF2.   Then,  the following combinations of
octets are available to represent extra characters.
 
        Octet Sequence          number of code points represented
        T1 A'                   2976
        T2 A'                   1488
        U1 A'                   1488
        U1 Tx                   1024
        T1 T2                   512
        T1 U1                   512
        U1 T2                   256
        S2 Tx A'                35712
        S3 Tx A' Tx             >2^21
        S4 Tx A' Tx Tx          >2^25
        S6 Tx A' Tx Tx Tx Tx    >2^36
        S7 Tx A' Tx Tx Tx Tx Tx >2^42
 
Thus, all the characters in 21 bit ICODE can be represented
in four octet form by a sequence beginning with S3.
Resynchronization of character boundaries is possible by
scanning at most 8 octets.
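
The counts in the table of extra sequences follow from the sizes of the
octet classes. A short sketch to check them, where the names CLASSES and
count are hypothetical and the ranges are taken from the IUTF
classification above:

```python
from math import prod

# Octet classes used by the extra sequences, as listed in the
# classification above. CLASSES and count are hypothetical names.
CLASSES = {
    "A'": list(range(33, 47)) + list(range(48, 127)),  # 33~46,48~126
    "Tx": range(128, 192),
    "T1": range(192, 224),
    "T2": range(224, 240),
    "S2": range(224, 230),
    "S3": range(230, 236),
    "S4": range(236, 238),
    "S6": range(238, 239),
    "S7": range(239, 240),
    "U1": range(240, 256),
}


def count(sequence):
    """Number of distinct octet strings matching e.g. "S2 Tx A'"."""
    return prod(len(CLASSES[name]) for name in sequence.split())


TWO_OCTET = ["T1 A'", "T2 A'", "U1 A'", "U1 Tx", "T1 T2", "T1 U1", "U1 T2"]

if __name__ == "__main__":
    for seq in TWO_OCTET + ["S2 Tx A'", "S3 Tx A' Tx"]:
        print(seq, count(seq))
    print("two octet total", sum(count(s) for s in TWO_OCTET))
```

With these ranges, A' has 93 members, so T1 A' gives 32 x 93 = 2976 and
S3 Tx A' Tx gives 6 x 64 x 93 x 64 = 2285568, which exceeds 2^21.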
 
     IUTF thus has an extra 8256 (= 2976 + 1488 + 1488 + 1024 +
512 + 512 + 256) two octet representations and 35712 three
octet representations, which can be used as shorthand
notations for characters such as frequently used non-European
characters.  The actual assignment is not yet determined.
Hash tables could be used for fast translation from
ICODE to IUTF for such shorthand notations.
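
One possible shape for such a hash table is sketched below. Since the
actual assignment is not determined, everything here is an assumption
made for illustration: the codes are arbitrary integers, the T1 A'
sequences are simply handed out in order, and the names
build_shorthand_table and to_iutf are hypothetical.

```python
from itertools import product

A_PRIME = [o for o in range(33, 127) if o != 47]  # A': 33~46,48~126
T1 = range(192, 224)


def build_shorthand_table(frequent_codes):
    """Map each code to a T1 A' two octet sequence, in order.

    Hypothetical assignment: only an illustration of the hash table
    idea, not a proposed allocation.
    """
    codes = list(frequent_codes)
    if len(set(codes)) != len(codes):
        raise ValueError("duplicate code")
    if len(codes) > len(T1) * len(A_PRIME):  # 2976 sequences available
        raise ValueError("too many codes for the T1 A' sequences")
    sequences = (bytes(pair) for pair in product(T1, A_PRIME))
    return dict(zip(codes, sequences))


def to_iutf(code, table, fallback):
    """Use the shorthand if one is assigned, else the ordinary form.

    fallback is any function from a code to its ordinary octet
    sequence; it is left to the caller in this sketch.
    """
    short = table.get(code)
    return short if short is not None else fallback(code)
```

The reverse direction (IUTF to ICODE) could use the inverted dictionary,
keyed by the two octet sequence.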

Received on Monday, 19 July 1993 06:48:45 UTC