- From: <lwj@cs.kun.nl>
- Date: Mon, 19 Jul 1993 22:42:21 +0200
- To: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
- Cc: Guido.van.Rossum@cwi.nl, ietf-charsets@INNOSOFT.COM
Although I have no intention of starting a religious war, I would like to
point out some technical difficulties with Ohta's proposed IUTF. Please
refer to my earlier posting about NET-TEXT for an alternative proposal.

> BTW, I now think that, if we are to use almost raw UTF2 as interim encoding
> without enough consideration to many languages with non-European characters,
> we should not use two octet UTF2 sequence beginning from T1. That is,
> represent all non-ASCII characters with three octet form of UTF2. Then,
> the two octet sequences are reserved for the future international
> assignment. Isn't it fair?

UTF-2 (as used by Plan 9; I don't have an X/Open reference) requires that
the *shortest* sequence be used (although programs may not check it), so
this would make your coding incompatible with UTF-2.

I find your UTF-2 table

> C0:0~32,127
> A :33~126
> Tx:128~191
> T1:192~223
> T2:224~239
> T3:240~247
> T4:248~251
> T5:252~253
> Ty:254~255(unused)

a bit strange. Consider that T5 = 1111110x contributes one bit and the
five following Tx bytes contribute only 30 bits, 31 bits in all: there is
no way to represent codes >= 2^31 (or maybe these don't occur in ISO 10646;
please enlighten me if this is the case).

Your IUTF table was

> C0:0~32,127
> A':33~46,48~126
> C1:128~159
> Tx:128~191
> T1:192~223
> T2:224~239(=S2+S3+S4+S6+S7)
> S2:224~229
> S3:230~235
> S4:236~237
> S6:238
> S7:239
> U1:240~255

I don't see the reason for introducing A'; could you explain, please?

You proposed the extra sequences

> T1 A'                     2976
> T2 A'                     1488
> U1 A'                     1488
> U1 Tx                     1024
> T1 T2                     512
> T1 U1                     512
> U1 T2                     256
> S2 Tx A'                  35712
> S3 Tx A' Tx               >2^21
> S4 Tx A' Tx Tx            >2^25
> S6 Tx A' Tx Tx Tx Tx      >2^36
> S7 Tx A' Tx Tx Tx Tx Tx   >2^42

These sequences destroy the resynchronisation property. Consider what
happens if you hit an internal non-Tx byte: how would you know that it
was internal? E.g. consider

	T1 A'	and	T1 T2 A'

The "intended" parsing is

	[T1 A']	and	[T1 T2] [A']

but you could also parse them as

	... T1] [A']	and	... T1] [T2 A']	and	... T1 T2] [A']

> Hash tables could be used for the fast translation from
> ICODE to IUTF for such shorthand notations.

This seems a bit too complex for the purpose.

--
Luc Rooijakkers                 Internet: lwj@cs.kun.nl
SPC Group, the Netherlands      UUCP: uunet!cs.kun.nl!lwj
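As a rough illustration of the two points above (the 31-bit limit of the quoted UTF-2 table, and resynchronisation), here is a minimal Python sketch. The byte ranges are taken from the quoted UTF-2 table; the names `LEAD_CLASSES`, `payload_bits`, `is_tx` and `resync` are hypothetical and exist only for this example. It assumes that in plain UTF-2 every internal byte is a Tx byte, which is what makes resynchronisation possible.

```python
# Lead-byte classes from the quoted UTF-2 table:
# (name, first byte value, last byte value, number of trailing Tx bytes)
LEAD_CLASSES = [
    ("T1", 192, 223, 1),
    ("T2", 224, 239, 2),
    ("T3", 240, 247, 3),
    ("T4", 248, 251, 4),
    ("T5", 252, 253, 5),
]


def payload_bits(first, last, trailing):
    """Bits carried by one sequence: free bits in the lead byte
    plus 6 bits for each trailing Tx byte (10xxxxxx)."""
    lead_values = last - first + 1              # always a power of two here
    lead_bits = lead_values.bit_length() - 1    # log2 of that power of two
    return lead_bits + 6 * trailing


def is_tx(byte):
    """True for a Tx (continuation) byte, 128..191."""
    return 128 <= byte <= 191


def resync(data, pos):
    """Back up from an arbitrary position to the start of the sequence
    containing it. This works only because every internal byte is a Tx
    byte, so any non-Tx byte must begin a sequence."""
    while pos > 0 and is_tx(data[pos]):
        pos -= 1
    return pos
```

With these definitions, `payload_bits(252, 253, 5)` gives 31, so the largest representable code is 2^31 - 1. For the byte string `bytes([0x41, 0xE2, 0x82, 0xAC, 0x42])`, `resync(data, 3)` backs up to index 1, the T2 lead byte. Under IUTF, a sequence such as T1 A' (e.g. `bytes([0xC3, 0x41])`) defeats this rule: `resync` stops at index 1 and treats the A' byte as a sequence start, because nothing in the byte itself marks it as internal.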
Received on Tuesday, 20 July 1993 16:40:55 UTC