- From: Kenneth Whistler <kenw@sybase.com>
- Date: Thu, 16 Dec 1999 14:20:32 -0800 (PST)
- To: Harald@Alvestrand.no
- Cc: ietf-charsets@iana.org, kenw@sybase.com, mark.davis@us.ibm.com
Harald,

> At 10:25 16.12.99 -0800, Kenneth Whistler wrote:
> >
> > > - Inability to represent characters outside Planes 0-16
> >
> >WG2 and UTC are converging on a point of view that characters
> >outside of Planes 0-16 should *never* be assigned. This may be
> >formally written into 10646. The rationale here is that nearly
> >all 10646 implementations are following the Unicode Standard, by
> >necessity, to achieve interoperability in areas that are left
> >unspecified by 10646. Formalizing this convergence by constraining
> >the code space range that could ever be assigned standard characters
> >would close down this nagging issue of incompatibility between
> >the Unicode Standard and 10646. In that case, UTF-8, UTF-16, and
> >UTF-32 would *all* have the exact same representational capability,
> >and would all be completely interconvertible forms.
>
> See http://www.unicode.org/pending/pending.html
> It's entirely possible that all commonly used scripts will be encoded in
> Plane 0 (if those who fight for traditional Chinese and more precomposed
> characters give up), but I don't think it's likely that ISO will abandon
> Plane 1.
>

I think we may be talking at cross-purposes here. *I* am responsible
for maintaining the content of http://www.unicode.org/pending/pending.html,
by the way.

UTC and ISO/JTC1/SC2/WG2 *both* are committed to allowing encoding
in Planes 0..16. The voting on ISO/IEC 10646-2 is already underway,
with encoded characters in Plane 1 (Etruscan, Gothic, Deseret,
Byzantine and Western musical symbols, mathematical alphanumeric
symbols) and in Plane 2 (>47000 more Han characters for Vertical
Extension B) and in Plane 14 (language tag characters). Other than a
few technical details here and there that may change, it is quite
likely that the entire collection will pass ballot and become part
of 10646 (and shortly thereafter, Unicode 3.1 or 4.0).
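The quoted point about UTF-8, UTF-16, and UTF-32 covering the same range once assignment stops at Plane 16 can be checked with a short Python sketch. The helper names (`utf16_pair_to_code_point`, `round_trips`) and the sample code points are illustrative assumptions, not anything from the standards text; the surrogate arithmetic is the usual UTF-16 decoding formula.

```python
# Illustrative sketch only; function names and sample values are assumptions.
MAX_CODE_POINT = 0x10FFFF  # last code point of Plane 16


def utf16_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible surrogate pair lands exactly on the end of Plane 16,
# which is why UTF-16 cannot reach beyond Planes 0..16.
largest_utf16 = utf16_pair_to_code_point(0xDBFF, 0xDFFF)


def round_trips(code_point: int) -> bool:
    """True if the code point survives encoding and decoding in all three forms."""
    ch = chr(code_point)
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return all(ch.encode(enc).decode(enc) == ch for enc in encodings)


# One sample each from Planes 0, 1, 2, 14, and 16 (arbitrary choices).
samples = [0x0041, 0x10300, 0x20000, 0xE0001, MAX_CODE_POINT]
all_round_trip = all(round_trips(cp) for cp in samples)
```

Here `largest_utf16` equals `MAX_CODE_POINT`, and every sample converts losslessly between the three forms.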
What Planes 0..16 give us are:

Plane 0: BMP
     49194 characters assigned
      6400 private use
      2048 surrogate codes
        65 control codes
         2 not characters
      7827 still assignable code points
Plane 1: SMP
     ~1600 characters (alphabets & symbols) under ballot
    ~64000 still assignable code points
Plane 2: SIP
    ~47000 Han characters (Vertical Extension B) under ballot
    ~18500 still assignable code points
Plane 3: SIP2
     65534 still assignable code points (probably for more Han)
Planes 4..13
    655340 still assignable code points
Plane 14: SPP
        97 language tag characters under ballot
     65437 still assignable code points
Planes 15..16
    131068 private use

As François pointed out, we are running out of characters to encode.
Even the IRG, busily culling the vast history of Han character usage
throughout East Asia, has put the bulk of its remaining backlog into
ballot for Plane 2 (the 47000+ for Vertical Extension B). After that,
Han will come in relative dribs and drabs and get more and more obscure.

The standards committees are already getting deadlocked about how to
proceed with the historic scripts -- particularly the relatively large
ones like hieroglyphics -- so after the publication of 10646-2, the
script additions will slow down dramatically. It will take another
decade at least to fill Plane 1 with the candidates we already know
for encoding (and that includes *all* of the big hieroglyphic dead
scripts like Egyptian, Sumerian, Hittite, Mayan, and Indus Valley).

So that big gap in Planes 4..13 looms unused and effectively unusable.
Nobody in the professional character encoding community has any
candidates to put there that would really count as characters. There
are various bizarre schemes that could eat numbers, but as long as
10646 and the Unicode Standard remain *character* encoding standards,
it is quite likely that those 10 planes will simply be held in reserve.
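The exact figures in the tally above can be cross-checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the reading of 65534 as "one plane minus the two reserved code points ending in FFFE and FFFF" is an assumption that happens to match the numbers given.

```python
# Illustrative cross-check of the plane tally; all names are assumptions.
PLANE_SIZE = 0x10000  # 65536 code points per plane

# Plane 0 (BMP) breakdown as listed in the tally.
bmp = {
    "characters assigned": 49194,
    "private use": 6400,
    "surrogate codes": 2048,
    "control codes": 65,
    "not characters": 2,
    "still assignable": 7827,
}
bmp_total = sum(bmp.values())  # 65536, i.e. exactly one full plane

# Assumed reading: each supplementary plane excludes its last two
# code points (xxFFFE and xxFFFF), leaving 65534 usable positions.
USABLE_PER_PLANE = PLANE_SIZE - 2

plane_14_total = 97 + 65437                  # 65534
planes_4_to_13 = 10 * USABLE_PER_PLANE       # 655340
planes_15_to_16 = 2 * USABLE_PER_PLANE       # 131068

# Whole code space for Planes 0..16 and the bits needed to index it.
total_code_points = 17 * PLANE_SIZE                  # 1114112
bits_needed = (total_code_points - 1).bit_length()   # 21
```

The Plane 0 entries sum to exactly 65536, and the largest code point in Planes 0..16 (0x10FFFF) fits in 21 bits.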
This is enough engineering slack on the character encoding to last
through this upcoming century easily, even if the World Congress on
Universal Orthography decides to invent and impose a new world
orthography each and every decade. :-)

That is why the UTC and WG2 just want to formally close the books on
this. 16 bits turned out not to be enough for everything that somebody
wanted to encode as characters. But all projections are that 21 bits
*is* enough, and we can hold the line there. Nobody needs 31 bits for
character encoding.

--Ken
Received on Thursday, 16 December 1999 17:23:49 UTC