RE: Fwd: Last Call: UTF-16, an encoding of ISO 10646 to Proposed from Kenneth Whistler on 1999-12-16 (ietf-charsets@w3.org from October to December 1999)

From: Kenneth Whistler <kenw@sybase.com>
Date: Thu, 16 Dec 1999 14:20:32 -0800 (PST)
To: Harald@Alvestrand.no
Cc: ietf-charsets@iana.org, kenw@sybase.com, mark.davis@us.ibm.com
Message-id: <199912162220.OAA00564@birdie.sybase.com>

Harald,

> At 10:25 16.12.99 -0800, Kenneth Whistler wrote:
> 
> 
> > > - Inability to represent characters outside Planes 0-16
> >
> >WG2 and UTC are converging on a point of view that characters
> >outside of Planes 0-16 should *never* be assigned. This may be
> >formally written into 10646. The rationale here is that nearly
> >all 10646 implementations are following the Unicode Standard, by
> >necessity, to achieve interoperability in areas that are left
> >unspecified by 10646. Formalizing this convergence by constraining
> >the code space range that could ever be assigned standard characters
> >would close down this nagging issue of incompatibility between
> >the Unicode Standard and 10646. In that case, UTF-8, UTF-16, and
> >UTF-32 would *all* have the exact same representational capability,
> >and would all be completely interconvertible forms.
> 
> See http://www.unicode.org/pending/pending.html
> It's entirely possible that all commonly used scripts will be encoded in 
> Plane 0 (if those who fight for traditional Chinese and more precomposed 
> characters give up), but I don't think it's likely that ISO will abandon 
> Plane 1.
> 

I think we may be talking at cross-purposes here. *I* am responsible
for maintaining the content of http://www.unicode.org/pending/pending.html,
by the way.

UTC and ISO/JTC1/SC2/WG2 *both* are committed to allowing encoding in 
Planes 0..16.

The voting on ISO/IEC 10646-2 is already underway, with encoded characters
in Plane 1 (Etruscan, Gothic, Deseret, Byzantine and Western musical
symbols, mathematical alphanumeric symbols) and in Plane 2 (>47000 more
Han characters for Vertical Extension B) and in Plane 14 (language tag
characters). Other than a few technical details here and there that may
change, it is quite likely that the entire collection will pass ballot
and become part of 10646 (and shortly thereafter, Unicode 3.1 or 4.0).

What Planes 0..16 give us is:

Plane 0: BMP   49194 characters assigned
                6400 private use
                2048 surrogate codes
                  65 control codes
                   2 not characters
                7827 still assignable code points

Plane 1: SMP   ~1600 characters (alphabets & symbols) under ballot
              ~64000 still assignable code points

Plane 2: SIP  ~47000 Han characters (Vertical Extension B) under ballot
              ~18500 still assignable code points

Plane 3: SIP2  65534 still assignable code points (probably for more Han)

Planes 4..13  655340 still assignable code points

Plane 14: SPP     97 language tag characters under ballot
               65437 still assignable code points

Planes 15..16 131068 private use
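
As a rough cross-check of the arithmetic in the table above, here is a
minimal Python sketch. The variable names are illustrative only, and it
assumes (as the 65534 figures suggest) that the last two code points of
every plane are set aside as "not characters". The approximate Plane 1
and Plane 2 figures are left out because they are only estimates.

```python
# Sanity check of the per-plane counts quoted in the table.
PLANE_SIZE = 0x10000        # 65536 code points per plane
NOT_CHARS_PER_PLANE = 2     # assumption: xxFFFE and xxFFFF in each plane

# Plane 0 (BMP) breakdown, copied from the table.
bmp = {
    "assigned": 49194,
    "private_use": 6400,
    "surrogates": 2048,
    "controls": 65,
    "not_characters": 2,
    "assignable": 7827,
}
bmp_total = sum(bmp.values())

# Code points per plane once the two "not characters" are excluded.
usable_per_plane = PLANE_SIZE - NOT_CHARS_PER_PLANE

planes_4_to_13 = 10 * usable_per_plane    # ten wholly unassigned planes
plane_14 = 97 + 65437                     # language tags + still assignable
planes_15_to_16 = 2 * usable_per_plane    # two private use planes

print("Plane 0 total:  ", bmp_total)
print("Usable per plane:", usable_per_plane)
print("Planes 4..13:   ", planes_4_to_13)
print("Plane 14:       ", plane_14)
print("Planes 15..16:  ", planes_15_to_16)
```

Under that assumption the Plane 0 rows add up to exactly one full plane,
and the other rows match whole multiples of the per-plane figure.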

As François pointed out, we are running out of characters to encode.
Even the IRG, busily culling the vast history of Han character
usage throughout East Asia, has put the bulk of its remaining backlog
into ballot for Plane 2 (the 47000+ for Vertical Extension B). After
that, Han will come in relative dribs and drabs and get more and more
obscure. The standards committees are already getting deadlocked about
how to proceed with the historic scripts -- particularly the relatively
large ones like hieroglyphics -- so after the publication of 10646-2,
the script additions will slow down dramatically. It will take another
decade at least to fill Plane 1 with the candidates we already know
for encoding (and that includes *all* of the big hieroglyphic dead
scripts like Egyptian, Sumerian, Hittite, Mayan, and Indus Valley).

So that big gap in planes 4..13 looms unused and effectively unusable.
Nobody in the professional character encoding community has any
candidates to put there that would really count as characters. There
are various bizarre schemes that could eat numbers, but as long as
10646 and the Unicode Standard remain *character* encoding standards,
it is quite likely that those 10 planes will simply be held in
reserve. This is enough engineering slack on the character encoding
to last through this upcoming century easily, even if the World
Congress on Universal Orthography decides to invent and impose a
new world orthography each and every decade. :-) 

That is why the UTC and WG2 just want to formally close the books
on this. 16 bits turned out not to be enough for everything that
somebody wanted to encode as characters. But all projections are
that 21 bits *is* enough, and we can hold the line there. Nobody
needs 31 bits for character encoding.
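
To make the 21-bit figure concrete, here is a small Python sketch (names
are illustrative, not from any standard). It assumes the code space is
capped at Planes 0..16, works out how many bits the last code point
needs, checks that the 2048 surrogate codes split into 1024 x 1024 pairs
covering exactly the 16 planes beyond the BMP, and round-trips sample
code points through Python's built-in UTF-8, UTF-16, and UTF-32 codecs.

```python
# Code space arithmetic for Planes 0..16.
PLANE_SIZE = 0x10000
PLANES = 17                                   # Planes 0..16

last_code_point = PLANES * PLANE_SIZE - 1     # highest code point in Plane 16
bits_needed = last_code_point.bit_length()    # bits required to hold it

# The 2048 surrogate codes: 1024 high (D800..DBFF), 1024 low (DC00..DFFF).
high_surrogates = 0xDC00 - 0xD800
low_surrogates = 0xE000 - 0xDC00
pair_count = high_surrogates * low_surrogates

# Code points in the 16 planes beyond the BMP.
supplementary_count = (PLANES - 1) * PLANE_SIZE


def round_trips(code_point):
    """Return True if the code point survives UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return all(ch.encode(enc).decode(enc) == ch for enc in encodings)


print(hex(last_code_point), bits_needed)
print(pair_count == supplementary_count)
print(round_trips(0x10000), round_trips(last_code_point))
```

With the range capped this way, every code point that one of the three
forms can carry can also be carried by the other two, which is the
interconvertibility point made earlier in the thread.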

--Ken
Received on Thursday, 16 December 1999 17:23:49 UTC