I18N issue needs consideration

Over the last several months, I have had contact with several members of the
I18n and more specifically web-i18n community, who have pointed out a
potential problem with the latest draft of XML-lang.  I have discussed this
with Michael, and we are not fully in agreement as to how this should move
forward; I take this as prima facie evidence that there is an issue of policy
here that needs input from this group.  Jon has asked me to raise this here
and hopefully we can sort it out in the next rev of the spec.

Right now, the spec references both Unicode 2.0 and ISO 10646.  These
each define 30-thousand-odd characters.  They are the same characters,
and they have the same encoding.  This is good.  The XML spec says that
characters are from this set, which is fine.  The spec is rather vague
about what the processor ought to pass the app character-wise: an
initial reading would suggest that 16-bit chars are the norm, but a
careful reading reveals a couple of places where we clearly envision
characters up to 31 bits wide.

This is material, because the 30-odd-K characters that Unicode/ISO now
have do not include all the Chinese characters there are or ever have
been (although they do include all of those that are typically available
on computer systems).  The Chinese folks have several tens of thousands
more getting queued up for addition.  Also incoming are some dead scripts
such as Aztec and Maya, and (I have heard) Tolkien-Elvish and Klingon.

One way or another, this is going to spill over the 64K limit.  And
unfortunately, once you get past 64K, Unicode and ISO are no longer
in a state of happy unity.  The issue of policy we have to decide
is... in the spec, should we:

a) leave it carefully vague as to what should be passed
b) line up with the Unicode camp 
c) line up with the ISO camp

Now here's a problem.  I'm not sure it would be appropriate for me,
in this forum, to explain what these options are and what they
mean.  Anybody who wants to pitch in on this issue should really
Really *REALLY* go and pick up the Unicode 2.0 standard and read it.
It is kind of expensive, but an all-around good piece of work that
is a pleasure to read.  

Having said this, the following is a vastly oversimplified summary
of the ISO & Unicode world-views, provided only as a teaser to 
motivate you to go and read up:

Unicode says that characters should always be passed around in 16-bit
chunks.  It reserves two blocks of 1024 code values each, called "high
surrogate" and "low surrogate", that will never be used for other
purposes.  Characters that extend past the Basic Multilingual Plane (the
basic 64K 16-bit chars) are given in two 16-bit chunks, the first of
which must come from the high surrogate block, the second from the low
surrogate block.  This gets you about a million extra characters,
organized in 16 planes of 64K chars.  The encoding is completely
unambiguous: you can look at any 16-bit quantity and know whether it is
half of a 32-bit character.  A system that doesn't know this stuff and
gets one of these would display it as two blobs on the screen.  A system
that knew the basic scheme, but not the actual 32-bit char, would display
it as one blob.  A system that knew the big character could actually
display it.
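
For anyone who wants to see the arithmetic, here is a minimal sketch of
the surrogate scheme as I understand it from Unicode 2.0: the leading
chunk comes from the high surrogate block (0xD800-0xDBFF) and the
trailing chunk from the low surrogate block (0xDC00-0xDFFF).  The
function names are my own, purely illustrative, and not taken from any
spec or library.

```python
# Illustrative sketch only: the function names are hypothetical.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # leading (high) surrogate block
LOW_START, LOW_END = 0xDC00, 0xDFFF     # trailing (low) surrogate block
BMP_LIMIT = 0x10000                     # first value beyond the 16-bit BMP
MAX_VALUE = 0x10FFFF                    # last value reachable via surrogates


def classify(unit):
    """Say what a single 16-bit chunk is, looking at that chunk alone."""
    if HIGH_START <= unit <= HIGH_END:
        return "high"
    if LOW_START <= unit <= LOW_END:
        return "low"
    return "bmp"


def to_surrogates(value):
    """Split a character value beyond the BMP into (high, low) chunks."""
    if not BMP_LIMIT <= value <= MAX_VALUE:
        raise ValueError("value is not in the surrogate-encoded range")
    offset = value - BMP_LIMIT          # a 20-bit number
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) pair into one character value."""
    if classify(high) != "high" or classify(low) != "low":
        raise ValueError("not a well-formed surrogate pair")
    return BMP_LIMIT + ((high - HIGH_START) << 10) + (low - LOW_START)
```

Since each block holds 1024 values, a pair carries 1024 * 1024 = 1,048,576
combinations, which is where the 16 planes of 64K come from.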

On the ISO side (but I'm not the right person to explain this for
reasons that will become clear below) the preference is for a flat
31-bit character address space.  There are a variety of reasons
for this; the one that speaks most clearly to me is based on history:
we thought 16-bit computers were enough, then we thought 32-bit
computers were enough, let's not do this to ourselves again.

==============================================================

Having said all that, I will abandon the relatively even-handed
tone and say that I think we ought simply to line up with Unicode.
This will have the concrete effect that XML processors will be
required always to pass 16-bit chunks to applications.  By the
way, this is how Java works, and in a very hard-coded way.  The
encoding scheme is entirely without ambiguity.  I have no sympathy
for the ISO claim that the 31-bit version is more fixed-width in
any meaningful sense, since Unicode is full of combining characters
anyhow.
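
To illustrate the combining-character point, here is a small sketch
using Python's standard codecs and unicodedata module.  The sample
string is my own choice.  It shows that an accented letter written as
base letter plus combining accent takes two units even when every unit
is 32 bits wide, so wide units alone do not give you one unit per
displayed character.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: one displayed character, two code values.
decomposed = "e\u0301"

# Count storage units in a 32-bit-wide and a 16-bit-wide encoding.
units_32 = len(decomposed.encode("utf-32-be")) // 4
units_16 = len(decomposed.encode("utf-16-be")) // 2

# A non-zero combining class marks a character that attaches to its predecessor.
is_combining = [unicodedata.combining(ch) != 0 for ch in decomposed]

print(units_32, units_16, is_combining)
```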

Also, philosophically, once you get outside the 16-bit BMP, you
are no longer dealing with characters that are routinely available
in any computer text-processing system anywhere in the world.
Forcing ourselves to use 31 bits, and thus wasting 50% of character
buffer storage in 99.999999% of all cases, seems entirely out of
the spirit of XML.
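
The 50% figure can be checked with a rough sketch like the following.
It assumes that a 31-bit character is stored in a 4-byte unit, and the
helper name and sample strings are hypothetical.

```python
def wasted_fraction(text):
    """Fraction of a 32-bit-per-character buffer that 16-bit chunks would not need."""
    bytes_16 = len(text.encode("utf-16-be"))   # 2 bytes per chunk, 4 per surrogate pair
    bytes_32 = len(text.encode("utf-32-be"))   # assumption: 4 bytes per character
    return 1 - bytes_16 / bytes_32


bmp_only = "XML is a subset of SGML"      # every character fits in the BMP
beyond_bmp = "\U00010000"                 # a single value just past the BMP

print(wasted_fraction(bmp_only))          # BMP-only text needs half the storage
print(wasted_fraction(beyond_bmp))        # a surrogate pair costs the same 4 bytes
```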

 - Tim

Received on Wednesday, 11 June 1997 18:53:13 UTC