Re: I18N issue needs consideration

At 00:51 12/06/97 -0700, Tim Bray wrote:
>The spec is rather vague
>about what the processor ought to pass the app character-wise;

It should be able to pass any representation of the character it finds
convenient.

> an 
>initial reading would suggest that 16-bit chars are the norm, a careful
>reading reveals a couple of places where we clearly envision characters
>up to 31 bits wide.

It is passing characters to the application.  We shouldn't care how they are
represented.  You don't have to represent characters by an integer.  In
general the right choice will depend on the language being used.  If the
language has built-in support for a 16-bit representation it may well be
convenient to use that.  If it has built-in support for a 32-bit
representation it may well be convenient to use that.

>ISO [you mean Unicode] says that characters should always be passed
>around in 16-bit
>chunks.  It reserves two blocks of 1024 chars each that will never
>be used for other purposes called "low surrogate" and "high surrogate".
>For characters that extend past the Basic Multilingual Plane (the 
>basic 64K 16-bit chars) they are given in two 16-bit chunks, the
>first of which must come from the low surrogate block, the second
>from the high surrogate block.  This gets you about a million extra
>characters, organized in 16 planes of 64K chars.  The encoding is
>completely unambiguous, you can look at any 16-bit quantity and
>if it's half of a 32-bit character, you know. 

Isn't that the same as UTF-16 which is in Amendment 1 to ISO/IEC 10646? The
way I look at it is that it's a variable-length encoding in which some
characters are represented by 2 bytes and others by 4 bytes.  It has the
nice property that documents encoded with UTF-16 will mostly work OK with
applications that think they're dealing with UCS-2, just as documents
encoded with UTF-8 often work OK with applications that think they're
dealing with ASCII.  But the two 16-bit surrogates aren't proper characters
any more than is each individual byte in a multibyte sequence encoding a
character in UTF-8.
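
Just to make the mechanism concrete (this sketch is mine, purely
illustrative, not something the spec or Tim's message contains), the
transformation from a character beyond the BMP into a surrogate pair
looks like this in C:

#include <stdio.h>

/* Illustrative sketch only: split an ISO 10646 character in the range
   U+10000..U+10FFFF into a UTF-16 surrogate pair.  High surrogates
   occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF. */
void to_surrogate_pair(unsigned long c,
                       unsigned short *hi, unsigned short *lo)
{
  unsigned long v = c - 0x10000UL;                /* 20 bits to split  */
  *hi = (unsigned short)(0xD800 + (v >> 10));     /* top 10 bits       */
  *lo = (unsigned short)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits    */
}

int main(void)
{
  unsigned short hi, lo;
  to_surrogate_pair(0x10000UL, &hi, &lo);  /* first character past BMP */
  printf("%04X %04X\n", (unsigned)hi, (unsigned)lo);  /* D800 DC00     */
  return 0;
}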

We shouldn't be mandating a particular character encoding for an API.  An
application should be able to use UTF-8 or UTF-16 or UCS-4, as it sees fit.

>Having said all that, I will abandon the relatively even-handed
>tone and say that I think we ought simply to line up with Unicode.
>This will have the concrete effect that XML processors will be
>required always to pass 16-bit chunks to applications.  By the
>way, this is how Java works, and in a very hard-coded way.  The
>encoding scheme is entirely without ambiguity. 

Certainly in Java it would make a lot of sense to use UTF-16.  But not
everybody's using Java.

> I have no sympathy
>for the ISO claim that the 31-bit version is more fixed-width in
>any meaningful sense, since Unicode is full of combining characters
>anyhow.

Not all scripts have combining characters.  If I am working with a script
that doesn't have combining characters and does use a lot of characters
outside the BMP (Chinese for example), then it would make sense to use a
32-bit fixed-width encoding internally.

Another way to represent things internally is:

#include <stddef.h>   /* size_t */
#include <stdbool.h>  /* bool   */

struct String {
  unsigned short *s;  /* 16-bit units holding the characters          */
  size_t n;           /* number of characters in the string           */
  bool ucs4;          /* true: two units per character; false: one    */
};

If the ucs4 flag is true then this string represents each character in the
string with two unsigned shorts; otherwise it represents it with one
unsigned short.  In other words each string uses a fixed-width encoding, but
different strings may use different widths.
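
A trivial accessor for that representation might look like this (again a
sketch of mine; the packing of the two 16-bit halves is my assumption, the
struct above doesn't fix it):

/* Sketch only: return character i of str.  Assumes the two 16-bit
   halves of a 32-bit character are stored most-significant first. */
unsigned long string_char_at(const struct String *str, size_t i)
{
  if (str->ucs4)
    return ((unsigned long)str->s[2 * i] << 16) | str->s[2 * i + 1];
  return str->s[i];
}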

>Also, philosophically, once you get outside the 16-bit BMP, you
>are no longer dealing with characters that are routinely
>available in any computer text processing system available anywhere
>in the world.  Forcing ourselves to use 31 bits, and thus wasting
>50% of character buffer storage in 99.999999% of all cases, seems
>entirely out of the spirit of XML.

Right, so don't require a particular representation.  I just don't see your
argument against the current position of "leave it carefully vague as to
what should be passed".  It's not our business in XML-lang to design APIs.

The place where this needs to be addressed is when you do a binding of the
DOM to a particular programming language.  At that point you need to say how
the data types provided by that programming language are used to represent
the abstract character data type.
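
For example, a C binding might declare something like the following (the
names are invented, purely to show where the decision belongs):

#include <stddef.h>

/* Hypothetical fragment of one possible C binding.  The point is that
   the binding, not XML-lang, decides how the abstract character type
   is realized. */
typedef unsigned long XML_Char;       /* one ISO 10646 character        */

typedef struct {
  const XML_Char *data;               /* fixed-width 32-bit characters  */
  size_t length;                      /* number of characters           */
} XML_String;

/* A different binding could equally well define XML_Char as a 16-bit
   UTF-16 code unit, or expose UTF-8 bytes; the spec itself stays silent. */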

James

Received on Thursday, 12 June 1997 00:25:22 UTC