Re: I18N issue needs consideration from Gavin Nicol on 1997-06-12 (w3c-sgml-wg@w3.org from June 1997)

From: Gavin Nicol <gtn@eps.inso.com>
Date: Wed, 11 Jun 1997 20:05:21 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <199706120005.UAA22951@nathaniel.eps.inso.com>
>Over the last several months, I have had contact with several members of the
>I18n and more specifically web-i18n community, who have pointed out a
>potential problem with the latest draft of XML-lang. 

Can I ask who?

>One way or another, this is going to spill over the 64k limit.  And 
>unfortunately, once you get past 64k, Unicode and ISO no longer
>are in a state of happy unity.  The issue of policy we have to decide
>is... in the spec, should we:
>
>a) leave it carefully vague as to what should be passed
>b) line up with the Unicode camp 
>c) line up with the ISO camp

I would prefer (a).

>ISO says that characters should always be passed around in 16-bit
>chunks.  It reserves two blocks of 1024 chars each that will never
>be used for other purposes called "low surrogate" and "high surrogate".

I think you mean Unicode, not ISO.
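
For concreteness, the surrogate mechanism described in the quoted text can
be sketched as below. This is only an illustration in Python, not anything
proposed in the thread: the function names and the sample value 0x12345 are
made up for the example, and the arithmetic is the standard mapping in which
two blocks of 1024 values each address 1024 * 1024 further code points.

```python
HIGH_BASE = 0xD800  # first of the 1024 reserved "high surrogate" values
LOW_BASE = 0xDC00   # first of the 1024 reserved "low surrogate" values


def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) pair of 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range surrogates can address")
    offset = code_point - 0x10000  # a 20-bit value
    return HIGH_BASE + (offset >> 10), LOW_BASE + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


# Hypothetical sample value, chosen only because it lies above the 64k limit.
high, low = to_surrogates(0x12345)
print(hex(high), hex(low))
```

Anything above 0x10FFFF cannot be expressed this way, which is where the
16-bit scheme and a flat 31-bit space stop agreeing.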

>On the ISO side (but I'm not the right person to explain this for
>reasons that will become clear below) the preference is for a flat
>31-bit character address space.  There are a variety of reasons
...

>Having said all that, I will abandon the relatively even-handed
>tone and say that I think we ought simply to line up with Unicode.
>This will have the concrete effect that XML processors will be
>required always to pass 16-bit chunks to applications.  By the
>way, this is how Java works, and in a very hard-coded way.  The
>encoding scheme is entirely without ambiguity.  I have no sympathy
>for the ISO claim that the 31-bit version is more fixed-width in
>any meaningful sense, since Unicode is full of combining characters
>anyhow.

I think you have clouded the issue somewhat. Deciding to use code
points from ISO, or from Unicode, does not necessarily affect the
amount of storage used. Indeed, some who do not know you better, but
do know I18N *implementation* well, might feel that you are confused.

The real issues are:

1) Do we use 16-bit or 31-bit code points (i.e. do we decide to use
   Unicode, or ISO 10646)?
2) What is the representation that an XML application must pass back? 

I would favor using ISO 10646 as the coded character set for the
SGML declaration for XML, and specifying that the character
*repertoire* available within XML is that of ISO 10646. I could be
convinced to line up with Unicode in this regard.

However, I most certainly do *NOT* think that we have any business
defining what the processor hands back. This is purely an
implementation issue, and not one that belongs in XML-lang. I can
return a stream of 31-bit characters coded in any number of different
encodings. I might return them as UTF in my application, or as UCS, or
as a string encoded using hex digits.
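
A minimal sketch of that point, in Python: one abstract code point, several
concrete representations. The function name and the sample value are
hypothetical, and Python's "utf-32-be" codec is used here only as a stand-in
for UCS-4 (it stops at 0x10FFFF, whereas ISO 10646 defines a 31-bit space).

```python
def representations(code_point):
    """Return several concrete encodings of one abstract code point."""
    ch = chr(code_point)
    return {
        "utf-8": ch.encode("utf-8"),       # variable-width, 8-bit units
        "utf-16": ch.encode("utf-16-be"),  # 16-bit units, surrogates if needed
        "ucs-4": ch.encode("utf-32-be"),   # stand-in for fixed-width UCS-4
        "hex": format(code_point, "X"),    # plain string of hex digits
    }


# Hypothetical sample value above the 64k boundary.
for name, value in representations(0x12345).items():
    print(name, value)
```

All four carry the same character; which one a processor hands back is a
choice made by the implementation, not by the character repertoire.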

There is one more issue, and that is the question of how the
application represents/interprets characters. I personally like to
view characters as purely abstract objects, thereby leaving the
widest possible choice of implementation strategies, though this does
not seem to be the model favoured by SGML (this *is* the model for
HTML).
Received on Wednesday, 11 June 1997 20:06:04 UTC