Re: I18N issue needs consideration from Dan Connolly on 1997-06-12 (w3c-sgml-wg@w3.org from June 1997)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 11 Jun 1997 23:59:29 -0500
To: Tim Bray <tbray@textuality.com>
CC: w3c-sgml-wg@w3.org
Message-ID: <339F8231.EA3@w3.org>

Tim,

I have already suggested wording that I believe
resolves this issue.
In private correspondence, you dismissed it, but then
you raise the issue here. I would like you to directly
address the text I submitted.
Here it is again for handy reference:

==============
http://www.w3.org/XML/Group/9705/xml-spec.html
$Date: 1997/05/31 07:17:23 $

Text Encoding

The basic unit of XML interchange, the text entity, is composed of
characters; but computer systems generally store and exchange
information composed of bytes or octets. In particular, a text entity
is encoded in internet mail[MIME] and Hypertext Transfer
Protocol[HTTP@@] as a head and a body; the body is a sequence of
octets, and the head identifies a character encoding scheme.

A character encoding scheme over some repertoire is an algorithm or
function that maps a sequence of octets into a sequence of characters
in the repertoire. On the other hand, a coded character set C over
some repertoire maps each character H in the repertoire to a
non-negative integer called a code position of H in C.

A character encoding scheme S encodes a text entity T as a sequence of
octets E iff the S algorithm produces T when given E as input.

For example, US-ASCII is a simple character encoding scheme used
extensively in internet mail[MIME]. It is based on the ASCII coded
character set[ASCII@@] which assigns code position 65 to 'A', 66 to
'B', etc. Since the repertoire is fairly small, all code positions are
between 0 and 127 and the encoding is straightforward: each character
in a sequence is encoded as the octet corresponding to its code
position. So US-ASCII encodes "ABC" as the sequence of octets
65, 66, 67.
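
A minimal Python sketch of this example, assuming Python's built-in
"ascii" codec behaves as the US-ASCII scheme described here (Python is
used only for illustration; it is not part of the suggested wording):

```python
# US-ASCII: each character is encoded as the octet equal to its code position.
text = "ABC"
octets = text.encode("ascii")               # characters -> octets
code_positions = [ord(ch) for ch in text]   # coded character set lookup

print(list(octets))      # [65, 66, 67]
print(code_positions)    # [65, 66, 67]

# Viewed as the scheme is defined above: octets in, characters out.
decoded = bytes([65, 66, 67]).decode("ascii")
print(decoded)           # ABC
```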

The ASCII coded character set is not sufficient for a global
information system such as the web. [ISO-10646@@] defines a coded
character set over a repertoire of thousands of characters used by
people all over the world. The simple byte-per-character technique is
not sufficient for text entities over such a large character
repertoire.

UCS-2[@@] is a character encoding scheme over the Basic Multilingual
Plane of [ISO-10646]. The code positions of this repertoire are between
0 and 65,535; hence each character can be encoded as two octets. UCS-2
encodes "ABC" as 0, 65, 0, 66, 0, 67. (@@verify this) @@byte order
mark: algorithm of the UCS-2 scheme produces no characters for the
first two octets if they are U+FEFF or U+FFFE. Hence UCS-2 also encodes
"ABC" as FE, FF, 0, 65, 0, 66, 0, 67.
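
The UCS-2 example can be checked with a short Python sketch. Python has
no codec named UCS-2, so this assumes that "utf-16-be" yields the same
octets for characters in the Basic Multilingual Plane; decode_ucs2 is a
hypothetical helper written for this illustration, not an existing API.

```python
# Assumption: for BMP characters, "utf-16-be" matches big-endian UCS-2.
BOM_BE = b"\xfe\xff"    # byte-order mark, high octet first
BOM_LE = b"\xff\xfe"    # the same mark seen in the opposite byte order

plain = "ABC".encode("utf-16-be")
with_bom = BOM_BE + plain

def decode_ucs2(octets):
    """Hypothetical helper: a leading byte-order mark yields no characters."""
    if octets[:2] == BOM_BE:
        return octets[2:].decode("utf-16-be")
    if octets[:2] == BOM_LE:
        return octets[2:].decode("utf-16-le")
    return octets.decode("utf-16-be")   # no mark: assume high octet first

print(list(plain))      # [0, 65, 0, 66, 0, 67]
print(list(with_bom))   # [254, 255, 0, 65, 0, 66, 0, 67]
```

Both octet sequences decode to "ABC" under this helper, which is the
sense in which the scheme encodes the same text two ways.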

UTF-8[@@] is a character encoding scheme over the whole [ISO-10646@@]
character repertoire. Characters at code positions up to 127 are
encoded as one byte; other characters are encoded as two, three, four,
five, or six bytes.
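
A small Python sketch of the variable-length behaviour, using Python's
built-in "utf-8" codec. The sample characters are chosen here for
illustration only. The five- and six-byte forms apply to code positions
beyond 0x1FFFFF in the original UTF-8 definition; Python's codec follows
the later definition that stops at four bytes, so they are not shown.

```python
# Sample characters (chosen for illustration) and a note on each.
samples = {
    "A": "U+0041, code position below 128",
    "\u00e9": "U+00E9, Latin small letter e with acute",
    "\u4e2d": "U+4E2D, a CJK ideograph in the BMP",
    "\U00010400": "U+10400, outside the BMP",
}

# Number of octets UTF-8 needs for each character.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, note in samples.items():
    print(f"{note}: {lengths[ch]} octet(s)")
```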

T is simply encoded as E iff 

     UTF-8 encodes T as E or 
     UCS-2 encodes T as E and E begins with a byte-order mark. 

T is verifiably encoded as E iff 

     T is simply encoded as E or 
     T begins with an encoding declaration for an encoding scheme S
     and S encodes T as E 
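
The two definitions can be sketched as predicates in Python. This is an
illustration under stated assumptions, not a normative algorithm: the
function names are hypothetical, Python's "utf-16" codec stands in for
UCS-2 with a byte-order mark, and the regular expression for the
encoding declaration is a guess at the syntax made for this example.

```python
import re

BOMS = (b"\xfe\xff", b"\xff\xfe")

def encodes(scheme, text, octets):
    """S encodes T as E iff running S over E produces T."""
    try:
        return octets.decode(scheme) == text
    except (UnicodeDecodeError, LookupError):
        return False

def simply_encoded(text, octets):
    if encodes("utf-8", text, octets):
        return True
    # Assumption: "utf-16" stands in for UCS-2; it consumes a leading mark.
    return octets[:2] in BOMS and encodes("utf-16", text, octets)

# Assumed declaration syntax, for illustration only.
DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def verifiably_encoded(text, octets):
    if simply_encoded(text, octets):
        return True
    m = DECL.match(text)
    return bool(m) and encodes(m.group(1), text, octets)

doc = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u00e9</a>'
print(simply_encoded(doc, doc.encode("latin-1")))       # False
print(verifiably_encoded(doc, doc.encode("latin-1")))   # True
```

The point of the sketch is that a receiver can test either condition
from T and E alone, without outside information.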

@@ include notes to implementors from "E. Autodetection..." 

==============

This suggested wording is mathematically precise, internally
consistent, and externally consistent with ISO 10646, the
Unicode 2.0 spec, the MIME specs, the HTML I18N specs,
and an immense body of correspondence between the IESG
and folks like Gary Adams, Francois Yergeau (sp?),
etc.

There's another way of looking at things, where a sequence
of octets is mapped to a sequence of code positions via
a BCTF (I forget what that stands for) and those code positions
are mapped to characters via a coded character set; that's
an equally internally consistent way of looking at things.
But externally, it's more aligned with the terminology
in a HyTime corrigendum that I'm not intimately familiar with,
and less aligned with the terminology in the MIME specs.
It doesn't really matter: both views of the world are
consistent with each other. You can measure length in
inches or in meters and it doesn't matter as long as you
agree that an inch is 0.0254 meters.
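
To make the equivalence concrete, here is a Python sketch of both
views applied to the same octets. The two function names are
hypothetical, the fixed two-octet big-endian transform is an assumption
chosen to match the UCS-2 example, and Python's chr() plays the role of
the coded character set.

```python
import struct

octets = bytes([0, 65, 0, 66, 0, 67])

# View 1: a character encoding scheme maps octets straight to characters.
one_step = octets.decode("utf-16-be")

# View 2: octets -> code positions, then code positions -> characters.
def octets_to_code_positions(data):
    """Hypothetical transform: each pair of octets is one big-endian integer."""
    return list(struct.unpack(f">{len(data) // 2}H", data))

def code_positions_to_characters(positions):
    """Coded character set lookup, with chr() standing in for ISO 10646."""
    return "".join(chr(p) for p in positions)

positions = octets_to_code_positions(octets)
two_step = code_positions_to_characters(positions)

print(positions)              # [65, 66, 67]
print(one_step == two_step)   # True
```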

Choose either one, but let's cut out this hand-waving
about "16 bit characters."


Tim Bray wrote:
> 
> Right now, the spec references both Unicode 2.0 and ISO 10646.  These
> each define 30-thousand-odd characters.  They are the same characters,
> and they have the same encoding.

I think you mean that both coded character sets assign the same
code positions to the same characters. Each standard defines
multiple encodings (i.e. character encoding schemes) and
so I wouldn't know what you mean by "they have the
same encoding."


>  This is good.  The XML spec says that
> characters are from this set, which is fine.

"set" is a term that has a very precise meaning in most
contexts, but it is horribly misused in discussions of
characters and text. I suggest you use the term "repertoire"
instead.

>  The spec is rather vague
> about what the processor ought to pass the app character-wise; an
> initial reading would suggest that 16-bit chars are the norm, a careful
> reading reveals a couple of places where we clearly envision characters
> up to 31 bits wide.

A character is an atomic unit of communication;
it is not composed of bits. A character
can be encoded by a sequence of octets, or represented by
a code position (an integer) in a coded character set.

But a character is not a number or bit sequence any more than
a color is. While folks might say "16 bit colors," they are
being imprecise when they do so. Formally, they mean "16 bit
quantities that represent colors via a mapping table."

Never mind that the term "processor" is imprecisely defined
and used throughout the 970331 XML spec (my suggestions
also eliminate the need to do that).

> is... in the spec, should we:
> 
> a) leave it carefully vague as to what should be passed

Absolutely not.

> b) line up with the Unicode camp
> c) line up with the ISO camp

I don't see where they conflict. Could you give a specific
example? Is there a character whose code position
in the coded character set defined by ISO10646 is different
from its code position in the Unicode spec?

> ISO says that characters should always be passed around in 16-bit
> chunks.

That's not the way I understand it. The way I understand it,
ISO10646 defines a bunch of characters by name and by code position,
and it also defines some character encoding schemes in an unpublished
annex:

==========
ftp://ds.internic.net/rfc/rfc2044.txt

   [ISO-10646]    ISO/IEC 10646-1:1993. International Standard -- Infor-
                  mation technology -- Universal Multiple-Octet Coded
                  Character Set (UCS) -- Part 1: Architecture and Basic
                  Multilingual Plane.  UTF-8 is described in Annex R,
                  adopted but not yet published.  UTF-16 is described in
                  Annex Q, adopted but not yet published.
==========

For lots of good references, see also:
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets


> On the ISO side (but I'm not the right person to explain this for
> reasons that will become clear below) the preference is for a flat
> 31-bit character address space.

Huh? So the paragraph before was about Unicode?

In any case: the space of code positions is infinite. It's
the integers. The number of bits is only relevant to character
encoding schemes.
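
A short Python sketch of that distinction. The helper name octet_counts
is hypothetical, and "utf-16-be" and "utf-32-be" are assumed stand-ins
for the two-octet and four-octet schemes under discussion. The code
position is a plain integer; only the encoded form has a width.

```python
def octet_counts(ch):
    """Hypothetical helper: octets needed for one character, per scheme."""
    schemes = ("utf-8", "utf-16-be", "utf-32-be")
    return {s: len(ch.encode(s)) for s in schemes}

for ch in ("A", "\U00010400"):
    # ord() gives the code position; the dict gives the encoded sizes.
    print(hex(ord(ch)), octet_counts(ch))
```

The second character has a code position above 65,535, yet nothing
about the integer itself says how many bits a given scheme will spend
on it.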

> Having said all that, I will abandon the relatively even-handed
> tone and say that I think we ought simply to line up with Unicode.
> This will have the concrete effect that XML processors will be
> required always to pass 16-bit chunks to applications.

I don't think that's useful or necessary. My suggested wording
is above.

>  By the
> way, this is how Java works, and in a very hard-coded way.  The
> encoding scheme is entirely without ambiguity.

Not so: Java strings are objects, and the internal encoding
is not visible via the interface those objects export. The
UCS-2 encoding is visible via that interface (e.g. the getChars method),
and there are some methods that restrict code positions to 16 bits (a
Java 'char'). But the implementation could use UTF-7,
UCS-4, etc. internally and work just fine.

See:
http://java.sun.com:80/products/jdk/1.1/docs/api/java.lang.String.html#_top_

Note that the UTF-8 encoding (actually, a variant of it that doesn't
address characters outside the BMP) is also visible via the Java API

http://java.sun.com:80/products/jdk/1.1/docs/api/java.io.DataOutputStream.html#writeUTF(java.lang.String)

> Also, philosophically, once you get outside the 16-bit BMP, you
> are no longer dealing with characters that are routinely
> available in any computer text processing system available anywhere
> in the world.  Forcing ourselves to use 31 bits, and thus wasting
> 50% of character buffer storage in 99.999999% of all cases, seems
> entirely out of the spirit of XML.

I don't think we're forced to make the choice you describe. My
suggested wording is above.

-- 
Dan Connolly
http://www.w3.org/People/Connolly/
Received on Thursday, 12 June 1997 00:59:21 UTC