- From: Dan Connolly <connolly@w3.org>
- Date: Wed, 11 Jun 1997 23:59:29 -0500
- To: Tim Bray <tbray@textuality.com>
- CC: w3c-sgml-wg@w3.org
Tim, I have already suggested wording that I believe resolves this issue. In private correspondence, you dismissed it, but then you raise the issue here. I would like you to directly address the text I submitted. Here it is again for handy reference:

==============
http://www.w3.org/XML/Group/9705/xml-spec.html
$Date: 1997/05/31 07:17:23 $

Text Encoding

The basic unit of XML interchange, the text entity, is composed of characters; but computer systems generally store and exchange information composed of bytes or octets. In particular, a text entity is encoded in internet mail[MIME] and Hypertext Transfer Protocol[HTTP@@] as a head and a body; the body is a sequence of octets, and the head identifies a character encoding scheme.

A character encoding scheme over some repertoire is an algorithm or function that maps a sequence of octets to a sequence of characters in the repertoire. On the other hand, a coded character set C over some repertoire maps each character H in the repertoire to a non-negative integer called a code position of H in C. A character encoding scheme S encodes a text entity T as a sequence of octets E iff the S algorithm produces T when given E as input.

For example, US-ASCII is a simple character encoding scheme used extensively in internet mail[MIME]. It is based on the ASCII coded character set[ASCII@@], which assigns code position 65 to 'A', 66 to 'B', etc. Since the repertoire is fairly small, all code positions are between 0 and 127, and the encoding is straightforward: each character in a sequence is encoded as the octet corresponding to its code position. So US-ASCII encodes "ABC" as the sequence of octets 65, 66, 67.

The ASCII coded character set is not sufficient for a global information system such as the web. [ISO-10646@@] defines a coded character set over a repertoire of thousands of characters used by people all over the world. The simple byte-per-character technique is not sufficient for text entities over such a large character repertoire.

UCS-2[@@] is a character encoding scheme over the Basic Multilingual Plane of [ISO-10646]. The code positions of this repertoire are between 0 and 65,535; hence each character can be encoded as two octets. UCS-2 encodes "ABC" as 0, 65, 0, 66, 0, 67. (@@verify this)

@@byte order mark: the algorithm of the UCS-2 scheme produces no characters for the first two octets if they are U+FEFF or U+FFFE. Hence UCS-2 also encodes "ABC" as FE, FF, 0, 65, 0, 66, 0, 67.

UTF-8[@@] is a character encoding scheme over the whole [ISO-10646@@] character repertoire. Characters at code positions up to 127 are encoded as one byte; other characters are encoded as two, three, four, five, or six bytes.

T is simply encoded as E iff UTF-8 encodes T as E, or UCS-2 encodes T as E and E begins with a byte-order mark.

T is verifiably encoded as E iff T is simply encoded as E, or T begins with an encoding declaration for an encoding scheme S and S encodes T as E.

@@ include notes to implementors from "E. Autodetection..."
==============

This suggested wording is mathematically precise, internally consistent, and externally consistent with ISO 10646, the Unicode 2.0 spec, the MIME specs, the HTML I18N specs, and an immense body of correspondence between the IESG and folks like Gary Adams, Francois Yergeau (sp?), etc.
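For concreteness, here is a minimal sketch of the worked examples above in Java (Java comes up again below; the java.nio.charset names and the class name are my own conveniences from a later JDK, not anything the suggested wording depends on):

    import java.nio.charset.StandardCharsets;

    public class EncodingSchemes {
        public static void main(String[] args) {
            String t = "ABC";
            // US-ASCII: one octet per character, equal to the code
            // position (hex 41 42 43 = decimal 65 66 67).
            dump("US-ASCII", t.getBytes(StandardCharsets.US_ASCII)); // 41 42 43
            // UCS-2, big-endian, no byte-order mark.
            dump("UTF-16BE", t.getBytes(StandardCharsets.UTF_16BE)); // 00 41 00 42 00 43
            // Java's UTF-16 encoder prefixes a big-endian byte-order mark.
            dump("UTF-16  ", t.getBytes(StandardCharsets.UTF_16));   // FE FF 00 41 00 42 00 43
            // UTF-8: code positions up to 127 take one octet each.
            dump("UTF-8   ", t.getBytes(StandardCharsets.UTF_8));    // 41 42 43
        }

        static void dump(String scheme, byte[] octets) {
            StringBuilder sb = new StringBuilder(scheme + ":");
            for (byte b : octets) sb.append(String.format(" %02X", b & 0xFF));
            System.out.println(sb);
        }
    }

Each scheme is just a different function from the same three characters to a sequence of octets; the BOM-prefixed UTF-16 output corresponds to the byte-order-mark case in the wording above.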
There's another way of looking at things, where a sequence of octets is mapped to a sequence of code positions via a BTCF (I forget what that stands for), and those code positions are mapped to characters via a coded character set; that's an equally internally consistent way of looking at things. But externally, it's more aligned with the terminology in a HyTime corrigendum that I'm not intimately familiar with, and less aligned with the terminology in the MIME specs. It doesn't really matter: both views of the world are consistent with each other. You can measure length in inches or in meters, and it doesn't matter as long as you agree that an inch is 0.0254 meters. Choose either one, but let's cut out this hand-waving about "16 bit characters."

Tim Bray wrote:
>
> Right now, the spec references both Unicode 2.0 and ISO 10646. These
> each define 30-thousand-odd characters. They are the same characters,
> and they have the same encoding.

I think you mean that both coded character sets assign the same code positions to the same characters. Each standard defines multiple encodings (i.e. character encoding schemes), so I wouldn't know what you mean by "they have the same encoding."

> This is good. The XML spec says that
> characters are from this set, which is fine.

"set" is a term that has a very precise meaning in most contexts, but it is horribly misused in discussions of characters and text. I suggest you use the term "repertoire" instead.

> The spec is rather vague
> about what the processor ought to pass the app character-wise; an
> initial reading would suggest that 16-bit chars are the norm, a careful
> reading reveals a couple of places where we clearly envision characters
> up to 31 bits wide.

A character is an atomic unit of communication; it is not composed of bits. A character can be encoded by a sequence of octets, or represented by a code position (an integer) in a coded character set. But a character is not a number or bit sequence any more than a color is. While folks might say "16 bit colors," they are being imprecise when they do so. Formally, they mean "16 bit quantities that represent colors via a mapping table."

Never mind that the term "processor" is imprecisely defined and used throughout the 970331 XML spec (my suggestions also eliminate the need to do that).

> is... in the spec, should we:
>
> a) leave it carefully vague as to what should be passed

Absolutely not.

> b) line up with the Unicode camp
> c) line up with the ISO camp

I don't see where they conflict. Could you give a specific example? Is there a character whose code position in the coded character set defined by ISO 10646 is different from its code position in the Unicode spec?

> ISO says that characters should always be passed around in 16-bit
> chunks.

That's not the way I understand it. The way I understand it, ISO 10646 defines a bunch of characters by name and by code position, and it also defines some character encoding schemes in unpublished annexes:

==========
ftp://ds.internic.net/rfc/rfc2044.txt

[ISO-10646] ISO/IEC 10646-1:1993. International Standard --
            Information technology -- Universal Multiple-Octet Coded
            Character Set (UCS) -- Part 1: Architecture and Basic
            Multilingual Plane. UTF-8 is described in Annex R,
            adopted but not yet published. UTF-16 is described in
            Annex Q, adopted but not yet published.
==========

For lots of good references, see also:
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

> On the ISO side (but I'm not the right person to explain this for
> reasons that will become clear below) the preference is for a flat
> 31-bit character address space.

Huh? So the paragraph before was about Unicode? In any case: the space of code positions is infinite; it's the non-negative integers. The number of bits is only relevant to character encoding schemes.

> Having said all that, I will abandon the relatively even-handed
> tone and say that I think we ought simply to line up with Unicode.
> This will have the concrete effect that XML processors will be
> required always to pass 16-bit chunks to applications.

I don't think that's useful or necessary. My suggested wording is above.

> By the
> way, this is how Java works, and in a very hard-coded way. The
> encoding scheme is entirely without ambiguity.

Not so: Java strings are objects, and the internal encoding is not visible via the interface those objects export. The UCS-2 encoding is visible via that interface (e.g. the getChars method), and there are some methods that restrict code positions to 16 bits (a Java 'char'). But the implementation could use UTF-7, UCS-4, etc. internally and work just fine. See:
http://java.sun.com:80/products/jdk/1.1/docs/api/java.lang.String.html#_top_

Note that the UTF-8 encoding (actually, a variant of it that doesn't address characters outside the BMP) is also visible via the Java API:
http://java.sun.com:80/products/jdk/1.1/docs/api/java.io.DataOutputStream.html#writeUTF(java.lang.String)
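To make that concrete: both encodings are visible through the same object, without pinning down what the implementation stores. A minimal sketch (class name and output comments are mine, and I use a later JDK's printf for brevity):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class StringViews {
        public static void main(String[] args) throws IOException {
            String s = "ABC";

            // getChars exposes 16-bit UCS-2 code units, whatever the
            // VM uses to represent the string internally.
            char[] units = new char[s.length()];
            s.getChars(0, s.length(), units, 0);
            for (char c : units)
                System.out.printf("U+%04X ", (int) c);  // U+0041 U+0042 U+0043
            System.out.println();

            // writeUTF exposes a (modified) UTF-8 encoding of the same
            // string, preceded by a two-octet length field.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            for (byte b : buf.toByteArray())
                System.out.printf("%02X ", b & 0xFF);   // 00 03 41 42 43
            System.out.println();
        }
    }

Two encodings, one string object; neither one is "how Java works" internally.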
> Also, philosophically, once you get outside the 16-bit BMP, you
> are no longer dealing with characters that are routinely
> available in any computer text processing system available anywhere
> in the world. Forcing ourselves to use 31 bits, and thus wasting
> 50% of character buffer storage in 99.999999% of all cases, seems
> entirely out of the spirit of XML.

I don't think we're forced to make the choice you describe. My suggested wording is above.

--
Dan Connolly
http://www.w3.org/People/Connolly/