- From: Tim Bray <tbray@textuality.com>
- Date: Thu, 12 Jun 1997 00:51:46 -0700
- To: w3c-sgml-wg@w3.org
Over the last several months, I have had contact with several members of the I18n and more specifically web-i18n community, who have pointed out a potential problem with the latest draft of XML-lang. I have discussed this with Michael, and we are not fully in agreement as to how this should move forward; I take this as prima facie evidence that there is an issue of policy here that needs input from this group. Jon has asked me to raise it here, and hopefully we can sort it out in the next rev of the spec.

Right now, the spec references both Unicode 2.0 and ISO 10646. These each define 30-thousand-odd characters. They are the same characters, and they have the same encoding. This is good. The XML spec says that characters are from this set, which is fine. The spec is rather vague about what the processor ought to pass the app character-wise; an initial reading would suggest that 16-bit chars are the norm, but a careful reading reveals a couple of places where we clearly envision characters up to 31 bits wide.

This is material, because the 30-odd-K characters Unicode/ISO now have do not include all the Chinese characters there are or ever have been (although they do include all of those that are typically available on computer systems). The Chinese folks have several tens of thousands more queued up for addition. Also incoming are some dead scripts such as Aztec and Maya, and (I have heard) Tolkien-Elvish and Klingon. One way or another, this is going to spill over the 64K limit. And unfortunately, once you get past 64K, Unicode and ISO are no longer in a state of happy unity.

The issue of policy we have to decide is... in the spec, should we:

a) leave it carefully vague as to what should be passed
b) line up with the Unicode camp
c) line up with the ISO camp

Now here's a problem. I'm not sure it would be appropriate for me, in this forum, to explain what these options are and what they mean.
Anybody who wants to pitch in on this issue should really Really *REALLY* go and pick up the Unicode 2.0 standard and read it. It is kind of expensive, but an all-around good piece of work that is a pleasure to read. Having said this, the following is a vastly oversimplified summary of the Unicode & ISO world-views, provided only as a teaser to motivate you to go and read up:

Unicode says that characters should always be passed around in 16-bit chunks. It reserves two blocks of 1024 chars each that will never be used for other purposes, called the "high surrogate" and "low surrogate" blocks. Characters that extend past the Basic Multilingual Plane (the basic 64K 16-bit chars) are given in two 16-bit chunks, the first of which must come from the high surrogate block, the second from the low surrogate block. This gets you about a million extra characters, organized in 16 planes of 64K chars each. The encoding is completely unambiguous: you can look at any 16-bit quantity in isolation and know whether it is half of a 32-bit character. A system that doesn't know this stuff and gets one of these would display it as two blobs on the screen. A system that knew the basic scheme, but not the actual 32-bit char, would display it as one blob. A system that knew the big character could actually display it.

On the ISO side (but I'm not the right person to explain this, for reasons that will become clear below), the preference is for a flat 31-bit character address space. There are a variety of reasons for this; the one that speaks most clearly to me is based on history: we thought 16-bit computers were enough, then we thought 32-bit computers were enough; let's not do this to ourselves again.

==============================================================

Having said all that, I will abandon the relatively even-handed tone and say that I think we ought simply to line up with Unicode. This will have the concrete effect that XML processors will be required always to pass 16-bit chunks to applications.
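The surrogate arithmetic is simple enough to sketch in a few lines of Java, whose 16-bit char type uses exactly this scheme. This is only an illustration of the mechanism, not anything the spec would mandate; note in particular that no characters beyond the BMP have actually been assigned yet, so the sample code point is purely hypothetical:

```java
public class SurrogateDemo {
    // Split a code point beyond the BMP (0x10000..0x10FFFF) into a
    // surrogate pair: subtract 0x10000, then give 10 bits to each half.
    static char[] encode(int codePoint) {
        int v = codePoint - 0x10000;              // 20 significant bits remain
        char hi = (char) (0xD800 + (v >> 10));    // high surrogate (D800-DBFF), comes first
        char lo = (char) (0xDC00 + (v & 0x3FF));  // low surrogate (DC00-DFFF), comes second
        return new char[] { hi, lo };
    }

    // Recombine a surrogate pair into the original code point.
    static int decode(char hi, char lo) {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x20000;  // first code point of Plane 2 (nothing assigned there yet)
        char[] pair = encode(cp);
        // Each half identifies itself: D800-DBFF can only be a leading half,
        // DC00-DFFF can only be a trailing half -- hence the lack of ambiguity.
        System.out.printf("U+%X -> %X %X -> U+%X%n",
                cp, (int) pair[0], (int) pair[1], decode(pair[0], pair[1]));
    }
}
```

Because the two blocks are disjoint and reserved, a scanner can look at any single 16-bit unit and resynchronize immediately; that is the unambiguity claimed above.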
By the way, this is how Java works, and in a very hard-coded way. The encoding scheme is entirely without ambiguity. I have no sympathy for the ISO claim that the 31-bit version is more fixed-width in any meaningful sense, since Unicode is full of combining characters anyhow. Also, philosophically, once you get outside the 16-bit BMP, you are no longer dealing with characters that are routinely available in any computer text processing system available anywhere in the world. Forcing ourselves to use 31 bits, and thus wasting 50% of character buffer storage in 99.999999% of all cases, seems entirely out of the spirit of XML. - Tim
Received on Wednesday, 11 June 1997 18:53:13 UTC