- From: James Clark <jjc@jclark.com>
- Date: Thu, 12 Jun 1997 11:07:29 +0700
- To: w3c-sgml-wg@w3.org
At 00:51 12/06/97 -0700, Tim Bray wrote:

>The spec is rather vague
>about what the processor ought to pass the app character-wise;

It should be able to pass any representation of the character it finds
convenient.

> an
>initial reading would suggest that 16-bit chars are the norm, a careful
>reading reveals a couple of places where we clearly envision characters
>up to 31 bits wide.

It is passing characters to the application. We shouldn't care how they
are represented. You don't have to represent characters by an integer.
In general the right choice will depend on the language being used. If
the language has built-in support for a 16-bit representation it may
well be convenient to use that. If it has built-in support for a 32-bit
representation it may well be convenient to use that.

>ISO [you mean Unicode] says that characters should always be passed around in 16-bit
>chunks. It reserves two blocks of 1024 chars each that will never
>be used for other purposes called "low surrogate" and "high surrogate".
>For characters that extend past the Basic Multilingual Plane (the
>basic 64K 16-bit chars) they are given in two 16-bit chunks, the
>first of which must come from the low surrogate block, the second
>from the high surrogate block. This gets you about a million extra
>characters, organized in 16 planes of 64K chars. The encoding is
>completely unambiguous, you can look at any 16-bit quantity and
>if it's half of a 32-bit character, you know.

Isn't that the same as UTF-16, which is in Amendment 1 to ISO/IEC 10646?

The way I look at it is that it's a variable-length encoding in which
some characters are represented by 2 bytes and others by 4 bytes. It
has the nice property that documents encoded with UTF-16 will mostly
work OK with applications that think they're dealing with UCS-2, just
as documents encoded with UTF-8 often work OK with applications that
think they're dealing with ASCII. But the two 16-bit surrogates aren't
proper characters any more than are the individual bytes in a multibyte
sequence encoding a character in UTF-8.

We shouldn't be mandating a particular character encoding for an API.
An application should be able to use UTF-8 or UTF-16 or UCS-4, as it
sees fit.

>Having said all that, I will abandon the relatively even-handed
>tone and say that I think we ought simply to line up with Unicode.
>This will have the concrete effect that XML processors will be
>required always to pass 16-bit chunks to applications. By the
>way, this is how Java works, and in a very hard-coded way. The
>encoding scheme is entirely without ambiguity.

Certainly in Java it would make a lot of sense to use UTF-16. But not
everybody's using Java.

> I have no sympathy
>for the ISO claim that the 31-bit version is more fixed-width in
>any meaningful sense, since Unicode is full of combining characters
>anyhow.

Not all scripts have combining characters. If I am working with a
script that doesn't have combining characters and does use a lot of
characters outside the BMP (Chinese for example), then it would make
sense to use a 32-bit fixed-width encoding internally.

Another way to represent things internally is:

  struct String {
    unsigned short *s;
    size_t n;
    bool ucs4;
  };

If the ucs4 flag is true then this string represents each character
with two unsigned shorts; otherwise it represents it with one unsigned
short. In other words each string uses a fixed-width encoding, but
different strings may use different widths.
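To make that concrete, a character accessor on top of such a struct
might look something like this. It's just a sketch: the function name
is made up, and the packing of the two 16-bit units into a 32-bit value
(high half first) is an arbitrary choice.

  #include <stddef.h>
  #include <stdbool.h>

  /* Same struct as above, repeated so this fragment stands alone. */
  struct String {
    unsigned short *s;  /* array of 16-bit units */
    size_t n;           /* number of characters in the string */
    bool ucs4;          /* true: two units per character; false: one */
  };

  /* Return the i-th character of the string as a 32-bit value.
     Each string is fixed-width, so this is a constant-time lookup
     whichever width the string happens to use. */
  unsigned long string_char_at(const struct String *str, size_t i)
  {
    if (str->ucs4)
      return ((unsigned long)str->s[2 * i] << 16) | str->s[2 * i + 1];
    return str->s[i];
  }

The point is only that the application, not the spec, gets to pick the
representation; the accessor hides it either way.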
>Also, philosophically, once you get outside the 16-bit BMP, you
>are no longer dealing with characters that are routinely
>available in any computer text processing system available anywhere
>in the world. Forcing ourselves to use 31 bits, and thus wasting
>50% of character buffer storage in 99.999999% of all cases, seems
>entirely out of the spirit of XML.

Right, so don't require a particular representation. I just don't see
your argument against the current position of "leave it carefully vague
as to what should be passed". It's not our business in XML-lang to
design APIs. The place where this needs to be addressed is when you do
a binding of the DOM to a particular programming language. At that
point you need to say how the data types provided by that programming
language are used to represent the abstract character data type.

James