- From: Tim Bray <tbray@textuality.com>
- Date: Fri, 13 Jun 1997 09:33:00 -0700
- To: w3c-sgml-wg@w3.org
It would appear that I'm the only one in the world who thinks it would be desirable to specify 16-bit quanta for character passing, and the use of the Unicode past-BMP scheme. Oh well, not quite: the people who build Java, Netscape, and Windows/NT also take that view. Now perhaps we work to a higher standard of purity here, but it seems highly questionable to send XML charging off in a direction that's incompatible with actual industry practice.

Several have asserted that we should just say nothing. While we have not undertaken the task of an XML API, specifying character quanta is a very small API chunk with a huge reward in interoperability.

The supposed benefit is increased abstraction. We don't want abstraction; we want lightweight, working, interoperable applications. The #1 difference between SGML and XML is that we abandoned abstract syntax. There is no stronger case for abstract syntax in characters than in markup delimiters (for XML - for SGML, abstraction is obviously the way to go).

Some have asserted that for past-BMP characters, the character references should be in one chunk (e.g. &#x12345678;, which is from the Unicode surrogate area, but is not a real example because there are no such characters yet). This seems indubitably correct, but for interoperability the processor should still pass two 16-bit surrogates to the app. [Note for fans of *really* exotic characters: the Unicode surrogate mechanism sets aside 131K character positions for past-BMP private-use characters.]

There have been several assertions, without supporting arguments, that we should adopt the ISO flat-31-bit-space model. Unless I hear some good reasons, wasting 50% of the character-passing bandwidth in order to support 0.00005% of characters - characters that have never heretofore been available on computers - just seems like rank stupidity.

Having said all this, let's take some points in context.
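[Editorial note: the surrogate arithmetic behind the point above can be sketched in a few lines of Python. This is a modern illustration added in editing, not part of the 1997 message; U+10437 is a real past-BMP character assigned in a later Unicode version.]

```python
def to_surrogate_pair(cp):
    """Split a past-BMP code point (0x10000..0x10FFFF) into the two
    16-bit surrogates a processor would pass over a 16-bit API."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a past-BMP code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10437 (DESERET SMALL LETTER YEE) becomes the 16-bit pair D801 DC37
print([hex(u) for u in to_surrogate_pair(0x10437)])  # ['0xd801', '0xdc37']
```

The app sees two 16-bit quanta either way; the single-numeral form is only a matter of how the reference is spelled in the entity.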
Dave Peterson:

> It only makes sense to represent high-order 10646 characters via a single long numeral, such as up to eight digits hex.

I agree.

Dave again:

> I heartily agree that we should not be prescribing the representation of characters used internally within a software system, including between its components (like between the XML processor and an application coupled thereto).

James Clark:

> It should be able to pass any representation of the character it finds convenient.

As assertions, these aren't good enough. The benefit of specifying the character representation is an immense increase in international interoperability. Is there a remotely comparable cost?

James again:

> Not all scripts have combining characters. If I am working with a script that doesn't have combining characters and does use a lot of characters outside the BMP (Chinese for example), then it would make sense to use internally a 32-bit fixed-width encoding.

No, because you will be wasting 99.999999% of all your internal character buffers. Maybe memory is just not an issue in these apps? It is not in fact the case that Chinese texts "use lots of characters outside the BMP" - in fact, all the Chinese apps of today use none, and they seem to get by. Anyhow, a similar argument could say "I know I'm processing English, thus I'll just use 7-bit ASCII for everything." I think that neither this nor the Chinese-only attitude described above is in the spirit of what we've done so far, and we want to build a powerful disincentive to this kind of sloppiness into the spec.

James again:

> The place where this needs to be addressed is when you do a binding of the DOM to a particular programming language.

Right... but I had hoped for XML apps to be interoperable through APIs other than the DOM.

Dan Connolly:

> A character encoding scheme over some repertoire is an algorithm or function that maps a sequence of octets into a sequence of characters in the repertoire.
> On the other hand, a coded character set C over some repertoire maps each character H in the repertoire to a non-negative integer called a code...

Dan is catching me in a gross error: in my original post on this I discussed "encodings", which is bogus. I wasn't talking about the actual encodings in the entity; I was talking about what the processor, having read the entity, passes to the app. Dan's discussion of encoding is way more precise than what's in the spec; it is a useful argument (but not the one we're having here) whether the spec should be recast this way, or left the way it is, where it basically punts on encodings and says the processor should do the best it can, but pass Unicode/10646 characters to the application.

Dan again:

> A character is an atomic unit of communication; it is not composed of bits. A character can be encoded by a sequence of octets, or represented by a code position (an integer) in a coded character set.

That's the key disagreement. The analogy to SGML is clear: SGML says an element is an abstract thingie in a document that can be delimited by any of an infinite number of different syntaxes, or not delimited at all in the case of minimization. XML says an element is something that is delimited by tags with a fixed syntax. The position I'm advancing is that XML make the same deliberate abandonment of abstraction at the character level, saying that characters are indeed the bit patterns described in Unicode, with the semantics and processing characteristics described in Unicode, and that's all there is to it. I would *not* support this position for SGML.

Gavin Nicol:

> Also, intuitively, this makes sense, because a character *is* an abstract object.

In XML it doesn't have to be.

Postscript: It would be kind of nice if the representatives of companies on this list, who have collectively invested billions of dollars in Unicode-compliant APIs, would step forward to explain why they think this is a good idea.
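[Editorial note: Dan's two definitions - an encoding scheme mapping octets to characters, and a coded character set mapping characters to integers - can be made concrete with a minimal Python sketch, added in editing and not part of the original thread.]

```python
# One character, one code position, but a different octet sequence
# under each character encoding scheme.
ch = "\u00E9"                   # LATIN SMALL LETTER E WITH ACUTE
code = ord(ch)                  # coded character set: character -> integer 0xE9
utf8 = ch.encode("utf-8")       # encoding scheme: octets C3 A9
utf16 = ch.encode("utf-16-be")  # a different scheme: octets 00 E9
print(hex(code), utf8.hex(), utf16.hex())  # 0xe9 c3a9 00e9
```

The code position is fixed by Unicode/10646; the octets vary by scheme - which is exactly the layer the spec leaves to the processor.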
Cheers, Tim Bray
tbray@textuality.com  http://www.textuality.com/  +1-604-708-9592
Received on Friday, 13 June 1997 12:34:55 UTC