Re: Comments on 31 March spec
At 8:37 AM 4/9/97, Terry Allen wrote:
>I prefer 10646 partly for the reasons the IAB chose it over
>Unicode: it has room for expansion without resorting to
>Unicode's "surrogate extension mechanism" and it is the
>product of a more legitimate organization (no brickbats,
>please; Paul asked, I'm answering).
>The surrogate extension mechanism, in which two 16-bit
>codes are paired to represent characters (or whatnot)
>will be required as soon as the 10646 people specify
>whatnots beyond the BMP. According to what I read on
>the Unicode list (though I may have fallen off it), this
>is going to happen soonish, when some tens of thousands
>of Chinese characters are specified. This pairing
>mechanism is really ISO-2022 in disguise, to my mind;
>might as well go with 10646 instead.
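For anyone who has not looked at the pairing mechanism Terry describes, here is a rough sketch of the arithmetic as it is defined for UTF-16: a character number beyond the BMP is reduced to a 20-bit offset and split across two reserved 16-bit codes. The function names are mine, for illustration only; nothing here comes from the XML draft.

```python
def to_surrogates(code_point):
    """Split a character number beyond the BMP into two 16-bit codes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only characters beyond the BMP need a pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high code
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low code
    return high, low


def from_surrogates(high, low):
    """Recombine a high/low pair into the single character number."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Note that the pair only reaches 0x10FFFF, whereas the 10646 four-octet form addresses characters directly with no pairing step.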
As I understand it, XML is expecting to use the "new SGML
character model" which potentially decouples the representations
from the "document character set", and more specifically XML
plans to use that capability to permit transmission and storage
in multiple representations. This means that the specification
of document character set is only for the purpose of resolving
numeric character references.
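To make the decoupling concrete, here is a minimal sketch, assuming the document character set is 10646/Unicode (which is what Python's chr() implements) and assuming both decimal and hexadecimal reference forms are allowed. The names resolve_char_refs and read_document are hypothetical. The point is that the storage encoding is consumed when the bytes are decoded, and the character numbers are resolved afterwards against the document character set alone.

```python
import re

# Matches &#233; (decimal) or &#xE9; (hexadecimal) style references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def resolve_char_refs(text):
    """Replace numeric character references using the document character set."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        number = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(number)  # assumption: numbers are 10646/Unicode positions
    return _CHAR_REF.sub(replace, text)


def read_document(raw_bytes, storage_encoding):
    """Decode the stored representation first, then resolve references."""
    return resolve_char_refs(raw_bytes.decode(storage_encoding))
```

Under these assumptions, the same document stored as ASCII or as UTF-16 yields the same characters, because the reference numbers never depend on how the bytes were stored.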
Selecting 10646 or Unicode (a particular release/version thereof)
sets the interpretation of lots of character numbers. Probably
these interpretations will be hardwired into simple XML-only
systems. Extending the character set to a new, larger release
(or even an entirely new representation) means a new rev of XML
and new versions of the XML systems to support that new rev.
Note that there is always the "out-of-band information" solution:
The document character set may specify some character numbers
as "legal but undefined", AKA "non-SGML", and then it is up to
cooperating systems to agree externally on the meanings of these
The separation of document character set from internal/storage/
transmission representations brings up some interesting speculation.
E.g., what is a system to do when presented with a character for
which it has no internal representation? But that's another