Re: back to XML: The "You Babe" problem

On Wed, 15 Jan 1997 17:45:39 -0500 Christopher R. Maden said:
>I'd still like to hear a counterargument to defining the reserved
>Unicode stuff as case-insensitive, non-namespace-eating, looking-like-
>charrefs, FUNCHARS (&#U-bABe;).

The note describing the ERB's decision on this question (23 Oct 1996)
says the following:

    C.5 if XML uses ISO 10646, should there be a special form of
    character reference using hexadecimal, not decimal, numbers,
    since most references to ISO 10646 and Unicode use hex, not
    decimal (9.5)?


    Agreed (Clark dissenting) to specify that XML documents may
    refer to characters in ISO 10646 using the form '&u-' or '&U-'
    followed by four hexadecimal digits, followed by semicolon.

    Rationale: Unicode and ISO 10646 documentation is in
    hexadecimal, not decimal, so this constitutes a small but
    important convenience and aid to reliability.  The proposal to
    use '&u' was preferred to the '&#u' proposal since it is
    believed to allow SGML systems to handle these references (which
    appear to an SGML parser to be general entity references) using
    a default entity declaration.  (Consult James Clark for
    details.)

So much for the official version; what follows is my own thinking,
for which the other members of the ERB are not responsible.

I don't know how many current systems are in a position to set up
a way of handling default entities that would allow them to handle
these hex references correctly, but it appears at least possible
within the framework of 8879.

That's long-term solution no. 1:  SGML parsers get Clever Default
Entity Handling (motivated primarily, I assume, by XML and a
public-spirited desire to be part of the Great Spread of SGML
Across the Globe).

Long-term solution no. 2:  SGML97 gets hex character references.
This has been proposed (it's in a U.S. position paper on the
WG8 web site, for which I cannot now remember the URL), though
there are some hitches finding a syntax which everyone agrees on
and which doesn't involve re-interpreting some pathological cases
in hypothetically existent current documents.  If I understand the
current state of play correctly, many think there is no alternative
but the introduction of yet another delimiter, for hex/octal character
reference open.  If this works, it's much the best solution, and
I hope you will ALL WRITE YOUR NATIONAL BODIES AND SAY YOU WANT IT.

And what should we do in the short term?

For SGML systems that cannot handle hex references using clever
default-entity tricks, we seem to have the choice of (a) providing a
standard set of entity declarations including all such characters, to
be included in the DTD, (b) providing an SGML declaration defining
all the possible hex references as names, (c) providing entity
declarations or character-name declarations just for the characters
actually present in the document, or (d) eliminating all such hex
references by replacing them with decimal character references.  All
of these have disadvantages and (presumably) advantages; we just have
to pick.

Choices (a) and (b) seem likely to strain the symbol-table space of
most parsers (I know that when writing code *I* for one often make
simplistic assumptions like "The symbol table will fit within the
address space of an IBM 3090" -- or even of a DOS machine).  Choice
(c) would seem to involve doing a one-off SGML prolog for each XML
document, or for each set of XML documents, to be processed with a
full-SGML system, which gives some people hives.  Choice (d)
requires running a preprocessor on XML before submitting it to a
full-SGML system, or at least to a sufficiently dim full-SGML system,
and our goals seem to suggest it should not be the *required* or
*recommended* solution:  our goals say XML document instances should
be legal SGML document instances, as they stand, without filtering,
as long as they are provided with a suitable prolog.
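
For concreteness, the filter that choice (d) calls for might look
roughly like the sketch below.  It assumes the syntax in the ERB note
('&u-' or '&U-', exactly four hex digits, then a semicolon); the
function name is hypothetical, and the sketch makes no attempt to skip
comments, CDATA sections, or other places where the string is not
really a reference.

```python
import re

# Assumed syntax from the ERB note: '&u-' or '&U-', four hex digits, ';'.
HEX_REF = re.compile(r"&[uU]-([0-9A-Fa-f]{4});")


def hex_refs_to_decimal(text: str) -> str:
    """Rewrite each &U-XXXX; reference as a decimal reference &#N;.

    Hypothetical helper: a plain textual substitution that does not
    parse the document, so it cannot tell markup from data.
    """
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), text)


if __name__ == "__main__":
    print(hex_refs_to_decimal("You &U-bABe; and caf&u-00E9;"))
```

Because int() accepts hex digits in either case, a filter of this
kind is indifferent to the upper/lowercase question.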

As I say, the thought that the required prolog might vary from
instance to instance bothers some people a lot, but (c) seems the
most plausible method of handling this in the short run, or even in
the long run if SGML97 doesn't have hex references and we are stuck
with legacy parsers lacking Clever Default Entity Handling.  In
choice (c), I'm not sure it really matters in theory whether the hex
references are declared as entities or as character names (are they
really all FUNCHARS?  I don't have the standard handy), but in
practice it seems easier to me if I can have a standard unchanging
SGML declaration for XML documents, and confine my instance-specific
tweaks to adding some declarations to the internal subset.  Not
a conclusive argument, but it does seem to me to outweigh the
advantage -- if it is one -- of case-insensitive hex references.
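
A sketch of the instance-specific tweak in choice (c): scan the
document for the hex references actually present and emit one entity
declaration for each, ready to be added to the internal subset.  The
function name and the exact form of the replacement text (a decimal
character reference in the entity literal) are assumptions made for
illustration, as is the premise that entity names are matched
case-sensitively, so that each spelling found in the document gets
its own declaration.

```python
import re
from typing import List

# Assumed syntax from the ERB note: '&u-' or '&U-', four hex digits, ';'.
# The captured group is the would-be entity name, e.g. 'U-BABE'.
HEX_REF = re.compile(r"&([uU]-[0-9A-Fa-f]{4});")


def entity_declarations(text: str) -> List[str]:
    """Return one declaration per distinct hex reference name in text.

    Hypothetical helper: names are kept exactly as spelled, on the
    assumption that the parser treats entity names case-sensitively.
    """
    names = sorted({m.group(1) for m in HEX_REF.finditer(text)})
    # name[2:] drops the leading 'U-' or 'u-' and leaves the hex digits.
    return ['<!ENTITY %s "&#%d;">' % (n, int(n[2:], 16)) for n in names]


if __name__ == "__main__":
    for decl in entity_declarations("a &U-BABE; b &U-00E9; c &U-BABE;"):
        print(decl)
```

Note that under this assumption a document mixing &U-BABE; and
&U-babe; would need two declarations, which is one more small
argument for settling on a single case.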

If we do have case-sensitive references, here's my vote for
uppercase.  Uppercase is more visible, and lowercase hex looks funny.

-C. M. Sperberg-McQueen

Received on Thursday, 16 January 1997 10:02:31 UTC