- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Thu, 16 Jan 97 08:26:31 CST
- To: W3C SGML Working Group <w3c-sgml-wg@www10.w3.org>
On Wed, 15 Jan 1997 17:45:39 -0500 Christopher R. Maden said: >I'd still like to hear a counterargument to defining the reserved >Unicode stuff as case-insensitive, non-namespace-eating, looking-like- >charrefs, FUNCHARS (&#U-bABe;). The note describing the ERB's decision on this question (23 Oct 1996) says the following: C.5 if XML uses ISO 10646, should there be a special form of character reference using hexadecimal, not decimal, numbers, since most references to ISO 10646 and Unicode use hex, not decimal (9.5)? Agreed (Clark dissenting) to specify that XML documents may refer to characters in ISO 10646 using the form '&u-' or '&U-' followed by four hexadecimal digits, followed by semicolon. Rationale: Unicode and ISO 10646 documentation is in hexadecimal, not decimal, so this constitutes a small but important convenience and aid to reliability. The proposal to use '&u' was preferred to the '&#u' proposal since it is believed to allow SGML systems to handle these references (which appear to an SGML parser to be general entity references) using a default entity declaration. (Consult James Clark for details.) So much for the official version; what follows is my own thinking, for which the other members of the ERB are not responsible. I don't know how many current systems are in a position to set up a way of handling default entities that would allow them to handle these hex references correctly, but it appears at least possible within the framework of 8879. That's long-term solution no. 1: SGML parsers get Clever Default Entity Handling (motivated primarily, I assume, by XML and a public-spirited desire to be part of the Great Spread of SGML Across the Globe). Long-term solution no. 2: SGML97 gets hex character references. This has been proposed (it's in a U.S. position paper on the WG8 web site, for which I cannot now remember the URL), though there are some hitches finding a syntax which everyone agrees on and which doesn't involve re-interpreting some pathological cases in hypothetically existent current documents. If I understand the current state of play correctly, many think there is no alternative but the introduction of yet another delimiter, for hex/octal character reference open. If this works, it's much the best solution, and I hope you will ALL WRITE YOUR NATIONAL BODIES AND SAY YOU WANT IT. And what should we do in the short term? For SGML systems that cannot handle hex references using clever default-entity tricks, we seem to have the choice of (a) providing a standard set of entity declarations including all such characters, to be included in the DTD, (b) providing an SGML declaration defining all the possible hex references as names, (c) providing entity declarations or character-name declarations just for the characters actually present in the document, or (d) eliminating all such hex references by replacing them with decimal character references. All of these have disadvantages and (presumably) advantages; we just have to pick. Choices (a) and (b) seem likely to strain the symbol-table space of most parsers (I know that when writing code *I* for one often make simplistic assumptions like "The symbol table will fit within the address space of an IBM 3090" -- or even of a DOS machine). Choice (c) would seem to involve doing a one-off SGML prolog for each XML document, or for each set of XML documents, to be processed with a full-SGML system, which gives some people hives. Choice (d) requires running an preprocessor on XML before submitting it to a Full-SGML system, or at least to a sufficiently dim full-SGML system, and our goals seem to suggest it should not be the *required* or *recommended* solution: our goals say XML document instances should be legal SGML document instances, as they stand, without filtering, as long as they are provided with a suitable prolog. As I say, the thought that the required prolog might vary from instance to instance bothers some people a lot, but (c) seems the most plausible method of handling this in the short run, or even in the long run if SGML 97 doesn't have hex references and we are stuck with legacy parsers lacking Clever Default Entity Handling. In choice (c), I'm not sure it really matters whether the hex references are declared as entities or as character names (are they really all FUNCHARS? I don't have the standard handy) in theory, but in practice it seems easier to me if I can have a standard unchanging SGML declaration for XML documents, and confine my instance-specific tweaks to adding some declarations to the internal subset. Not a conclusive argument, but it does seem to me to outweight the advantage -- if it is one -- of case-insensitive hex references. If we do have case-sensitive references, here's my vote for uppercase. Uppercase is more visible, and lowercase hex looks funny. -C. M. Sperberg-McQueen
Received on Thursday, 16 January 1997 10:02:31 UTC