- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Sat, 22 Mar 97 16:23:14 CST
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Having just read through the entire thread about the predefined entities, several things seem clear to me, though the correct solution to be built on that clarity is not quite so clear.

- predefining lt, amp, gt, quot, and apos is problematic primarily because there is nothing in SGML that works quite this way. A conforming SGML system is going to want to see entity declarations for these characters, while an XML processor is going to flag such declarations as errors.

- named character references in SGML do work almost this way (Lee suggested this), but not quite: while numeric character references do cause the character in question to be treated as data rather than a delimiter, named character references do not. If we had a charset declaration that defined LT or STAGO as a function, the string &#LT;gi> would be legal. But it would not work the way &lt;gi> does. It would work the way <gi> does. (Of course, I'm writing this from home, without the Handbook handy, so I may have gotten this turned around Yet Again.)

- there are some serious problems in the phrasing of the current draft spec. It can indeed be read to mean you *must* use &lt; to escape <, so the following will not work:

    <!ENTITY stago '<' >
    ...
    Tag a paragraph using &stago;p>.

  The < in the entity text of 'stago' is not a delimiter, and thus should not appear in literal form. Various posters are correct to say the spec should probably not say this; various other posters are correct to say that the spec does however say this. The following, though, might be considered defensible (I'd defend it), on the grounds that the spec doesn't forbid it, and < is properly escaped where it does occur:

    <!ENTITY stago '&lt;' >
    ...
    Tag a paragraph using &stago;p>.
- The current draft spec successfully avoids having to get into the theory and practice of the entity-end signal (that magic non-character which appears in the character stream and ensures that the string &lt;p>, which expands to '<', EE, 'p', '>', won't be read as a tag, since < is not followed by a name-start character). It's not clear to me how to generalize our current treatment without having to reintroduce EE.

- Numeric character references have none of the problems identified with lt, amp, etc. Their only problem is that they are not much fun to memorize, since most of us find letters and abbreviations in some language we speak more mnemonic than numbers. (Ramanujan, Gauss, and others appear to have been exceptions, I admit it. But though they may be smarter than the rest of us, they don't outnumber us.)

Thinking it over, I think John Lavagnino had the best idea: require escaping, show how to do escaping by means of numeric character refs, and give examples showing how to define lt and amp yourself.

Mnemonic or non-mnemonic, numeric character references have the obvious advantages that they work as is -- we merely have to make explicit in the spec how they work -- and if we rely on them instead of gt, lt, amp, apos, and quot, we can eliminate all the complications of predefined entities as a concept.

Tim and others are right that users will say "But I want lt and amp!" That's why we should make sure the spec shows them how to do it.

In sum:

- yes, we should lose the predefined entities.
- we should include examples showing how to define amp and lt, etc.

If developers of XML products include any kind of intelligent error recovery for undeclared entities, guessing that lt is < and amp is & is not exactly rocket science; I don't think it will be all that hard to provide users with the same level of convenience as would be provided by the predefined entities.

-C. M. Sperberg-McQueen
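As a rough illustration of the "define lt and amp yourself" proposal, here is a minimal sketch of what such declarations could look like, using numeric character references only. Several things in it are assumptions rather than anything from the draft under discussion: the double-escaped form (&#38;#60;) is the one given in the later published XML 1.0 Recommendation, the sample document and the expand() helper are hypothetical, and the check uses Python's standard xml.etree.ElementTree parser. A present-day parser already treats lt and amp as predefined, so those two declarations are redundant there; the stago entity from the message is the one that is really exercised.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The lt and amp declarations follow the
# double-escaped form from the published XML 1.0 Recommendation; stago is
# the entity name used in the message above. "&#38;#60;" leaves "&#60;" as
# the replacement text, which is then read as the data character "<".
DOC = """<!DOCTYPE doc [
  <!ENTITY lt    "&#38;#60;">
  <!ENTITY amp   "&#38;#38;">
  <!ENTITY stago "&#38;#60;">
]>
<doc>Tag a paragraph using &stago;p>. Write &amp;lt; to get &lt;.</doc>"""


def expand(document: str) -> str:
    """Parse the document and return the root element's text, entities expanded."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    print(expand(DOC))
```

Declaring stago as "&#60;" (a single level of escaping) would not do under the published rules: the replacement text would be a bare <, which is then parsed as markup when the entity is referenced.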
Received on Saturday, 22 March 1997 18:00:33 UTC