Re: those predeclared entity refs from Michael Sperberg-McQueen on 1997-03-22 (w3c-sgml-wg@w3.org from March 1997)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Sat, 22 Mar 97 16:23:14 CST
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199703222300.SAA24379@www10.w3.org>

Having just read through the entire thread about the predefined
entities, several things seem clear to me, though the correct
solution to be built on that clarity is not quite so clear.

- predefining lt, amp, gt, quot, and apos is problematic primarily
because there is nothing in SGML that works quite this way.  A
conforming SGML system is going to want to see entity declarations for
these characters, while an XML processor is going to flag such
declarations as errors.

- named character references in SGML do work almost this way (Lee
suggested this), but not quite:  while numeric character references do
cause the character in question to be treated as data rather than a
delimiter, named character references do not.  If we had a charset
declaration that defined LT or STAGO as a function, the string &#LT;gi>
would be legal.  But it would not work the way &lt;gi> does.  It would
work the way <gi> does.  (Of course, I'm writing this from home, without
the Handbook handy, so I may have gotten this turned around Yet Again.)

- there are some serious problems in the phrasing of the current draft
spec.  It can indeed be read to mean you *must* use &lt; to escape <, so
the following will not work:

  <!ENTITY stago '<' >
  ...
  Tag a paragraph using &stago;p>.

The < in the entity text of 'stago' is not a delimiter, and thus should
not appear in literal form.  Various posters are correct to say the
spec should probably not say this; various other posters are correct
to say that the spec does however say this.

The following, though, might be considered defensible (I'd defend it),
on the grounds that the spec doesn't forbid it, and < is properly
escaped where it does occur:

  <!ENTITY stago '&lt;' >
  ...
  Tag a paragraph using &stago;p>.

- The current draft spec successfully avoids having to get into the
theory and practice of the entity-end signal (that magic non-character
which appears in the character stream and ensures that the string &lt;p>,
which expands to '<', EE, 'p', '>', won't be read as a tag, since
< is not followed by a name-start character).  It's not clear to me how
to generalize our current treatment without having to reintroduce EE.

- Numeric character references have none of the problems identified with
lt, amp, etc.  Their only problem is that they are not much fun to
memorize, since most of us find letters and abbreviations in some
language we speak more mnemonic than numbers.  (Ramanujan, Gauss, and
others appear to have been exceptions, I admit it.  But though they may
be smarter than the rest of us, they don't outnumber us.)
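
For concreteness, a minimal sketch of the numeric route.  The numbers
are the decimal ASCII / ISO 646 codes for the five characters under
discussion; the second sample sentence is only an illustration and
does not come from the thread:

  &#60;   <          &#38;   &
  &#62;   >          &#34;   "
  &#39;   '

  Tag a paragraph using &#60;p>.
  Send the invoice to Smith &#38; Sons.

In both sentences the referenced character arrives as data, so neither
a tag nor an entity reference is recognized.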


Thinking it over, I think John Lavagnino had the best idea:  require
escaping, show how to do escaping by means of numeric character refs,
and give examples showing how to define lt and amp yourself.

Mnemonic or non-mnemonic, numeric character references have the
obvious advantages that they work as is -- we merely have to make
explicit in the spec how they work -- and if we rely on them instead
of gt, lt, amp, apos, and quot, we can eliminate all the complications
of predefined entities as a concept.

Tim and others are right that users will say "But I want lt and amp!"
That's why we should make sure the spec shows them how to do it.

In sum:

  - yes, we should lose the predefined entities.
  - we should include examples showing how to define amp and lt, etc.
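
A sketch of what such example declarations might look like.  It
assumes that character references in an entity value are expanded when
the declaration is read, and that the resulting text is scanned again
when the entity is referenced; that is why lt and amp are escaped
twice.  (These are the forms later given in the XML 1.0
Recommendation; the current draft may need different wording to make
them work.)

  <!ENTITY lt    "&#38;#60;">
  <!ENTITY gt    "&#62;">
  <!ENTITY amp   "&#38;#38;">
  <!ENTITY apos  "&#39;">
  <!ENTITY quot  "&#34;">

Under those assumptions, &lt;p> delivers the data characters <p>
rather than a start-tag.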

If developers of XML products include any kind of intelligent
error recovery for undeclared entities, guessing that lt is < and amp
is & is not exactly rocket science; I don't think it will be all that
hard to provide users with the same level of convenience as would
be provided by the predefined entities.

-C. M. Sperberg-McQueen
Received on Saturday, 22 March 1997 18:00:33 UTC