Re: those predeclared entity refs from Jon Bosak on 1997-03-13 (w3c-sgml-wg@w3.org from March 1997)

From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
Date: Thu, 13 Mar 1997 08:56:19 -0800
To: w3c-sgml-wg@w3.org
CC: bosak@atlantic-83.Eng.Sun.COM
Message-Id: <199703131656.IAA07109@boethius.eng.sun.com>

Paul Grosso's question about the five predeclared character entities
suddenly brought into focus something that has been bothering me since
November but hadn't until that moment come to the front of
consciousness.

Back in November, we were just about to include nearly 200 predefined
character entities to bring XML into alignment with HTML 4.0 when
Anders Berglund pointed out that the entity names for Greek scientific
characters were hopelessly mixed up with the names for the other Greek
characters, and we cut back to the five that are needed to escape just
those characters that have uses in markup.  We rationalized this by
saying that all the other characters could be referred to numerically
if necessary, and that in a Unicode context any scheme that conferred
special status on the couple of hundred characters that Europeans
happened to find especially useful was suspect anyway.

What I now realize is that the same logic applies to the remaining
five as well.

Assuming for a moment the most likely syntax for hex character
references, we will have no fewer than three different ways of
referring to the five syntactically meaningful characters in the
absence of any user definitions:

   QUOTATION MARK          &quot;  &#34;   &#x22;
   AMPERSAND               &amp;   &#38;   &#x26;
   APOSTROPHE              &apos;  &#39;   &#x27;
   LESS-THAN SIGN          &lt;    &#60;   &#x3c;
   GREATER-THAN SIGN       &gt;    &#62;   &#x3e;
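
To make the redundancy in the table concrete, here is a minimal sketch
that decodes all three forms for each character and checks that they
agree. It assumes Python's standard html.unescape as a stand-in
decoder (it follows HTML rules, not any XML draft), and the names REFS
and decode_all are hypothetical, introduced only for this illustration.

```python
import html

# Rows copied from the table above:
# (character name, named form, decimal form, hex form)
REFS = [
    ("QUOTATION MARK", "&quot;", "&#34;", "&#x22;"),
    ("AMPERSAND", "&amp;", "&#38;", "&#x26;"),
    ("APOSTROPHE", "&apos;", "&#39;", "&#x27;"),
    ("LESS-THAN SIGN", "&lt;", "&#60;", "&#x3c;"),
    ("GREATER-THAN SIGN", "&gt;", "&#62;", "&#x3e;"),
]


def decode_all(row):
    """Return the set of characters that the three reference forms decode to."""
    _name, *forms = row
    # html.unescape is an assumption: an HTML-flavoured decoder used as a proxy.
    return {html.unescape(form) for form in forms}


for row in REFS:
    decoded = decode_all(row)
    # A single-element set means all three spellings name the same character.
    print(row[0], "->", decoded)
```

Each set printed should contain exactly one character, which is the
point: the named form adds nothing that the numeric forms cannot
already express.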

We don't need three different ways to do this.  I propose that we
eliminate all problems relating to the predefined character entities
by simply not having any.

Here are the possible counterarguments that I can think of.

Objection 1:

  People are used to using &lt, &gt, etc.

Response:

  a. They are also used to using all the other ones predefined in HTML but
     not predefined in XML.

  b. They are used to using them without trailing semicolons under some
     circumstances; we don't allow that.

Objection 2:

  The names quot, amp, apos, lt, and gt are significantly shorter than
  the corresponding numeric forms.  Oops, no they're not.  Forget that
  one.
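
  The length claim is easy to check by counting characters. The sketch
  below tallies each named form against its decimal equivalent; the
  names PAIRS and length_difference are hypothetical, chosen only for
  this illustration.

```python
# Named and decimal forms for the five characters, including the
# leading "&" and the trailing ";".
PAIRS = {
    "&quot;": "&#34;",
    "&amp;": "&#38;",
    "&apos;": "&#39;",
    "&lt;": "&#60;",
    "&gt;": "&#62;",
}


def length_difference(named, numeric):
    """Positive when the named form is longer than the numeric form."""
    return len(named) - len(numeric)


for named, numeric in PAIRS.items():
    print(named, numeric, length_difference(named, numeric))

# Total characters needed to write all five, in each style.
total_named = sum(len(n) for n in PAIRS)
total_numeric = sum(len(d) for d in PAIRS.values())
print(total_named, total_numeric)
```

  Two of the names are one character shorter than the decimal form,
  two are one character longer, and one is the same length, so the
  totals come out even.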

Objection 3:

  The names quot, amp, apos, lt, and gt are significantly easier to
  remember than the corresponding numeric forms.

Response:

  Easier for people whose native language is other than English,
  i.e., most of humanity?  I don't think so.

In return for not predefining character entities we get the following
benefits:

1. Language neutrality.

2. Cleaner specification.

3. Zero incursion into the user's name space for character entities.

4. Elimination of nitty objections to the names apos and quot (which
   we haven't talked about much but are out there).

5. Elimination of all other issues relating to predefined character
   entities (such as the one that Paul raised).

Jon

Received on Thursday, 13 March 1997 11:56:27 UTC