- From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
- Date: Thu, 13 Mar 1997 08:56:19 -0800
- To: w3c-sgml-wg@w3.org
- CC: bosak@atlantic-83.Eng.Sun.COM
Paul Grosso's question about the five predeclared character entities suddenly brought into focus something that has been bothering me since November but hadn't until that moment come to the front of consciousness.

Back in November, we were just about to include nearly 200 predefined character entities to bring XML into alignment with HTML 4.0 when Anders Berglund pointed out that the entity names for Greek scientific characters were hopelessly mixed up with the names for the other Greek characters, and we cut back to the five that are needed to escape just those characters that have uses in markup. We rationalized this by saying that all the other characters could be referred to numerically if necessary, and that in a Unicode context any scheme that conferred special status on the couple of hundred characters that Europeans happened to find especially useful was suspect anyway.

What I now realize is that the same logic applies to the remaining five as well. Assuming for a moment the most likely syntax for hex character references, we will have no less than three different ways of referring to the five syntactically meaningful characters in the absence of any user definitions:

   QUOTATION MARK      &quot;   &#34;   &#x22;
   AMPERSAND           &amp;    &#38;   &#x26;
   APOSTROPHE          &apos;   &#39;   &#x27;
   LESS-THAN SIGN      &lt;     &#60;   &#x3C;
   GREATER-THAN SIGN   &gt;     &#62;   &#x3E;

We don't need three different ways to do this.

I propose that we eliminate all problems relating to the predefined character entities by simply not having any.

Here are the possible counterarguments that I can think of.

Objection 1: People are used to using &lt;, &gt;, etc.

Response:

   a. They are also used to using all the other ones predefined in HTML but not predefined in XML.

   b. They are used to using them without trailing semicolons under some circumstances; we don't allow that.

Objection 2: The names quot, amp, apos, lt, and gt are significantly shorter than the corresponding numeric forms.

Oops, no they're not. Forget that one.
Objection 3: The names quot, amp, apos, lt, and gt are significantly easier to remember than the corresponding numeric forms.

Response: Easier for people whose native language is other than English, i.e., most of humanity? I don't think so.

In return for not predefining character entities we get the following benefits:

   1. Language neutrality.

   2. Cleaner specification.

   3. Zero incursion into the user's name space for character entities.

   4. Elimination of nitty objections to the names apos and quot (which we haven't talked about much but are out there).

   5. Elimination of all other issues relating to predefined character entities (such as the one that Paul raised).

Jon
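
The point about three equivalent ways of referring to each of the five characters can be checked with a short sketch. This is an illustration only, not part of the proposal: it assumes the hex syntax takes the form &#xHH; (the message calls this only "the most likely syntax"), and it relies on Python's standard xml.etree.ElementTree parser, which happens to predefine the five names under discussion. The names REFERENCES, resolve, and distinct_expansions are made up for this example.

```python
import xml.etree.ElementTree as ET

# (character name, named reference, decimal reference, hex reference)
# The hex column assumes the "&#xHH;" form; the message only calls it "most likely".
REFERENCES = [
    ("QUOTATION MARK", "&quot;", "&#34;", "&#x22;"),
    ("AMPERSAND", "&amp;", "&#38;", "&#x26;"),
    ("APOSTROPHE", "&apos;", "&#39;", "&#x27;"),
    ("LESS-THAN SIGN", "&lt;", "&#60;", "&#x3C;"),
    ("GREATER-THAN SIGN", "&gt;", "&#62;", "&#x3E;"),
]


def resolve(reference: str) -> str:
    """Parse a one-element document and return the text the reference expands to."""
    return ET.fromstring(f"<r>{reference}</r>").text


def distinct_expansions() -> dict:
    """Map each character name to the set of characters its three references yield."""
    return {
        name: {resolve(named), resolve(decimal), resolve(hexadecimal)}
        for name, named, decimal, hexadecimal in REFERENCES
    }


if __name__ == "__main__":
    for name, chars in distinct_expansions().items():
        # A set of size one means all three spellings expand to the same character.
        print(f"{name}: {len(chars)} distinct character(s)")
```

If every set has exactly one member, the named form adds nothing that the two numeric forms do not already provide, which is the redundancy the message objects to.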
Received on Thursday, 13 March 1997 11:56:27 UTC