Re: (Repeat) Decision: C.4 (Predefined entities) from lee@sq.com on 1996-11-10 (w3c-sgml-wg@w3.org from November 1996)

From: <lee@sq.com>
Date: Sun, 10 Nov 96 17:12:29 EST
To: w3c-sgml-wg@w3.org
Message-Id: <9611102212.AA16299@sqrex.sq.com>
Thanks for replying.

> | > 	&ldquo;, &rdquo; (left double quote and right double quote)
> | > 	&mdash;, &ndash;
> | But I agree that it would be good to include them.
> 
> The list that we've adopted simply mirrors a decision of the HTML ERB.
> Their decision was to include the symbol set; the characters that Paul
> mentions aren't in it.  No argument about relative usefulness here,
> just consistency of W3C recommendations.

Consistency with HTML?

> | I don't understand the point of being so careful about prefixing all
> | XML names with -XML- earlier, and now suddenly deciding to add several
> | screens' worth of fixed keywords that are not so prefixed. [...]

> All we're doing is recognizing usage in as conservative a way as
> possible.  To millions of people, these entity names have become part
> of the language.

The names from the Symbol font are in use by millions of people?
I don't believe this.  Actually, apart from the accented characters
in ISO 8859-1, I don't accept that any of those entity names are being
typed knowingly by that many people, and since many of these
entities don't work in today's browsers, I don't buy this at all.

Have you tried them in Netscape 2.02?  Netscape 3?  MSIE 3?

> Anyone who uses &omega; at this point to mean
> anything other than GREEK SMALL LETTER OMEGA is either being
> obstinately perverse or just ignorant.

Possibly, but since it doesn't mean that in HTML 2.0, nor in the
implementations in use, and since in SGML you can use it for whatever you
like, it would not be unreasonable for someone to use it for the
engineering OHM symbol, for example.  The SGML documentation for the
"omega" game might well use it for the name of that game, and I
would not call those people "obstinately perverse" nor "ignorant".

And taking a name like "prod" seems especially hard to defend.

> All we can do is to
> make the most common names part of the standard so that we are at
> least doing our best to eliminate the latter possibility.

If you allow them to be overridden in documents, then it is no longer
a problem if people have existing material using them.
If you have a fixed list, it would be sensible to make it smaller.

If HTML is to be extended to include these entities, it would not
seem unreasonable to require some sort of "import" statement in
the documents... and of course, SGML has a syntax for that already.

I am actually less concerned about backwards compatibility with other
standards in this matter than I am with the principle of least surprise.

Consider the following tiny XML document:

    <?XML charset "iso8859-1">
    <!DOCTYPE fred [
	<!Element fred (#PCDATA|title|p)*>
	<!Element title (#PCDATA)*>
	<!Element p (#PCDATA)*>

	<!Entity book "Reference Manual">
	<!Entity prod "Airline Pilot 3.1 for Windows">
	<!Entity % SGML 'IGNORE'>
	<![ %SGML; [
	<!Entity % SDATA "SDATA">
	]]>
	<!Entity % SDATA ''>

	<!Entity ndash %SDATA; "-">
    ]>
    <fred><title>&prod; &ndash; &book;</title>
    <p>Thank you for purchasing &prod;.
    This &book; will take you through the steps of learning to fly
    your new aeroplane.</p></fred>

In this example, an SGML parser will produce a different ESIS from an
XML parser.  The XML document will have the Greek letter PI in it
for a start, whereas the SGML document will include the product name.
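
To make the divergence concrete, here is a rough sketch in Python.  It is
a toy expander, not a real SGML or XML parser.  PREDEFINED is a
hypothetical stand-in for the proposed fixed list (two entries only), and
the predefined_wins flag is an invented switch between the SGML behaviour,
where the document's own declaration is used, and the proposed XML
behaviour, where the built-in name takes precedence.

```python
import re

# Hypothetical stand-in for the proposed fixed entity list (two entries only).
# U+220F is the product sign, which looks like a capital pi.
PREDEFINED = {"prod": "\u220f", "omega": "\u03c9"}

# General entities declared in the example document's internal subset.
DECLARED = {"book": "Reference Manual", "prod": "Airline Pilot 3.1 for Windows"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, declared, predefined_wins):
    """Replace &name; references; unknown names are left untouched."""
    def lookup(match):
        name = match.group(1)
        if predefined_wins and name in PREDEFINED:
            return PREDEFINED[name]
        return declared.get(name, match.group(0))
    return ENTITY_REF.sub(lookup, text)


sentence = "Thank you for purchasing &prod;."
sgml_view = expand(sentence, DECLARED, predefined_wins=False)
xml_view = expand(sentence, DECLARED, predefined_wins=True)
```

Under these assumptions sgml_view carries the product name, while
xml_view carries the product sign instead.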

Now, I for one don't want to get involved in technical support
at this point!

If it is not possible for an XML DTD to define general entities at all,
then this problem goes away, and conversion from SGML to XML will
obviously involve entity expansion.  One could perhaps use David
Megginson's SGMLS.pm perl package to do this, and maybe James might
make a version of SPAM that did it (CUSTARD?).

We've had a sequence of what to me are very odd decisions lately,
ranging from a fixed list of elements declared EMPTY if through some
unspecified heuristic the current document is determined to be HTML
(it can't be by the published formal definition, as that's illegal
in XML, unless that's simply an ERB oversight), through to some
strange and SGML-incompatible entities creeping in.

I'm worried about this.  Should I be?  Is XML going to be less useful
for its stated purpose (using SGML over the Internet) as a result?
I think so.

Is it going to be harder to implement?
Yes, because now if you have a second definition for an entity, you
need to know who defined it in order to determine whether
to override it or not.  User-defined entities can be overridden
provided that their names don't conflict with system-defined ones,
but system-defined entities cannot be overridden.

Note that this has serious implications for document lifetimes:
if XML is revised to include more entities, you WILL break documents.
The lack of a required XML Version string at the start ensures this.
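
The bookkeeping described above, and the way a longer fixed list changes
an existing document, can be sketched roughly as follows in Python.
Everything here is illustrative: EntityTable, the status strings, and
the two lists LIST_V1 and LIST_V2 are invented for this example and do not
come from any draft.  The sketch assumes that a duplicate user declaration
is simply ignored and that a clash with a system name is refused.

```python
class EntityTable:
    """Toy entity table that remembers who defined each name."""

    def __init__(self, system_defined):
        # name -> (replacement text, origin), origin is "system" or "user"
        self._entries = {n: (v, "system") for n, v in system_defined.items()}

    def declare(self, name, value):
        """Try to add a user declaration and report what happened."""
        if name not in self._entries:
            self._entries[name] = (value, "user")
            return "declared"
        origin = self._entries[name][1]
        # The origin decides the outcome: a clash with a system name is a
        # conflict, a clash with a user name is an ordinary duplicate.
        return "rejected-system" if origin == "system" else "duplicate-user"

    def value(self, name):
        return self._entries[name][0]


# Hypothetical fixed lists for two revisions of the language.
LIST_V1 = {"omega": "\u03c9"}
LIST_V2 = {"omega": "\u03c9", "prod": "\u220f"}


def product_name(system_defined):
    """Resolve &prod; for a document that declares its own 'prod'."""
    table = EntityTable(system_defined)
    table.declare("prod", "Airline Pilot 3.1 for Windows")
    return table.value("prod")
```

With LIST_V1 the document's own declaration of prod is used; with LIST_V2
the same unchanged document resolves prod to the product sign, and nothing
in the document says which list it was written against.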

People objected to my proposal that all XML documents start with
<XML REV="1.0"....>
saying that it was namespace pollution.  But the same people seem
happy to add over 100 names without seeming to consider the
consequences very carefully.

So when you say:
> No argument about relative usefulness here,
> just consistency of W3C recommendations.

what I in fact see is total arbitrariness.  What am I missing?

Lee
Received on Sunday, 10 November 1996 17:12:26 UTC