Re: (Repeat) Decision: C.4 (Predefined entities) from Murray Maloney on 1996-11-13 (w3c-sgml-wg@w3.org from November 1996)

From: Murray Maloney <murray@sq.com>
Date: Wed, 13 Nov 1996 11:39:57 -0500
To: lee@sq.com
Cc: w3c-sgml-wg@w3.org
Message-Id: <2.2.32.19961113163957.0070caf8@sq.com>
At 01:44 AM 11-11-96 EST, lee@sq.com wrote:
>>|> The list that we've adopted simply mirrors a decision of the HTML ERB.
>>|> Their decision was to include the symbol set;
>To include it in what exactly?
>

Hi Folks,

I am back with one of my very infrequent mailings to the SGML ERB.

I think that it is time that I 'fess up and try to explain 
what happened, how it happened, and why this is a "good idea"(tm).

In 1993, long before Netscape was born and even longer before
Microsoft realized that the Web might even matter, the folks at
the Santa Cruz Operation (SCO) got one of the world's first 
commercial licences to the source of NCSA Mosaic. At the time,
I was a tech pubs manager and publishing systems architect with SCO.

We did some innovative work developing the entire SCO documentation
set for use in a distributed online environment. This included
both the online documentation and the context-sensitive help.
I have demonstrated the scohelp application at SCO Forum, SGML '95,
Seybold Boston, and at a Davenport Group meeting.

Most of the added value that SCO included in scohelp was done
without compromising any of the then so-called "standards" of the Web.
That is, SCO's technical publications group made a clear and conscious
decision to avoid introducing any new or variant element types to HTML.
This decision was taken at my insistence and readily agreed to by
my associates, including Bob Stayton. A large part of my motivation was
to operate from a clear position of SGML-conformance. I felt that 
it would strengthen SCO's hand (and mine, if the truth must be known)
to maintain SCO's commitment to open standards and interoperability. 

However, we had a problem to deal with. SCO's documentation set 
includes numerous characters which are not part of ASCII or 8859-1.
We noted quickly that most of the characters that we needed were
included in the ISO character sets and also available from the 
Adobe Symbol set. So, as early as 1993, Bob Stayton made a proposal
to the then-nascent HTML Implementor's Group at its first and only
meeting in Boston. His proposal called for supplementing the list
of valid named character entities with the common set from ISO and 
the Adobe Symbol font.

The idea, quite simply, was to establish a protocol that, through
agreement and common understanding, would allow everyone to broaden
their ability to communicate over the Web. Not a bad goal, if I do 
say so myself.

Anyway, for reasons which I still question, we did end up having 
to "add a tag" to SCO's version of HTML. The new element type,
whose name was SYM or SYMBOL, was generated by conversion software
and surrounded any named character entity that was a member of
the Adobe Symbol set. Clearly this was a kludge which was needed 
because of the way that NCSA Mosaic handled font switching.
[SCO continued to author documents using a highly structured
and semantic set of troff macros, and these documents were 
batch-converted to HTML with full hypertext linking and
cross-referencing.]

Recently, I arranged for Bob Stayton to come to an HTML ERB meeting
to present his original proposal again, three years later. Bob's
presentation was brief and to the point. The HTML ERB immediately
understood that it would be useful not only to extend the set of
commonly understood names, but to adopt this set quickly because
it could be supported with very little effort. No other proposal
before the HTML IG, HTML WG or HTML ERB has ever been adopted 
as quickly or with such unanimity. I hold that this was a good thing.

Whew! Deep breath. OK, back to the arguments...

Several people have pointed out that SGML has a well-defined mechanism
to allow anyone to create a document and to declare and reference entities.
They have asked why we should want to adopt a set of entity names
that will be hardwired in perpetuity, and why we should prevent a
document creator from using any of those entity names for their own
purposes.

My answers are simple, and perhaps simple-minded, but here they are:
I think that it is all well and good that SGML provides the means
to do "whatever you want", and that is very powerful, but that is
also a great inhibitor. What I mean is that being able to do "anything"
doesn't really help you to do "something". The principal reason, in
my mind, why HTML has been such an overwhelming success is that it
provides a very limited and specific set of capabilities that anyone
can use almost immediately. The fact that I can type in &lt; and know
that I will get a "<" in the final form document enables me to spend
my time creating and managing information rather than worrying about
how I am going to ensure that my document will parse, or what the 
correct syntax is for an entity declaration, or whether the receiving
software is going to understand how to interpret either or both of
the entity declaration and the entity reference.

[And I am not arguing that XML should grandfather HTML, although
I would happily participate in such a discussion. What I am
saying is that it works. So, just because a lot of us don't like
HTML, we shouldn't, out of spite, immediately turn away from the
choices that that application has made.]

In theory, all SGML systems should be able to correctly interpret
my intention when I send a document that includes references to
named character entities. And in truth, any conforming SGML system
is capable of doing that. But saying that it is capable is not the
same as saying that it will or does do it. The truth, disappointing
to me, is that I would have to do a lot of work to ensure that my
DTD contains all of the correct declarations, or at least includes
them by reference, and even more work to ensure that I have set
up my "system" to transmit all of the relevant files to downstream
processors (read browsers and publishing systems). And even then,
I cannot depend on all of the potential downstream processors having
been set up in any way that will faithfully represent the intentions
that I imbued in my document.
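
[To make the point concrete, here is a minimal sketch of the three
cases as a downstream processor sees them. It assumes Python's
standard xml.etree.ElementTree parser as a stand-in for that
processor; the helper name try_parse and the sample documents are
hypothetical, chosen only for illustration.]

```python
import xml.etree.ElementTree as ET


def try_parse(doc):
    """Return the root element's text, or None if the parser rejects doc."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


# A predefined entity needs no declaration at all.
predefined = try_parse("<p>a &lt; b</p>")

# A named character entity with no declaration is rejected outright.
undeclared = try_parse("<p>caf&eacute;</p>")

# The same reference is accepted once the author supplies the declaration.
declared = try_parse(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
)

print(repr(predefined), repr(undeclared), repr(declared))
```

The second case is the one that bites: the reference is perfectly
reasonable, but the burden of declaring it falls on every author
and every document.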

Boy, this is a long post. But I am not done yet.

So, just because it is possible for all of the world's SGML systems
to be able to handle the arbitrary intentions imbued in documents
by their authors, or by the publishing systems architects who
saddled their authors with unintuitive and complex tools, I don't
think that that is very helpful to anyone. What I want to see is
a set of application profiles that allow me to do "something".
Having just spent much of the past nine months completing a book,
and using a variety of conforming SGML applications to get the
job done, I have to say that I am not happy with the state of the art.

And although named character entities represented only one of the 
annoying little gotchas that I encountered during that process,
they have been a constant source of annoyance to me throughout
the past ten years. 

[And, in truth, I have been using SGML tools to develop, manage 
and maintain documents throughout much of the past ten years. 
The first Author/Editor manual, of which I was the editor, was 
written using A/E and processed for publication using a variety 
of SGML and non-SGML-aware tools.]

So, I unashamedly agree with Jon's request. An XML application
should and must be capable of faithfully processing a well-defined
and agreed set of named character entities, even in the absence
of declarations for those named character entities. And while
I would prefer that this set be as comprehensive as possible,
I submit that the minimum set that is supportable with "very 
little effort" is the set which includes all of the ASCII set
(dec 32-126), all of the 8859-1 set (dec 160-255), and those
characters from the Adobe Symbol set that have corresponding names
among the ISO list of named character entities.
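
[As a rough sketch of what "very little effort" might look like,
the following expands references against a fixed, agreed table with
no declarations in the document at all. It assumes Python's
html.entities.name2codepoint table as a stand-in for the agreed set
(it carries the ISO 8859-1 names and the symbol names that HTML
later adopted); the function name expand_named_entities is
hypothetical, and leaving unknown names untouched is my assumption,
not a decision of the ERB.]

```python
import re
from html.entities import name2codepoint

# Matches a named reference such as &eacute; (numeric references are ignored).
_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_named_entities(text, table=name2codepoint):
    """Replace every reference whose name is in the agreed table.

    References to names outside the table are left exactly as written,
    so a later processing stage can still report or resolve them.
    """
    def replace(match):
        codepoint = table.get(match.group(1))
        if codepoint is None:
            return match.group(0)
        return chr(codepoint)

    return _REFERENCE.sub(replace, text)


sample = "&lt;p&gt; caf&eacute; &alpha; &nosuchname;"
print(expand_named_entities(sample))
```

The whole mechanism is one lookup table and one substitution pass,
which is why the set could be supported so cheaply.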

Thanks for your patient reading of this lengthy posting,

Murray Maloney
SoftQuad Inc.
Received on Wednesday, 13 November 1996 11:39:18 UTC