- From: Murray Maloney <murray@sq.com>
- Date: Wed, 13 Nov 1996 11:39:57 -0500
- To: lee@sq.com
- Cc: w3c-sgml-wg@w3.org
At 01:44 AM 11-11-96 EST, lee@sq.com wrote:
>>|> The list that we've adopted simply mirrors a decision of the HTML ERB.
>>|> Their decision was to include the symbol set;
>To include it in what exactly?
>

Hi Folks,

I am back with one of my very infrequent mailings to the SGML ERB. I think that it is time that I 'fess up and try to explain what happened, how it happened, and why this is a "good idea"(tm).

In 1993, long before Netscape was born and even longer before Microsoft realized that the Web might even matter, the folks at the Santa Cruz Operation (SCO) got one of the world's first commercial licences to the source of NCSA Mosaic. At the time, I was a tech pubs manager and publishing systems architect with SCO. We did some innovative work developing the entire SCO documentation set for use in a distributed online environment. This included both the online documentation and the context-sensitive help. I have demonstrated the scohelp application at SCO Forum, SGML '95, Seybold Boston, and at a Davenport Group meeting.

Most of the added value that SCO included in scohelp was done without compromising any of the then so-called "standards" of the Web. That is, SCO's technical publications group made a clear and conscious decision to avoid introducing any new or variant element types to HTML. This decision was taken at my insistence and readily agreed to by my associates, including Bob Stayton. A large part of my motivation was to operate from a clear position of SGML-conformance. I felt that it would strengthen SCO's hand (and mine, if the truth must be known) to maintain SCO's commitment to open standards and interoperability.

However, we had a problem to deal with. SCO's documentation set includes numerous characters which are not part of ASCII or 8859-1. We noted quickly that most of the characters that we needed were included in the ISO character sets and also available from the Adobe Symbol set.
So, as early as 1993, Bob Stayton made a proposal to the then-nascent HTML Implementor's Group at its first and only meeting in Boston. His proposal called for supplementing the list of valid named character entities with the common set from ISO and the Adobe Symbol font. The idea, quite simply, was to establish a protocol that, through agreement and common understanding, would allow everyone to broaden their ability to communicate over the Web. Not a bad goal, if I do say so myself.

Anyway, for reasons which I still question, we did end up having to "add a tag" to SCO's version of HTML. The new element type, whose name was SYM or SYMBOL, was generated by conversion software and surrounded any named character entity that was a member of the Adobe Symbol set. Clearly this was a kludge, needed because of the way that NCSA Mosaic handled font switching. [SCO continued to author documents using a highly structured and semantic set of troff macros, and these documents were batch-converted to HTML with full hypertext linking and cross-referencing.]

Recently, I arranged for Bob Stayton to come to an HTML ERB meeting to present his original proposal again, three years later. Bob's presentation was brief and to the point. The HTML ERB immediately understood that it would be useful not only to extend the set of commonly understood names, but to adopt this set quickly because it could be supported with very little effort. No other proposal before the HTML IG, HTML WG or HTML ERB has ever been adopted as quickly or with such unanimity. I hold that this was a good thing.

Whew! Deep breath. OK, back to the arguments...

Several people have pointed out that SGML has a well-defined mechanism to allow anyone to create a document, and to declare and reference entities. They have asked why we should want to adopt a set of entity names that will be hardwired in perpetuity, and why we should prevent a document creator from using any of those entity names for their own purposes.
My answers are simple, and perhaps simple-minded, but here they are:

I think that it is all well and good that SGML provides the means to do "whatever you want", and that is very powerful, but that is also a great inhibitor. What I mean is that being able to do "anything" doesn't really help you to do "something". The principal reason, in my mind, why HTML has been such an overwhelming success is that it provides a very limited and specific set of capabilities that anyone can use almost immediately.

The fact that I can type in &lt; and know that I will get a "<" in the final form document enables me to spend my time creating and managing information rather than worrying about how I am going to ensure that my document will parse, or what the correct syntax is for an entity declaration, or whether the receiving software is going to understand how to interpret either or both of the entity declaration and the entity reference.

[And I am not arguing that XML should grandfather HTML, although I would happily participate in such a discussion. What I am saying is that it works. So, just because a lot of us don't like HTML, we shouldn't turn away, out of spite, from the choices that that application has made.]

In theory, all SGML systems should be able to correctly interpret my intention when I send a document that includes references to named character entities. And in truth, any conforming SGML system is capable of doing that. But saying that it is capable is not the same as saying that it will or does do it. The truth, disappointing to me, is that I would have to do a lot of work to ensure that my DTD contains all of the correct declarations, or at least includes them by reference, and even more work to ensure that I have set up my "system" to transmit all of the relevant files to downstream processors (read browsers and publishing systems).
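To make the declaration burden concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser as a stand-in for a downstream processor. It is an illustration only, not anything from the proposals discussed here; the sample documents and the helper name `parse_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same content twice: once with an inline declaration for "eacute",
# once relying on the receiving parser to already know the name.
DECLARED = '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute; &lt; bar</p>'
UNDECLARED = '<p>caf&eacute; &lt; bar</p>'


def parse_text(document):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(repr(parse_text(DECLARED)))
    print(repr(parse_text(UNDECLARED)))
```

With this parser, the declared document yields its text, while the undeclared one is rejected outright because `eacute` is unknown to it; `&lt;` needs no declaration in either case.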
And even then, I cannot depend on all of the potential downstream processors having been set up in any way that will faithfully represent the intentions that I imbued in my document.

Boy, this is a long post. But I am not done yet.

So, just because it is possible for all of the world's SGML systems to be able to handle the arbitrary intentions imbued in documents by their authors, or by the publishing systems architects who saddled their authors with unintuitive and complex tools, I don't think that that is very helpful to anyone. What I want to see is a set of application profiles that allow me to do "something".

Having just spent much of the past nine months completing a book, and using a variety of conforming SGML applications to get the job done, I have to say that I am not happy with the state of the art. And although named character entities represented only one of the annoying little gotchas that I encountered during that process, they have been a constant source of annoyance to me throughout the past ten years.

[And, in truth, I have been using SGML tools to develop, manage and maintain documents throughout much of the past ten years. The first Author/Editor manual, of which I was the editor, was written using A/E and processed for publication using a variety of SGML and non-SGML-aware tools.]

So, I unashamedly agree with Jon's request. An XML application should and must be capable of faithfully processing a well-defined and agreed set of named character entities, even in the absence of declarations for those named character entities. And while I would prefer that this set be as comprehensive as possible, I submit that the minimum set that is supportable with "very little effort" is the set which includes all of the ASCII set (dec 32-126), all of the 8859-1 set (dec 160-255), and those characters from the Adobe Symbol set that have corresponding names among the ISO list of named character entities.
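As a rough illustration of processing named character entities "even in the absence of declarations", the sketch below expands references against a table that ships with the processor. It assumes Python's `html.entities.name2codepoint` (the HTML 4 entity table) as a stand-in for the agreed set, which is not necessarily the set proposed here; `ENTITY_REF` and `expand_entities` are hypothetical names.

```python
import re
from html.entities import name2codepoint  # HTML 4 names; stand-in for an agreed set

# A named entity reference: ampersand, a name, then a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text, table=name2codepoint):
    """Expand &name; when the name is in the built-in table; leave other references as-is."""
    def substitute(match):
        codepoint = table.get(match.group(1))
        if codepoint is None:
            return match.group(0)  # unknown name: pass through untouched
        return chr(codepoint)

    return ENTITY_REF.sub(substitute, text)


if __name__ == "__main__":
    print(expand_entities("caf&eacute; &lt; &alpha; &nosuchname;"))
```

The document itself carries no declarations; everything the processor needs is in the table it was built with, which is the "very little effort" case: one lookup per reference.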
Thanks for your patient reading of this lengthy posting,

Murray Maloney
SoftQuad Inc.
Received on Wednesday, 13 November 1996 11:39:18 UTC