Re: Make HTML a real SGML application

Daniel W. Connolly
Tue, 30 Jul 1996 14:41:01 -0400

Subject: Re: Make HTML a real SGML application 
In-reply-to: Your message of "Tue, 30 Jul 1996 09:29:39 MST."
Date: Tue, 30 Jul 1996 14:41:01 -0400
From: "Daniel W. Connolly" <>

In message <199607301629.JAA05748@athena>, Mary Holstege writes:
>I don't understand the terrible resistance to allowing (encouraging) HTML files
>to contain SGML prologues and using the power implied by the existence of
>that to achieve useful results.  Most of my serious HTML *already*
>has a <!DOCTYPE section; I just have to run everything through SPAM
>before I put it out.  The standard HTML DTD can contain some of the popular
>notations; if you want to do anything funky, you have to embed some funky
>syntax.  OK.  And the problem with this is...?

Nothing. You're free to use SPAM at your site to preprocess your
documents, just like other folks use cpp or m4 or perl or..., and some
other folks use server-side includes features built into a server.

But you seem to be suggesting that the syntax of HTML as transferred
over the wire should be allowed to include an internal declaration
subset.  You'll have to be more explicit for me to evaluate your

>Why is a concept that comes from SGML always presumed "too hard" but some
>random half-baked hack considered "easy enough for the masses"?

Easy: free access to the documentation and source code. The cost of
just getting the documentation for SGML is about $100. That's MUCH
harder than clicking on a link to NCSA's server-side-includes

I have little sympathy for folks who don't do their homework
(i.e. folks who don't read the SGML materials that _are_ available on
the web, which see: but
I have even less sympathy for folks who create something as obtuse
and contrived as SGML, and then hoard access to it.

>Why is "<!--#include" easy enough for the masses to understand, but "<!ENTITY
>foo SYSTEM" is too hard?

First, note that <!--#include is _not_ HTML syntax: it's NCSA httpd
server-side-includes syntax, which has since been supported by lots of
other stuff.

And I don't think mass understanding had anything to do with it: it's
a simple case of <!--#include is supported by widely available tools,
and <!ENTITY is not.

I wish it were the other way around. In fact, Eliot Kimber
demonstrated on comp.text.sgml how to set up a CGI script to process
<!ENTITY stuff using spam. But it seems to be a day late and a dollar
short.

Folks who want to change the landscape are encouraged to hack!
For example, take sp (the backbone of stuff like spam) and hack
it into an Apache module, write some documentation with examples,
and see if it takes off.

>  Why is long distance naming in "<A NAME=foo>...<A
>HREF="#foo">" easy enough for the masses to master but that in 
>"<!ENTITY foo...>...&foo;" too hard?

Hang on: the choice is between:

	<a href="">

and

	<!entity foo system "" NDATA>
	<a href=foo>

In this case, the object in question has a perfectly good name: The name foo serves no purpose but to introduce errors
etc. (If there were several references to in the document, the
foo might serve as a shorthand, and that might be valuable. But it's
not valuable enough to complicate the simple case.)

The question regarding "<!ENTITY foo...>...&foo;" is simple: to do it
or not to do it (in the client). So far, none of the implementors has
seen enough benefit to justify the cost. Given that it can be done
on the server side (and often more efficiently), I tend to agree.
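To make "done on the server side" concrete, here is a rough sketch of
a preprocessing step, written in Python purely for illustration. It is
not how sp or spam work. The function name expand_entities is
hypothetical, and the sketch assumes only double-quoted internal
general entities declared in the document's own declaration subset; it
ignores SYSTEM entities, parameter entities, and nested references.

```python
import re

# <!ENTITY name "replacement text"> with a double-quoted literal only.
# Assumption: no SYSTEM/PUBLIC entities and no parameter entities.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+([A-Za-z][A-Za-z0-9.-]*)\s+"([^"]*)"\s*>', re.IGNORECASE
)
ENTITY_REF = re.compile(r'&([A-Za-z][A-Za-z0-9.-]*);')
# The "[ ... ]" internal declaration subset inside the DOCTYPE.
INTERNAL_SUBSET = re.compile(r'(<!DOCTYPE[^\[>]*)\[.*?\]\s*>', re.DOTALL)


def expand_entities(document: str) -> str:
    """Expand locally declared entities and drop the internal subset,
    so that what goes over the wire is plain HTML."""
    entities = dict(ENTITY_DECL.findall(document))
    body = INTERNAL_SUBSET.sub(
        lambda m: m.group(1).rstrip() + '>', document, count=1
    )
    # Unknown references (e.g. &amp;) are left for the client to handle.
    return ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), body
    )


if __name__ == "__main__":
    source = (
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN" [\n'
        '<!ENTITY sig "Example Author">\n'
        ']>\n'
        '<p>&sig; &amp; co.</p>'
    )
    print(expand_entities(source))
```

The client never sees the declaration subset or the &sig; reference,
which is the whole point: the cost is paid once, at the site that
wants the feature.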

I don't like <!ENTITY...> as a mechanism for doing compound documents.
I like typed links much better. It's like the difference between
Python/Perl/Java-style import vs. C/C++ #include: one's a text pasting
exercise, and the other is a structural construct.

>  Why is "// <!-- ... // -->" easy enough 
>for the masses to understand but "<![ CDATA [...]]>" too hard?

I can't begin to defend the //<!-- script syntax. But given the
state of affairs, how would you convince information providers to
begin to use <![ CDATA [ ... ]]> when it won't work on "70%"
of their consumers' desktops, while //<!-- will?

>   Why do we have
>to put up with people inventing "<!--XXX IFDEF FOO-->...<!--XXX ENDIF-->" but
>refusing to encourage "<![ %FOO; [ ...]]>" which does the job just as well, 
>and can be processed by standard tools?

You don't have to "put up with" anything. But you don't have to whine
either. Write some code. Write a draft.

See, for example:

Note that SGML marked sections can express #if/#endif nicely,
but #elsif is very awkward.
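As a rough illustration of the #if/#endif case, here is a Python
sketch of a preprocessor that resolves marked sections of the form
<![ %FOO; [ ... ]]>. The function name resolve_marked_sections and the
switches dictionary are hypothetical stand-ins for parameter entity
declarations. The sketch assumes sections are not nested, and it
treats an undeclared switch as IGNORE, which a real SGML parser would
report as an error instead.

```python
import re

# <![ %name; [ content ]]> with no nesting (an assumption of this sketch).
MARKED_SECTION = re.compile(
    r'<!\[\s*%([A-Za-z][A-Za-z0-9.-]*);\s*\[(.*?)\]\]>', re.DOTALL
)


def resolve_marked_sections(text: str, switches: dict) -> str:
    """Keep the content of sections whose switch is INCLUDE; drop the rest."""
    def replace(match):
        status = switches.get(match.group(1), "IGNORE").upper()
        return match.group(2) if status == "INCLUDE" else ""
    return MARKED_SECTION.sub(replace, text)


if __name__ == "__main__":
    page = 'A<![ %draft; [ DRAFT ]]>B<![ %final; [ FINAL ]]>C'
    print(resolve_marked_sections(page, {"draft": "IGNORE", "final": "INCLUDE"}))
```

Note that the "else" branch needs its own switch (final here), and
keeping the two mutually exclusive is up to whoever sets them. That is
the awkwardness: each extra branch means another entity to keep
consistent by hand.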

>I think it's time to fish or cut bait: if HTML is to be an SGML application, 
>use the features of SGML that are required to make it workable.

Why the hypothetical? HTML is an SGML application. Check RFC1866.

And there is overwhelming evidence that it is "workable." I think
AltaVista advertises some 30 million pages.

>  There is
>much I would have changed about SGML if I had been its inventor, but the
>fact is that it is here,


> it has solutions to a lot of these problems,

This has been alleged over and over, but the conjecture is rarely
backed by sound arguments, code, specs, etc.

> and
>if HTML is an SGML application a lot of nice tools can be used to handle it.
>Tracking changes from version to version of HTML with these tools becomes a
>matter of dropping in a new DTD instead of hacking up the tool to understand
>the significance of some new semantics embedded in comments or some special
>handling required for the FOOBAR element.

This is just FUD. The change from HTML 2.0 to 3.2 to cougar is
"just dropping in a new DTD". There's nothing "special" about
the script/comment syntax as far as SGML is concerned:
	<script><!-- script --></script>
is just an element with content "<!-- script -->". Clearly, in
order to interpret the content of the script element, you have
to understand the script language syntax. And JavaScript happens
to define <!-- as a comment.
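A quick way to see the "just an element with content" point is to run
the example through a stock parser. The sketch below uses Python's
html.parser, which is an HTML tokenizer and not an SGML parser, so
take it as an analogy only; the class and function names are
hypothetical. It collects whatever the parser reports as character
data inside script, and separately records anything it reports as a
markup comment.

```python
from html.parser import HTMLParser


class ScriptContentCollector(HTMLParser):
    """Collect character data found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chunks = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chunks.append(data)

    def handle_comment(self, data):
        # Only called for comments the parser treats as markup.
        self.comments.append(data)


def script_content(markup: str):
    """Return (script text, list of markup comments) for the given markup."""
    parser = ScriptContentCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.script_chunks), parser.comments


if __name__ == "__main__":
    print(script_content('<script><!-- document.write("hi") // --></script>'))
```

The parser hands back the <!-- ... --> text as the content of the
script element and reports no markup comment at all. What that
content means is the script language's business.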

>  It is very clear to me that we
>cannot go much further without putting (allowing, defaulting, supporting) the
>SGML prologue into HTML. 

I disagree. For my argument, please see:

>In particular:
>    NOTATION could be used quite nicely for both SCRIPT and MATH (NOTATION=TeX
>anyone?) It would allow for direct experimentation with other scripting

There's nothing about NOTATION that facilitates this experimentation.
You can do it with MIME types in CDATA attributes just as well
as SGML notations.

> Parameter ENTITYs (particularly if you support URL SYSTEM
>identifiers) allows you to very neatly encapsulate common boilerplate or
>decorations and ease maintenance. 

Again: are you suggesting this as a local server-side feature, or an
extension of the over-the-wire HTML standard? If it's just a question
of maintenance, using SPAM and entities makes a lot of sense for
local document management.

>While we're at it, can't we at least have a sentence somewhere official
>encouraging support of processing instruction syntax instead of random comment
>hackery?  Please?  

What would such a sentence say? Would you care to write the draft?