Re: <insert> and external entity references

Murray Altheim
Tue, 19 Mar 1996 18:58:54 -0400

Message-Id: <v02110100ad74b6aa1fee@[]>
Date: Tue, 19 Mar 1996 18:58:54 -0400
To: "C. M. Sperberg-McQueen" <>
From: (Murray Altheim)
Subject: Re: <insert> and external entity references

C. M. Sperberg-McQueen <> writes:
>Excuse me if someone explained this while I was not paying attention,
>but why are we talking about adding an INSERT tag with the semantics 'go
>find this file or document, and insert it here', when SGML already has
>the mechanisms needed for this, in the form of entity references?  Why
>not just start writing, requesting, or demanding HTTP servers that
>actually understand and process references to external entities
>as defined by ISO 8879?

A very good question. I've raised this several times in the past few
months. Amanda Walker had talked about writing up a draft for allowing
various SGML features in HTML. Dan and I also have conversed on this topic.
Terry Allen continues to advocate SGML as a solution. Raising this topic
generally feels like talking into the wind.

The Web has a strong 'not-invented-here' problem when it comes to SGML.
Almost without fail, every feature currently demanded has been solved in
various ways within a ten-year-old specification: SGML. And not much would
have to be changed in HTML to make these things work; we'd simply allow
existing SGML features that are currently proscribed in HTML. The problem
is, as you mentioned, that a lot of software would have to be rewritten.
Since entity management potentially happens at both server and client,
both might need rewrites. In the beginning stages it could be simplified.
See below.

This would really be the big change: not using HTML as the base language of
the Web. We'd use SGML (MIME type "text/sgml; level=1|2|3|4"), allowing the
DOCTYPE of the document to determine the DTD, just as in SGML. That DOCTYPE
could simply specify a dialect of HTML for the current majority of web

"Level" describes a migration path, where for example:

 +  Level 1 is HTML as today, only in an SGML 'wrapper'. Documents would
    be required to be valid (what does this mean?)
 +  Level 2 would handle HTML with minimal SGML features (text literal
    entity declarations, marked sections for conditional documents,
    minimal SGML additions, etc.)
 +  Level 3 would specify a larger character entity list, CDATA marked
    sections, etc.
 +  Level 4 would be full SGML compliance, allowing SUBDOC, LPDs, etc.
    according to ISO 8879.
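
As a rough illustration of how a client might read the proposed level from
the MIME type, here is a minimal sketch. It assumes the level arrives as an
ordinary Content-Type parameter on "text/sgml"; the function name sgml_level
is hypothetical and nothing here is part of any existing specification.

```python
from email.message import Message


def sgml_level(content_type: str) -> int:
    """Return the 'level' parameter of a text/sgml Content-Type value.

    Hypothetical helper: the text/sgml level parameter is only a proposal.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/sgml":
        raise ValueError(f"not text/sgml: {content_type!r}")
    level = msg.get_param("level")  # None when the parameter is absent
    if level not in {"1", "2", "3", "4"}:
        raise ValueError(f"missing or unknown level: {level!r}")
    return int(level)


if __name__ == "__main__":
    print(sgml_level("text/sgml; level=2"))
```

A client that only implements level 2 could compare the returned value
against its own level and fall back or refuse when the document asks for
more.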

>To illustrate, for those not yet conversant with all of SGML:  for
>this ...
[explanation of SGML entities...]
>If it is desired (as some have proposed) that the external entity
>be parsed as a completely independent object, the required variation
>in the syntax is again already provided by SGML:  just declare
>the external entity as a SUBDOC (i.e. a free-standing document, to
>be parsed on its own, not as part of the current document).
> <!ENTITY myfile SYSTEM "/usr/me/public_html/file.html" SUBDOC>
>N.B. not all SGML software supports the SUBDOC feature, just as
>not all SGML software understands URLs, which are not after all
>defined by ISO.  That shouldn't make too much difference, I think:
>we are talking about a change to HTTP and/or HTML, and that means
>rewriting at least some software.

As you know, SUBDOC engenders a number of rather difficult problems
(recursion problems for one). But yes, it does solve the problem. We'd
simply put an application requirement that the SGML system understand URLs.
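
To make the recursion problem concrete: if document A declares B as a
SUBDOC and B (directly or indirectly) declares A, a naive processor never
finishes. The sketch below is one possible guard, assuming the processor
already has a mapping from each document to the SUBDOC entities it
references; the mapping and the name find_subdoc_cycle are illustrative
only.

```python
def find_subdoc_cycle(refs, start):
    """Return the chain of documents forming a SUBDOC cycle, or None.

    refs maps a document name to the documents it includes as SUBDOC.
    This is a plain depth-first search, not anything defined by ISO 8879.
    """
    path = []        # documents on the current inclusion chain, in order
    on_path = set()  # same documents, for fast membership tests
    done = set()     # documents already fully explored without a cycle

    def visit(doc):
        if doc in on_path:
            # doc is already being expanded: report the loop back to it.
            return path[path.index(doc):] + [doc]
        if doc in done:
            return None
        path.append(doc)
        on_path.add(doc)
        for child in refs.get(doc, ()):
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(doc)
        done.add(doc)
        return None

    return visit(start)


if __name__ == "__main__":
    refs = {"a.html": ["b.html"], "b.html": ["c.html"], "c.html": ["a.html"]}
    print(find_subdoc_cycle(refs, "a.html"))
```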

>Is there an advantage to inventing a new notation for inclusion
>of documents and document fragments, rather than using the
>existing notation?  Or is it just not widely known that notation
>for such inclusion already exists and need only be adopted, instead
>of being invented?

I don't know if this information is widely known or not. My feeling is that
some people think that many of the more 'advanced' SGML features use a
syntax that is a little too arcane for 'common HTML authors' (whatever that
means). But it seems amazing to me sometimes that we keep trying to
reinvent the wheel when a solution is already at hand and written in an ISO

Here are some of my rough notes with HTML-to-SGML evolution mapped as levels.
Level 1 is HTML 2.0 conformant (RFC 1866). Nothing solid here, just ideas.

 +  Include full corpus of ISO 8879:1986 (SGML) entity sets (Appendix D in
    Goldfarb). These are already standardized and widely available:
        _Entity_Set________________               _Level_
        Added Latin 1                                1
        RFC 1866's 'recommended' additions           1?
        Added Latin 2                                2
        Diacritical Marks                            2
        Publishing                                   2
        General Technical                            2
        Greek Letters                                2
        Greek Symbols                                2
        Monotoniko Greek                             3
        Alternative Greek Symbols                    3
        Numeric and Special Graphic                  3
        Added Math Symbols: Arrow Relations          3
        Added Math Symbols: Binary Operators         3
        Added Math Symbols: Delimiters               3
        Added Math Symbols: Negated Relations        3
        Added Math Symbols: Ordinary                 3
        Added Math Symbols: Relations                3
        Box and Line Drawing                         3
        Russian Cyrillic                             3
        Non-Russian Cyrillic                         3

        Levels as shown above. A level n conforming application would support
        all entity references up to that level.
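
The "up to that level" rule can be sketched as a small lookup. The set
names and levels are copied from the table above; the "1?" entry is placed
at level 1 here purely as an assumption, and the names LEVEL_SETS and
sets_for_level are made up for the example.

```python
# Entity sets introduced at each level, as listed in the table above.
LEVEL_SETS = {
    1: [
        "Added Latin 1",
        "RFC 1866's 'recommended' additions",  # marked "1?" in the notes
    ],
    2: [
        "Added Latin 2", "Diacritical Marks", "Publishing",
        "General Technical", "Greek Letters", "Greek Symbols",
    ],
    3: [
        "Monotoniko Greek", "Alternative Greek Symbols",
        "Numeric and Special Graphic",
        "Added Math Symbols: Arrow Relations",
        "Added Math Symbols: Binary Operators",
        "Added Math Symbols: Delimiters",
        "Added Math Symbols: Negated Relations",
        "Added Math Symbols: Ordinary",
        "Added Math Symbols: Relations",
        "Box and Line Drawing", "Russian Cyrillic", "Non-Russian Cyrillic",
    ],
}


def sets_for_level(level: int) -> list:
    """Return every entity set a level-n application would support."""
    supported = []
    for n in sorted(LEVEL_SETS):
        if n <= level:
            supported.extend(LEVEL_SETS[n])
    return supported


if __name__ == "__main__":
    for name in sets_for_level(2):
        print(name)
```

Level 4 adds no entity sets in the table, so it resolves to the same list
as level 3.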

 +  Marked sections
        Level 1: none
        Level 2: allowing INCLUDE, IGNORE
        Level 3: allowing CDATA (non-SGML data)
        Level 4: allowing RCDATA (replaceable character data: entity
                     references are replaced.)
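
One way to read the marked-section list is as a minimum level per status
keyword. The following is only a sketch under that reading: it scans for
literal keywords with a regular expression, ignores keywords supplied via
parameter entity references, and is nowhere near a real SGML parser. The
names MIN_LEVEL and required_level are hypothetical.

```python
import re

# Minimum level at which each marked-section keyword is allowed,
# taken from the list above.
MIN_LEVEL = {"INCLUDE": 2, "IGNORE": 2, "CDATA": 3, "RCDATA": 4}

# Matches the opening of a marked section such as "<![ IGNORE [" or
# "<![CDATA[" and captures the keyword.
MARKED_SECTION = re.compile(r"<!\[\s*([A-Za-z]+)\s*\[")


def required_level(document: str) -> int:
    """Return the lowest level that permits every marked section found."""
    levels = [
        MIN_LEVEL[keyword.upper()]
        for keyword in MARKED_SECTION.findall(document)
        if keyword.upper() in MIN_LEVEL
    ]
    # A document with no marked sections is fine at level 1.
    return max(levels, default=1)


if __name__ == "__main__":
    print(required_level("<p>text <![ IGNORE [ draft note ]]></p>"))
```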

 +  #DEFAULT entity declaration
        Level 1: none
        Level 2: allowed as text literal
        Level 3: allowing PUBLIC/SYSTEM entity references

 +  Declaration subsets (with limitations based on 'level' as above)
        Level 1: none
        Level 2: text literals only
        Level 3: PUBLIC & SYSTEM entity references
        Level 4: SUBDOC allowed
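
The declaration-subset levels could be checked per entity declaration. This
sketch classifies a single <!ENTITY ...> declaration as a text literal, a
PUBLIC/SYSTEM reference, or a SUBDOC, and reports the lowest level that
would allow it. It assumes one well-formed declaration whose literals
contain no ">" character; the name min_level is illustrative.

```python
import re

# Captures the entity name and everything between it and the closing ">".
# An optional "%" marks a parameter entity and is skipped.
ENTITY_DECL = re.compile(r"<!ENTITY\s+(?:%\s+)?(\S+)\s+(.*?)>", re.S)


def min_level(declaration: str) -> int:
    """Return the lowest level permitting this entity declaration."""
    match = ENTITY_DECL.search(declaration)
    if not match:
        raise ValueError(f"not an entity declaration: {declaration!r}")
    words = match.group(2).split()
    if not words:
        raise ValueError(f"empty entity declaration: {declaration!r}")
    if words[-1].upper() == "SUBDOC":
        return 4  # free-standing subdocument
    if words[0].upper() in ("SYSTEM", "PUBLIC"):
        return 3  # external entity reference
    return 2      # text literal


if __name__ == "__main__":
    decl = '<!ENTITY myfile SYSTEM "/usr/me/public_html/file.html" SUBDOC>'
    print(min_level(decl))
```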

 +  Remove/increase quantity limits on SGML (go with dynamically allocated
       buffers, or advocate using catalogs to declare SGML declarations with
       limits beyond the implicit 8879 declaration to handle specific
       document needs.)
        Level 1: same as HTML (RFC 1866)
        Level 2: expanded to larger limits
        Level 3: dynamically allocated or using catalogs with

 +  Change SGML declaration charset (as per i18n draft) to be ISO 10646.
        This is already proposed.
        Level 1: same as HTML (RFC 1866).
        Level 2: same as i18n

 +  Allow NOTATION TeX, etc.
        Level 1: none
        Level 2: allowed

 +  Link Process Definitions, particularly for ICADD support. (Stylesheet
    support could conceivably be handled via LPDs. There are certainly
    better ways...)
        Level 1: none
        Level 2: none
        Level 3: allowed

 +  Processing instructions (these are typically system-dependent, so I lean
    away from their use on the web.)
        Level 1: none
        Level 2: none
        Level 3: allowed (with restrictions)
        Level 4: allowed (no restrictions)

In my own testing, our own (Stonehand) HTML viewer can support most of
this list, which says to me that it can be done. Performance is
not a big issue once the DTD is compiled. The question is whether or not
large vendors are willing and able to implement this type of radical
departure. For browser engines not based on an SGML engine, this may be
next to impossible.

Those who've been hanging around here awhile have seen a similar list
before. I'm still willing to deal with this, but not if nobody wants it.
Last time this was brought up, Dan Connolly, Terry Allen, a few others,
and I talked a bit among ourselves, but there certainly wasn't any
overwhelming consensus of support. The current push continues in the
direction of further complexity of the HTML language and browsers. The next
version of HTML only gets thicker and farther away from being authored or
read by humans.

As above, I think the solution to many of the current language needs comes
down to using an SGML MIME type, where the specific SGML application is
determined by DOCTYPE. Otherwise, we are simply coming up with another
syntactic alternative to features that exist in any SGML application except
HTML, where these existing features have been disallowed. But with this,
many (or even most) of the current document assumptions made by current
HTML browsers go out the window -- we would need SGML-compliant browsers.
This would certainly violate Amanda Walker's "Minimum User Astonishment"
design principle.


     Murray Altheim, Program Manager
     Spyglass, Inc., Cambridge, Massachusetts
     email: <>
     http:  <>