SGML entities

C. M. Sperberg-McQueen (cmsmcq@uic.edu)
Tue, 19 Mar 96 15:11:47 CST


Message-Id: <9603192231.AA18080@www10.w3.org>
Date:         Tue, 19 Mar 96 15:11:47 CST
From: "C. M. Sperberg-McQueen" <cmsmcq@uic.edu>
Subject:      SGML entities
To: www-html@w3.org

Well, several people have replied to my note, explaining that what they
want to do with the <insert> tag cannot be done with SGML entity
references for a variety of reasons.  This would be more convincing if
so many of the reasons pointed to flaws in the SGML entity notation
rather than to misunderstandings and misconceptions about it.
On the whole, the reasons given appear to be designed to convince an
observer that knowledge of SGML is still pretty scarce among those
discussing the future of HTML on this list.

To take a few points, one by one.

Chris Lilley (<lilley@afs.mcc.ac.uk>) writes:

> A few problems with this:
>
> 1) General practice on the Web is to not ship the DTD around with
> the document instance. Thus, it is tricky to add your own
> entities like this.

I thought I made it clear that expansion of the entity reference would
be handled by the server, not by the client.  In an ideal world, it
might be nice to have it be done sometimes by the server, sometimes
by the client.  But that seems hard to work into http now.

If the server expands the entity reference, then the declaration of
the entity in question can be dropped, and we don't *have* to ship the
DTD around.

But then, if we do ship the DTD around, what happens?  Browsers which
don't know what to do with it may not do the right thing with it.

This sounds rather similar to what happens if we use an <INSERT>
element or any other element:  browsers which don't know what to do
with them may not do the right thing with them.  On the whole, I don't
see that this argument would provide a reason to prefer one syntax over
the other, even if it were right in assuming that the DTD subset needs
to be shipped around the net with the document.

> 2) It is not flexible enough. Consider how this example might be coded
> using the proposed syntax:
>
>  <insert
>     classid="java:Fishy.class"
>     code="http://site/applets/Fishy.class"
>     width="4in"
>     height="2.7in"
>     border="12pt"
>     type="application/x-java-bytecode">
>        <param name="species" value="guppy">
>        <param name="current" value="strong">
>   </insert>

This combines the insertion of arbitrary content into the document
with the specification of a window within which Java code will run.
You *could* jam both of these into an entity reference, but it would
be better, probably, to keep these separate.  If we use the APPLET
element defined for Java, the definition of the entity, and
its association with the embedded applet, are both straightforward.

<!ENTITY myfish SYSTEM 'http://site/applets/Fishy.class'
                NDATA java-bytecode
                [ classid="java:Fishy.class"
                  param  ="species = 'guppy' current='strong'"
                ] >

This assumes a definition of the java-bytecode notation which might
be something like this:

  <!NOTATION java-bytecode
      PUBLIC '-//Sun Microsystems//NOTATION Java byte-code//EN'>
  <!ATTLIST  #NOTATION java-bytecode
             classid   CDATA  #REQUIRED
             param     CDATA  #IMPLIED   >

In the document, then, your ten lines become:

  <applet width="4in" height="2.7in" border="12pt">

In short, you seem to be arguing that SGML entity declarations are
'not flexible enough' without taking into account what they can and
cannot do.

Abigail sans Surname (who may or may not be a pseudonym of H. Schipper)
writes:

> There are various reasons not to do it. First of all, it would
> only work if the included file is an html fragment; <insert>
> could as easily include an image or video.

SGML entities are by no means restricted to SGML content.  (See example
just given.)  They can as easily contain images or video.  This does not
seem to be a reason to prefer one notation over the other.

> A second reason not to have the server include the documents is
> caching. If the included or the including file changes often
> and the other hardly, agents might benefit from caching, which you
> would lose when you let the server deal with it.

Yes.  That might be a reason to prefer client-side expansion of
entity references.  On the other hand, any method one chooses of
organizing data is apt to pessimize some caching scheme or other,
under the right circumstances.  If the client does the expansion
of the entity reference or INSERT, and the material inserted changes
frequently, the copy in the cache is apt to be out of date.
My copy of Netscape does not detect this:  it just happily shows
me the outdated cached copy of changed documents until I force a
reload, manually.  But either way, this is an argument about whether to
do expansion on the server or on the client side, not an argument for
inventing a new notation for existing SGML functionality.
Or am I wrong?

> Third reason is that it requires SGML aware servers. Apart from
> the new software which is needed, it means servers have to parse
> all outgoing documents; which will mean a degrade in performance.

Not necessarily:  a server does not have to be fully SGML compliant (or
even fully SGML aware) to recognize and act appropriately on entity
declarations and entity references.  We are talking about making either
servers or clients perform a new kind of service, namely inclusion of
external material, for which we need some kind of notation.  The
software is going to have to be changed either way; it can be changed to
recognize a new tag (<INSERT>), or it can be changed to recognize an
existing entity reference syntax.  Neither requires full SGML support.
(Although full SGML support would be a damn good thing for the Web:  for
further discussion, see the paper Bob Goldstein and I wrote, at
http://www.uic.edu/~cmsmcq/htmlmax.html.)
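
To give a rough idea of how little machinery that takes, here is a
minimal sketch in Python.  It is an illustration only, not part of any
existing server: the function name expand_entities and the fetch
callback are hypothetical, and it recognizes only the simplest form of
external entity declaration (a name, SYSTEM, and a quoted system
identifier, with the keywords in upper case).  Declarations carrying
NDATA or data attributes, parameter entities, marked sections, and
references not closed by a semicolon are all ignored.

```python
import re

# External entity declarations of the simple form
#   <!ENTITY name SYSTEM "system-id">
# Anything more elaborate (NDATA, data attributes) does not match
# and is passed through untouched.
ENTITY_DECL = re.compile(
    r"<!ENTITY\s+([A-Za-z][A-Za-z0-9.-]*)\s+SYSTEM\s+"
    r"(?:\"([^\"]*)\"|'([^']*)')\s*>"
)

# General entity references of the form &name;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(document, fetch):
    """Expand references to external entities declared in `document`.

    `fetch` is supplied by the caller and maps a system identifier
    to the text that should replace the reference.
    """
    declared = {}

    def record(match):
        name = match.group(1)
        system_id = match.group(2)
        if system_id is None:
            system_id = match.group(3)
        # If a name is declared twice, keep the first declaration.
        declared.setdefault(name, system_id)
        # Drop the declaration; the client never needs to see it.
        return ""

    without_declarations = ENTITY_DECL.sub(record, document)

    def substitute(match):
        name = match.group(1)
        if name not in declared:
            # Not one of ours (e.g. &amp;): leave it for the client.
            return match.group(0)
        return fetch(declared[name])

    return ENTITY_REF.sub(substitute, without_declarations)
```

A server would call this with a fetch function that reads a local file
or retrieves a URL.  Two regular expressions and a dictionary are a
long way from an SGML parser, which is the point.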

If the server *is* SGML compliant, and parses all documents, then it's
probably better to parse and validate all static documents each time
they are revised, rather than each time they are retrieved.  Dynamic
documents (i.e. documents created on the fly by running processes)
probably can't be handled that way; maybe they do need to be parsed at
retrieval time.
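
The difference between doing the work at revision time and doing it at
retrieval time can be sketched as follows.  This is a hypothetical
illustration, not a description of any real server: serve_static and
the in-memory table are invented names, and the expand argument stands
for whatever parsing or entity expansion the server performs.  The
sketch redoes the work only when the file's modification time changes;
it does not notice changes in the included material itself.

```python
import os

# path -> (modification time when processed, processed text)
_processed = {}


def serve_static(path, expand):
    """Return the processed form of a static document.

    The document is read and passed through `expand` only when it has
    been revised since the last time it was processed.
    """
    mtime = os.path.getmtime(path)
    cached = _processed.get(path)
    if cached is None or cached[0] != mtime:
        with open(path, encoding="utf-8") as handle:
            cached = (mtime, expand(handle.read()))
        _processed[path] = cached
    return cached[1]
```

A document generated on the fly has no file to check, so it would have
to go through expand on every request.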

Paul Prescod <papresco@calum.csclub.uwaterloo.ca> writes:

> The first (perhaps the most major) is that you must declare
> entities in the DTD, and most browsers do not even _support_ a DTD
> (either within the same document or elsewhere). The concept of a

If we are talking about client-side inclusions, then this is equally an
argument against <INSERT>, right?  Most browsers do not even _support_
<INSERT> now.  And yet, the list seems to be discussing it without
panic.  Similarly, most browsers do not support document type
declarations.  Nu?

> The second is that HTML authors do not like "naming" things that
> already have names. In other words, they do not like giving an SGML
> entity name for something that already has a URL. Part of the
> difference between the HTML community and other SGML DTD user groups
> is that most HTML authors do not use "smart" authoring tools, and
> SGML-smart authoring tools are especially rare.

This seems to me rather a large generalization, but even taken at face
value I'm not sure it's an argument for reinventing yet another wheel.

> The third is that at this point HTML's linking and embedding
> paradigm is pretty far from most SGML DTD's, so a radical shift in
> the direction of SGML linking would probably leave the vendors and
> users behind, and they would implement their own <INSERT> tags in
> the manner that seems natural to them.

I don't follow this at all.  We were talking about whether inventing
a new tag, or using the existing entity/entity reference notation,
was a more sensible approach to embedding the same material in many
documents, or in one document many times.  Would it be a radical shift
in HTML to use SGML syntax for entities?  Hmm.  OK, then, I'm a radical.
SGML NOW!  SGML NOW!! SGML NOW!!!  YOUR BROWSER DON'T SUPPORT IT?  THEN
BURN, BABY, BURN.

It's a heady feeling, being a radical just by suggesting that existing
solutions be reused when they meet all the needs expressed in a
discussion.  I like it.

> None of these are insurmountable, and I would hope and expect that
> we will see SGML entities in typical Web browsers soon, but I also
> think we need something in the meantime. I would suggest an
> overloading of the EMBED element without any attempt to combine the
> two documents.

This is the first explanation I've seen that seems to make a grain of
sense -- and even it doesn't make all that much sense.  I can sympathize
with the desire to get something that works, in the short run.  But
since the existing solution already *exists*, and can be implemented
with much less arguing about details (just follow ISO 8879), I don't see
why using it should take that long.  Reusing the wheel is faster than
reinventing it.

Unless, of course, you count the time it takes to persuade people that
(a) the wheel exists, (b) it meets the specifications, and (c) it
doesn't have to be rejected just because it was invented somewhere else.

C. M. Sperberg-McQueen