SGML Cop backs off

Dan Connolly (connolly@pixel.convex.com)
Tue, 17 Nov 92 08:53:07 CST


Message-Id: <9211171453.AA07331@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: www-talk@nxoc01.cern.ch
Subject: SGML Cop backs off
In-Reply-To: Your message of "Tue, 10 Nov 92 15:13:07 EST."
             <m0mp1xh-00009MC@garnet.msen.com>
Date: Tue, 17 Nov 92 08:53:07 CST
From: Dan Connolly <connolly@pixel.convex.com>

I've reconsidered my position on the "framing tags" in HTML after
a more careful consideration of the SGML standard, and after
receiving the O'Reilly/HaL DocBook materials and the MidasWWW
browser.
 
To refresh your memory...
 
>     Currently HTML documents are transmitted without the normal SGML framing
>     tags, but if these are included parsers will ignore them.
>
>I don't know what "the normal SGML framing tags" are. An SGML document
>has three parts: the SGML declaration, the prologue, and the instance.
>It is common in SGML applications to use an implied SGML declaration
>and include the prologue by reference (kinda like an #include
>directive in C.) but without these "framing tags," it's just not an
>SGML document.
 
The SGML standard is big on the distinction between entities and
everything else; that is, the physical breakup of an SGML document
into storage units such as files, directories, or MIME body parts
(collectively, "entities") is pretty much arbitrary. You can't break
<TITLE> between <TI and TLE>, but other than that, it's pretty much
fair game.
 
So it appears that it's not necessary or even wise to model the HTML
data format as an SGML document entity, but rather as an SGML text
entity.  That is, the way to validate/parse an HTML document is not to
sic the parser on the text/html body part itself, but on a document
consisting of two entities: the HTML DTD entity, and the text/html
body part.
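
As a rough sketch of that two-entity view (in Python, purely for
illustration), a validator front end could assemble the document it
hands to the parser by putting a prolog in front of the body part.
The names IMPLIED_PROLOG and as_sgml_document and the system
identifier "html.dtd" are assumptions made up for this example; they
do not come from any existing tool.

```python
# Hypothetical prolog that pulls in the HTML DTD entity by reference.
# The system identifier "html.dtd" is an assumption for illustration.
IMPLIED_PROLOG = '<!DOCTYPE HTML SYSTEM "html.dtd">\n'


def as_sgml_document(body_part: str, prolog: str = IMPLIED_PROLOG) -> str:
    """Return the text a validating parser would see: prolog, then body."""
    return prolog + body_part


# The text/html body part travels without any prolog of its own.
body = "<TITLE>Example</TITLE>\n<H1>Example</H1>\n"
document = as_sgml_document(body)
print(document)
```

The body part itself is never modified; the prolog is supplied only
on the receiving side, at the moment of parsing.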
 
If we were talking about a text/c-program content type, what I
was suggesting would be like putting the line:
 
#include <stdlib.h>
 
at the top of every text/c-program body part. What I'm suggesting
now is like assuming every text/c-program body part gets stdlib.h
prepended before compiling.
 
This assumes that text/html data always has the HTML DTD entity in
front of it, but that assumption has always been there.
 
Besides, forcing a text/html parser to grok SGML document entities
creates some sticky issues -- we'd have to limit the prologue
to the simple <!DOCTYPE HTML SYSTEM>, and that's not really legal.
You're supposed to be able to do things like:
 
<!DOCTYPE HTML SYSTEM [
<!ENTITY smiley ":-)"> <!-- add my own "macro" -->
]>
<HTML><TITLE>The history of the smiley: &smiley;</TITLE>
...
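
To give a feel for what supporting that would cost a browser, here is
a toy Python sketch that reads entity declarations out of the
internal subset and expands references to them in the instance. It is
not a real SGML parser: it assumes the exact declaration layout shown
above, ignores parameter entities, marked sections and comments in
the instance, and the function name expand_internal_entities is made
up for this example.

```python
import re

# Toy patterns only: a real SGML parser handles far more than this.
INTERNAL_SUBSET = re.compile(r'<!DOCTYPE\s+\w+\s+SYSTEM\s*\[(.*?)\]>', re.DOTALL)
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>')


def expand_internal_entities(document: str) -> str:
    """Expand &name; references declared in the DOCTYPE internal subset."""
    match = INTERNAL_SUBSET.search(document)
    if match is None:
        # No internal subset: nothing to expand.
        return document
    entities = dict(ENTITY_DECL.findall(match.group(1)))
    instance = document[match.end():]
    # Leave references to undeclared entities untouched.
    return re.sub(
        r'&(\w+);',
        lambda ref: entities.get(ref.group(1), ref.group(0)),
        instance,
    )


doc = '''<!DOCTYPE HTML SYSTEM [
<!ENTITY smiley ":-)"> <!-- add my own "macro" -->
]>
<HTML><TITLE>The history of the smiley: &smiley;</TITLE>
'''
print(expand_internal_entities(doc))
```

Even this much is more than a simple text/html parser wants to carry
around, which is the sticky issue.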
 
If we adopt this change of perspective, we should make it clear
in the HTML specification that for the purposes of SGML, a text/html
body part is not an SGML document entity, but an SGML text
entity. The html.dtd entity and the text/html body part text entity
comprise an SGML document. I need to update html.dtd, fix-html.pl,
and the www_and_frame materials to reflect this change of
perspective.
 
By the way: this change makes it more straightforward to use an
SGML declaration other than the default, e.g. to increase NAMELEN
to allow tag names longer than 8 characters. Should we do that while
we're at it?
 
Dan
 
p.s. Check out the MidasWWW browser. It's long overdue in the WWW
project, but it's worth the wait!