- From: Daniel W. Connolly <connolly@hal.com>
- Date: Fri, 09 Dec 1994 11:28:20 -0600
- To: Farrar@metamor.com
- Cc: Multiple recipients of list <www-html@www0.cern.ch>
In message <9412091638.AA26430@dxmint.cern.ch>, Brian Farrar writes:
>
>> I'm certain there must be an FAQ answer someplace that succinctly describes
>> the differences and similarities of SGML and HTML. Any pointers from anyone?

OK... I'll bite... this should go in an HTML FAQ somewhere... maybe it already is...

It's not a matter of differences and similarities, the way I see it: HTML is an application of SGML, the way LaTeX is an application of TeX, or the way the MS macro set is an application of troff, or the way differential equations are an application of set theory.

Some folks have said HTML is a subset of SGML. You could look at it that way: the set of HTML documents is a subset of the set of SGML documents.

Each SGML document has three parts: an SGML declaration, a prologue, and an instance. The prologue is often called the DTD, and for the sake of this discussion, we'll let that slide.

The DTD specifies a document type. Part of the specification of a document type is a sort of grammar that gives the order and occurrence of the elements; e.g. "A Book shall consist of a preface and one or more chapters."

The instance must _conform_ to the DTD. This business of conformance can be checked by machine. (This is probably the handiest feature of SGML over something like troff or TeX.)

So the set of SGML documents looks like

    { (decl, dtd, instance) :
        decl is an SGML declaration and
        dtd is an SGML DTD and
        instance is an SGML instance and
        instance conforms to dtd }

For HTML, the decl and DTD are fixed; so the set of HTML documents looks like:

    { (html-decl, html-dtd, instance) :
        html-decl is the HTML SGML decl and
        html-dtd is the HTML DTD and
        instance is an SGML instance and
        instance conforms to html-dtd }

OK... so much for theory. In practice, popular software that deals with HTML (e.g. NCSA Mosaic) doesn't support all the features of SGML. There are a few obscure bugs here and there, and there are a few major omissions.
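As an aside on the conformance point above, here is a minimal sketch of what "checked by machine" can mean for the Book example. It is written in Python rather than SGML, it is not how a real SGML parser works, and the names BOOK_MODEL and conforms are made up for illustration. The content model "a preface and one or more chapters" is approximated by a regular expression over the sequence of child element names.

```python
import re

# Hypothetical content model for the Book example:
# exactly one preface, followed by one or more chapters.
BOOK_MODEL = re.compile(r"(preface )(chapter )+")

def conforms(elements):
    """Return True if the list of child element names fits BOOK_MODEL."""
    # Join the names with trailing spaces so each name is matched whole.
    flattened = "".join(name + " " for name in elements)
    return BOOK_MODEL.fullmatch(flattened) is not None

print(conforms(["preface", "chapter", "chapter"]))  # True
print(conforms(["chapter", "preface"]))             # False
```

A real DTD does much more than this (nesting, attributes, inclusions), but the basic idea is the same: the grammar is data, so a program can test an instance against it.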
*** Entity management:

Most of the omissions relate to the fact that SGML in general allows a prologue to have more than just a DTD, and it allows a document to consist of more than one entity (think of an entity as a file for now). You can sort of "customize" the DTD on a per-document basis.

So while popular HTML software will only deal with this prologue:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

(and this is a happy coincidence: they deal with it by ignoring it), a conforming SGML parser will let you write:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" [
    <!entity buyer "Widget Co.">
    <!entity seller "Gadget Co.">
    <!entity agreement SYSTEM "agreement.html">
    ]>
    &agreement;

where agreement.html looks something like:

    <title>Agreement between &buyer; and &seller;</title>
    <h1>Terms and Conditions</h1>
    <ol>
    <li>&buyer; agrees not to shoot &seller;.
    <li>&seller; agrees not to shoot &buyer;.
    <li>&buyer; agrees to give all their money to &seller;.
    <li>&seller; agrees to give &buyer; some stuff.
    </ol>

*** Marked sections:

A conforming SGML parser will deal with markup like:

    <![ IGNORE [ lksjdflkjs<tags> data whatever ]]>

and ignore it. You can also write:

    <![ CDATA [ <tags>, <!-- junk, &foo;, blah ]]>

and everything between the []'s will be treated as regular data characters: the string '<tags>' won't be treated as a tag at all.

Another use of marked sections is in combination with parameter entities, kinda like #defines and #ifdefs in C. The prologue for some SGML document might look like:

    <!doctype foo PUBLIC "-//foo corp//DTD foo//EN" [
    <!entity % in-house "IGNORE">
    ]>

Then, in the instance, you might see:

    blah blah blah
    <![ %in-house; [ See Henry for details on how this works
    here at foo corp. ]]>

All the in-house marked sections can be turned on and off by changing the in-house entity declaration in the prologue. Some SGML parsers (SGMLS, for one) support a command-line switch for this, just like -D on a cc command.
So you could get all the in-house stuff with:

    % sgmls -iin-house foo.sgm

So popular HTML implementations are like C compilers that don't let you use the C preprocessor, or like a LaTeX conversion program that barfs if you define your own TeX macros.

That doesn't mean that you can't feed real live conforming SGML documents to popular HTML implementations. The programs you gave to a C compiler that didn't support cpp would still be valid C programs: they'd just be painful to write.

Unfortunately, unlike this hypothetical C compiler, popular HTML implementations also eat documents that are not valid SGML documents at all.

First, they allow some kinds of syntax errors, like:

    <a href=foo/bar/baz.html>

which should be:

    <a href="foo/bar/baz.html">

Also, popular HTML implementations don't check the order and occurrence of elements with respect to any particular DTD.

There is a DTD for HTML under discussion by the HTML Working Group of the IETF. See http://www.hal.com/~connolly/html-spec/ for details.

There are certain markup idioms, like:

    <dl>
    <dt><h3> used H3 to get the font I like</h3>
    <dd> some text
    </dl>

that the current HTML DTD doesn't allow. The DTD for HTML _could_ be constructed to allow such idioms, but I don't think that would be a good idea, and most of the folks in the working group agree with me.

You might say "but that markup works fine on all the browsers I've seen." My answer is that this is a happy coincidence, but no browser should be _required_ to support that sort of thing -- and you shouldn't _expect_ it to work with tools that may be developed in the future.

In the future, we'd like folks to be able to build browsers that, for example, display a table of contents of your document alongside the main text window. If folks use H3 just for font changes, then a TOC display would look silly.

So that's my take on the difference between SGML in general, HTML in theory, and HTML in practice.

Dan
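A sketch of the table-of-contents point above, assuming Python and its standard html.parser module (which is a lenient HTML tokenizer, not a conforming SGML parser). The names HeadingCollector and table_of_contents are hypothetical. The code collects every heading it sees and indents it by level, which is roughly what a TOC pane would have to do.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            self.toc.append((self._level, "".join(self._text).strip()))
            self._level = None

def table_of_contents(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.toc

doc = """<h1>Terms and Conditions</h1>
<dl>
<dt><h3> used H3 to get the font I like</h3>
<dd> some text
</dl>"""

for level, text in table_of_contents(doc):
    print("  " * (level - 1) + text)
```

Run on that input, the H3 that was only there for its font shows up as a third-level TOC entry under "Terms and Conditions", which is the silly-looking result described above.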
Received on Friday, 9 December 1994 18:35:28 UTC