Message-Id: <9211171453.AA07331@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: www-talk@nxoc01.cern.ch
Subject: SGML Cop backs off
In-Reply-To: Your message of "Tue, 10 Nov 92 15:13:07 EST."
             <m0mp1xh-00009MC@garnet.msen.com>
Date: Tue, 17 Nov 92 08:53:07 CST
From: Dan Connolly <connolly@pixel.convex.com>

I've reconsidered my position on the "framing tags" in HTML after a
more careful reading of the SGML standard, and after receiving the
O'Reilly/HaL DocBook materials and the MidasWWW browser.

To refresh your memory...

> Currently HTML documents are transmitted without the normal SGML framing
> tags, but if these are included parsers will ignore them.
>
>I don't know what "the normal SGML framing tags" are. An SGML document
>has three parts: the SGML declaration, the prologue, and the instance.
>It is common in SGML applications to use an implied SGML declaration
>and include the prologue by reference (kinda like an #include
>directive in C.) but without these "framing tags," it's just not an
>SGML document.

The SGML standard is big on the distinction between entities and
everything else. That is, the physical breakup of an SGML document
into storage units (files, directories, MIME body parts; collectively,
"entities") is pretty much arbitrary. You can't break <TITLE> between
<TI and TLE>, but other than that, it's pretty much fair game.

So it appears that it's not necessary, or even wise, to model the HTML
data format as an SGML document entity. It should be modeled as an
SGML text entity instead. That is, the way to validate/parse an HTML
document is not to sic the parser on the text/html body part itself,
but on a document consisting of two entities: the HTML DTD entity and
the text/html body part.

If we were talking about a text/c-program content type, what I was
suggesting would be like putting the line:

    #include <stdlib.h>

at the top of every text/c-program body part.
What I'm suggesting now is like assuming every text/c-program gets
stdlib.h prepended before compiling.

This assumes that text/html data always has the HTML DTD entity in
front of it, but that assumption has always been there. Besides,
forcing a text/html parser to grok SGML document entities creates some
sticky issues: we'd have to limit the prologue to the simple
<!DOCTYPE HTML SYSTEM>, and that restriction is not really legal.
You're supposed to be able to do things like:

    <!DOCTYPE HTML SYSTEM [
    <!ENTITY smiley ":-)"> <!-- add my own "macro" -->
    ]>
    <HTML><TITLE>The history of the smiley: &smiley;</TITLE>
    ...

If we adopt this change of perspective, we should make it clear in the
HTML specification that, for the purposes of SGML, a text/html body
part is not an SGML document entity but an SGML text entity. Together,
the html.dtd entity and the text/html body part (the text entity) make
up an SGML document.

I need to update html.dtd, fix-html.pl, and the www_and_frame
materials to reflect this change of perspective.

By the way: this change makes it more straightforward to use an SGML
declaration other than the default, e.g. to increase NAMELEN to allow
tag names longer than 8 characters. Should we do that while we're at
it?

Dan

p.s. Check out the MidasWWW browser. It's long overdue in the WWW
project, but it's worth the wait!
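
For illustration only, here is a rough sketch of the two-entity model
described above, written in Python rather than taken from any WWW
code. The function name assemble_sgml_document and its internal_subset
parameter are made up for this example. It only builds the document
text that would be handed to an SGML parser; it does no parsing or
validation itself.

```python
def assemble_sgml_document(body, internal_subset=None):
    """Prepend the implied HTML prolog to a text/html body part.

    Hypothetical helper: the body is treated as an SGML text entity,
    and the receiver supplies the prolog instead of the sender.
    """
    if body.lstrip().upper().startswith("<!DOCTYPE"):
        # Under this model the body part carries no prolog of its own.
        raise ValueError("text/html body part should not include a DOCTYPE")

    if internal_subset:
        # Optional local declarations, e.g. extra entity "macros".
        prolog = "<!DOCTYPE HTML SYSTEM [\n" + internal_subset.strip() + "\n]>"
    else:
        prolog = "<!DOCTYPE HTML SYSTEM>"

    # Document = prolog (referencing the DTD entity) + text entity.
    return prolog + "\n" + body


body = "<HTML><TITLE>The history of the smiley</TITLE>"
document = assemble_sgml_document(body)
print(document)
```

The point of the design is that the body part stays free of any
prolog, so whoever assembles the document remains free to add an
internal declaration subset (or a different SGML declaration) without
the sender having to know about it.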