Message-Id: <9211102338.AA02403@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: www-talk@nxoc01.cern.ch
Subject: Re: proposed registration of type 'text/html' for MIME
In-Reply-To: Your message of "Tue, 10 Nov 92 15:13:07 EST." <m0mp1xh-00009MC@garnet.msen.com>
Date: Tue, 10 Nov 92 17:38:19 CST
From: Dan Connolly <connolly@pixel.convex.com>

>Here's the form for registering 'text/html' partly filled in, from RFC
>1341.

I strongly suggest we bring the definition of HTML into conformance
with the SGML standard before we register it with the IANA.

>Published specification:
>	"The HTTP Protocol as Implemented in W3", available for
>	anonymous ftp from ftp://info.cern.ch/pub/doc/www/http.txt.
>	Describes the HTTP interactive access protocol and the tags used
>	in HTML documents.

This is the HTTP document, not the HTML document:

    This document defines the Hypertext Transfer protocol (HTTP) as
    currently implemented by the WorldWideWeb initiative software.

The HTML document is:

    http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html

an old version of which is contained in http.txt.

In any case, both documents mention some relationship between HTML
and SGML which is not formally defined:

    The hypertext mark-up language is an SGML format. This defines
    the basic syntax used. The particular language, the set of tags
    and the rules about their use, and their significance is not part
    of the SGML standard. There being no standard on this, we have
    adopted a set which seems sensible. We call them HTML --
    hypertext markup language. HTML is not an alternative to SGML,
    it is a particular format within the SGML rules (an SGML "DTD").

The standard is very clear on this kind of thing. [I just got myself
a copy, so I can quote it:]

    4.103 (document) type declaration: A markup declaration that
    contains the formal specification of a document type definition.

    4.104 document type declaration subset: The element, entity, and
    short reference sets occurring within the declaration subset of a
    document type declaration.

    4.105 document (type) definition: Rules, determined by an
    application, that apply SGML to the markup of documents of a
    particular type. A document type definition includes a formal
    specification, expressed in a document type declaration, of the
    element types, element relationships, and attributes, and
    references that can be represented by markup. It thereby defines
    the vocabulary of the markup for which SGML defines the syntax.

So it seems that the HTML DTD is missing the "formal specification."

I have written a document type declaration subset that matches HTML
as currently defined and implemented, with a few exceptions (most
importantly, the PLAINTEXT tag). See

    http://info.cern.ch/hypertext/WWW/MarkUp/HTML.dtd

Most existing HTML documents need only small modifications to bring
them into conformance (quote attribute values, add the
<!DOCTYPE ...> prologue). And the existing WWW browser parses
conforming documents just fine.

    Currently HTML documents are transmitted without the normal SGML
    framing tags, but if these are included parsers will ignore them.

I don't know what "the normal SGML framing tags" are. An SGML
document has three parts: the SGML declaration, the prologue, and the
instance. It is common in SGML applications to use an implied SGML
declaration and include the prologue by reference (kinda like an
#include directive in C), but without these "framing tags," it's just
not an SGML document. Besides, it's very little work to add the line:

    <!DOCTYPE HTML SYSTEM>

at the beginning of HTML documents.

More non-conforming stuff in Markup.html:

    Plaintext

    This tag indicates that all following text is to be taken
    literally, up to the end of the file. Plain text is designed to
    be represented in the same way as example XMP text, with fixed
    width character and significant line breaks.

    Format:

        <PLAINTEXT>

    This tag allows the rest of a file to be read efficiently without
    parsing. Its presence is an optimisation. There is no closing tag.

This should be moved outside the definition of HTML. It should just
be part of the HTTP protocol: if the server starts the response with
<PLAINTEXT>, what you're getting is plain text, not SGML.

Another problem:

    Example sections

    The text may contain any ISO Latin printable characters,
    including the tag opener, so long as it does not contain the
    closing tag in full.

This doesn't fit in SGML. The ETAGO delimiter ("</") ends a CDATA
section.

A clarification:

    Paragraph

    This tag indicates a new paragraph. The exact representation of
    this (indentation, leading, etc) is not defined here, and may be
    a function of other tags, style sheets etc. The format is simply

        <P>

    (In SGML terms, paragraph elements are transmitted in minimised
    form).

The implementation suggests that the <P> tag marks an empty element,
a paragraph separator, rather than allowing minimization in the form
of an omitted end tag, </P>.

We could even go so far as to call WWW an SGML application:

    4.279 SGML Application: Rules that apply SGML to a text
    processing application. An SGML application includes a formal
    specification of the markup constructs used in the application,
    expressed in SGML. It can also include a non-SGML definition of
    semantics, application conventions, and/or processing.

    Note 2 The formal specification of an SGML application
    constitutes the common portions of the documents processed by the
    application. These common portions are frequently made available
    as public text.

In other words, ftp://info.cern.ch/pub/doc/the_www_book.txt would
serve as the "non-SGML definition." [by the way: I could only find
postscript and LaTeX versions of the book: no txt file.]

The "common portion" is html.dtd (we could obtain a public text
identifier for it...).
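
To make the XMP problem above concrete, here is a rough sketch in
Python. It is only an illustration under a simplified assumption,
namely that any "</" ends CDATA content, as stated above; a real SGML
parser's delimiter recognition has more to it. The function names
sgml_cdata_end and html_xmp_end are made up for this example and are
not part of any WWW software.

```python
import re

ETAGO = "</"  # SGML end-tag open delimiter
XMP_CLOSE = re.compile(r"</XMP>", re.IGNORECASE)


def sgml_cdata_end(text: str, start: int = 0) -> int:
    """Index where CDATA content ends under the simplified SGML rule:
    the first occurrence of ETAGO. Returns len(text) if none is found."""
    pos = text.find(ETAGO, start)
    return len(text) if pos == -1 else pos


def html_xmp_end(text: str, start: int = 0) -> int:
    """Index where XMP content ends under the MarkUp.html rule:
    only the closing tag in full ends the section."""
    match = XMP_CLOSE.search(text, start)
    return len(text) if match is None else match.start()


# The text that follows an <XMP> start tag in some document.
body = "Use <B>bold</B> text</XMP>"

print(repr(body[:sgml_cdata_end(body)]))  # 'Use <B>bold'
print(repr(body[:html_xmp_end(body)]))    # 'Use <B>bold</B> text'
```

The two rules disagree as soon as the example text contains any end
tag of its own, which is exactly the case the MarkUp.html wording
tries to allow.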

If we want to do this (define an SGML application), section 15.5
requires this statement to be plastered all over the place:

    An SGML Application Conforming to
    International Standard ISO 8879 --
    Standard Generalized Markup Language

If we're gonna use SGML, why not do it right?

Dan
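
P.S. Adding the <!DOCTYPE HTML SYSTEM> line to existing documents is
mechanical. A minimal sketch in Python, assuming the whole document
is held in one string; add_prologue is a hypothetical helper, not
something in the WWW library, and it does not attempt the other fix
(quoting attribute values).

```python
DOCTYPE = "<!DOCTYPE HTML SYSTEM>"


def add_prologue(document: str) -> str:
    """Prepend the DOCTYPE line unless the document already begins
    with a document type declaration (checked case-insensitively)."""
    if document.lstrip()[:9].upper() == "<!DOCTYPE":
        return document
    return DOCTYPE + "\n" + document


print(add_prologue("<TITLE>Example</TITLE>\n<P>Some text"))
```

Running it twice on the same document leaves the result unchanged,
so it should be safe to apply to a mixed set of old and already
converted files.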