Doctypes, Declarations, and HTML Versions

Some recent threads have raised issues regarding the doctype declaration
in HTML documents (e.g. Bert Bos' mention of its omissibility per RFC
1866, and my citation of a Usenet post by Dan Connolly clarifying that
internal subsets are ostensibly verboten).

I find the relevant sections of the 4.01 spec (which seems to be the same
as 4.0 in this respect), 7.1 and 7.2, quite unhelpful, if not seriously
misleading.  By confusing an SGML syntactic function with an HTML semantic
function, the spec has made a mystical incantation out of the doctype
declaration. 


1.  Section 7.1 states

:  An HTML 4.01 document is composed of three parts:
:
:     1. a line containing HTML version information,
:     [...]

and Section 7.2 explains this as

:  A valid HTML document declares what version of HTML is used in the
:  document. 

which looks fine, merely as a desideratum, but then there's this:

:  The document type declaration names the document type definition (DTD)
:  in use for the document (see [ISO8879]). 

Unfortunately, this statement - as an assertion about naming - has *no*
basis in ISO8879.  Moreover, in the relatively obvious semantic intent, it
is flat out wrong. 

If a normative reference to ISO8879 is to be invoked at all, then at the
least it needs to be made very clear that, for the purposes of HTML
alone, neither required nor sanctioned by ISO8879, certain extra
application-specific conventions are being mandated.  This is because,
per ISO8879, it is *not* a function of the doctype declaration to
identify a "version", much less to do so specifically in the form of a
tactically convenient FPI with a public text class of DTD.
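
To make the FPI point concrete, here is a rough sketch (mine, not anything
from either spec) that pulls apart the identifier used in the HTML 4.01
doctype.  The helper name parse_fpi is hypothetical, and the split is
deliberately simplified: it assumes the four-field "-//owner//class
description//language" shape and ignores ISO owner identifiers, the
unavailable text indicator, and display versions.

```python
def parse_fpi(fpi: str) -> dict:
    """Split a formal public identifier into its '//'-separated fields.

    Simplified sketch: assumes exactly four fields, as in the HTML FPIs.
    """
    owner_prefix, owner, text, language = fpi.split("//")
    # The text identifier is "<public text class> <public text description>".
    text_class, _, description = text.partition(" ")
    return {
        "registered": owner_prefix == "+",  # "-" marks an unregistered owner
        "owner": owner,
        "public_text_class": text_class,
        "description": description,
        "language": language,
    }


fpi = parse_fpi("-//W3C//DTD HTML 4.01//EN")
print(fpi["public_text_class"], "|", fpi["description"])
```

The only place "4.01" shows up is the free-text description field.  Reading
that field as a version number is an HTML convention layered on top of the
identifier, which is exactly the distinction I think the spec blurs.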


2. From ISO8879 Clause 4 "Definitions":

| 4.103 (document) type declaration: A markup declaration that formally
| specifies a portion of a document type definition.
| NOTE - A document type declaration does not specify all of a document
| type definition because part of the definition, such as the semantics
| of elements and attributes, cannot be expressed in SGML. [...]

| 4.105 document (type) definition: Rules, determined by an application,
| that apply SGML to the markup of documents of a particular type.
| NOTE - Part of a document type definition can be specified by an SGML
| document type declaration.  Other parts, such as the semantics of
| elements and attributes, cannot be expressed formally in SGML. [...]

The basic point is that the purpose - indeed, the only purpose - of a
doctype declaration is *syntactic*: to incorporate the machine-processable
part of a document type definition (the declaration subset.)  This subset
is logically and syntactically an integral part of the document: it is
needed in order to complete a parse according to SGML rules.  (That's why
the WebSGML TC has made doctype declarations optional, for cases where no
information beyond the instance data is needed in order to complete an
unambiguous parse.)  That part or all of the definition may come through
an external reference (analogous to #include in C) is irrelevant.

Nowhere in the HTML spec is this syntactic function specifically pointed
out.  Its importance lies in the fact that, if only for SGML conformance,
there is a necessary relation between the declaration subset and the
instance markup: they must be mutually consistent.  So, if there is to be
a declaration subset at all, it must describe the markup actually used
in the document.  There is no ISO8879-sanctioned reason to have a doctype
declaration (for the DTD it incorporates) at all, otherwise.
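
As a toy illustration of that mutual consistency, the sketch below takes a
made-up document whose whole DTD sits in the internal declaration subset and
checks that every element used in the instance is declared there.  It is
regex matching, not an SGML parser, and it says nothing about content
models; the "memo" document type and both function names are invented for
the example.

```python
import re

# A hypothetical SGML document whose entire DTD is given in the internal
# declaration subset (no external identifier at all).
DOCUMENT = """<!DOCTYPE memo [
  <!ELEMENT memo - - (to, body)>
  <!ELEMENT to   - - (#PCDATA)>
  <!ELEMENT body - - (#PCDATA)>
]>
<memo><to>www-html</to><body>Hello</body></memo>
"""


def split_prolog(doc):
    """Return (declaration subset, instance) for a doctype with an internal subset."""
    match = re.match(r"\s*<!DOCTYPE\s+[\w.-]+\s*\[(.*?)\]\s*>(.*)", doc, re.S)
    if match is None:
        raise ValueError("no doctype declaration with an internal subset found")
    return match.group(1), match.group(2)


def undeclared_elements(doc):
    """Names of elements used in the instance but not declared in the subset."""
    subset, instance = split_prolog(doc)
    declared = {name.lower() for name in re.findall(r"<!ELEMENT\s+([\w.-]+)", subset)}
    # Start tags only: "</" is skipped because "/" is not a letter.
    used = {name.lower() for name in re.findall(r"<([A-Za-z][\w.-]*)", instance)}
    return used - declared


print(undeclared_elements(DOCUMENT))
```

Swap the body element in the instance for, say, a note element and the check
reports it as undeclared.  The subset is there to describe the markup that
follows it, whatever name or identifier the doctype declaration carries.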


Of course, this leaves open the real issue, which is how to convey the
*semantic* import of a version specification.  Unfortunately, ISO8879
doesn't provide a way.  All we know is that the doctype declaration
definitely does not qualify.

 <URL:http://www.deja.com/=dnc/getdoc.xp?AN=325927738>



Arjun

Received on Saturday, 2 October 1999 01:43:28 UTC