DTDs and XML conformance

Here's the writeup I mentioned earlier.  I tried to be thorough, but I may
have missed some things.  Comments are welcome, of course.

	Eve

			*		*		*

I've already been asked many times by clients whether their existing DTDs
should "conform to XML." In most cases, I believe the answer is no,
simply because the focus is on delivery of SGML over the Web (the primary
goal of the XML effort in the first place), rather than validation of SGML
over the Web. At the same time, I've already written a couple of new
XML-conforming DTDs because the client felt it was simpler this way!

It seems to me that you could use a two-tiered approach to conformance that
depends on the circumstances of creation and delivery.

(Note that I'm not addressing the scenario of ad hoc tagging, where there's
no DTD in the picture, and the creator is likely using XML-conforming
instance syntax from the beginning. I've also pretty much ignored
declarations involved in the SHORTREF and LINK features.)

---------------------------------------------------------------------------

Instance and Internal Subset Conform to XML Constraints

Why: Web delivery of instances, where any characteristics of the DTD worth
transmitting (such as architectural forms-type attributes and entity
declarations) are put into the internal subset as part of XML delivery.

The following list assumes that for any one instance, a portion of the DTD
might need to be sent in the internal subset. Below, "transformation"
refers to automatable preparation of such portions before they are
extracted and placed in the internal subset (precise details on which
declarations must be extracted aren't given here; maybe I'll get around to
that later).

What:

  1. The instance has to be well-formed: special empty-element and PI
     syntax, normalization, etc.

  2. Either element type declarations can't use CDATA or RCDATA declared
     content, or the elements' content in the instance must be transformed
     to escape the appropriate characters that look like markup

  3. The DTD should avoid attribute value defaulting if you want to
     minimize the need to put attribute list declarations in the internal
     subset (use #IMPLIED plus a style sheet instead); if default values
     are supplied, they must be quoted

  4. Attribute declared values can't be NAME[S], NUMBER[S], or NUTOKEN[S]
     (probably use NMTOKEN[S] instead, but also possibly CDATA)

  5. Attribute default values can't use #CURRENT (no good substitute)

  6. Attribute default values can't use #CONREF (use #IMPLIED plus a style
     sheet instead)

  7. Either SDATA entities can't be referenced, or SDATA entity references
     must be replaced with decimal or hexadecimal character references (or
     whatever substitute is appropriate) in the instance

  8. Either CDATA entities can't be referenced, or the entity type must be
     changed and the contents transformed to escape characters that look
     like markup

  9. Bracketed entities can't be referenced (in general, these make
     ill-formed entities because they contain only half of a markup
     construct)

 10. SUBDOC entities can't be referenced (it might take quite a bit of work
     to extricate and transform any uses of SUBDOC entities)

 11. Entity declarations must not have data attributes specified

 12. External entity declarations must conform to PUBLIC/SYSTEM syntax
     requirements

 13. DTD marked sections must be either transformed to remove any spaces
     around status keywords, or resolved; the TEMP keyword can't be used

 14. Parameter entities either conform to whatever ends up being allowed,
     or are transformed or resolved

 15. DTD comments within markup declarations are either removed or moved
     outside and turned into full comment declarations
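
A few of the transformations above (items 2, 7, 8, and 13) are mechanical
enough to sketch in code. The following Python is only an illustration under
stated assumptions: the entity mapping is hypothetical, entity references are
assumed to be already normalized so that they end in a semicolon, and regular
expressions stand in for a real SGML parser.

```python
import re

# Hypothetical mapping from SDATA entity names to Unicode code points.
# A real mapping would come from the entity sets the DTD actually uses.
SDATA_TO_CODEPOINT = {"mdash": 0x2014, "eacute": 0xE9}


def escape_markup(text):
    """Escape characters that look like markup (items 2 and 8)."""
    # The ampersand must be handled first so that the other
    # replacements are not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def replace_sdata_refs(text, mapping=SDATA_TO_CODEPOINT):
    """Replace known SDATA entity references with hex character references (item 7)."""
    def substitute(match):
        name = match.group(1)
        if name in mapping:
            return "&#x{:X};".format(mapping[name])
        return match.group(0)  # leave unknown references untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", substitute, text)


def tighten_marked_sections(dtd_text):
    """Remove spaces around marked-section status keywords (item 13)."""
    return re.sub(r"<!\[\s*([%\w;.-]+)\s*\[", r"<![\1[", dtd_text)
```

In practice each function would run over the right part of the document:
escape_markup over the content of elements that had CDATA or RCDATA declared
content, replace_sdata_refs over the instance, and tighten_marked_sections
over the declarations being moved into the internal subset.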

---------------------------------------------------------------------------

Instance, Internal Subset, and External Subset Conform to XML Constraints

Why: Validation of a document using XML tools that are not also validating
SGML parsers. I consider this an unlikely scenario, given the clamor for
many kinds of validation that SGML can't do today and given the desire to
do ad hoc tagging even when there's a DTD present.

The following list assumes that it's desirable to use the same DTD for SGML
and XML applications, without transformation.

What:

  1. As in the above scenario, the instance has to be well-formed: special
     empty-element and PI syntax, normalization, etc.

  2. Either element type declarations must contain no omitted-tag
     minimization specifications, or the specifications must be
     parameterized (according to the current XML-Lang spec) and resolve to
     null strings in the XML version

  3. Element type declarations can't use content model exceptions

  4. Element type declarations can't use AND (&) content models

  5. Element type declarations can't use CDATA or RCDATA declared content
     (use CDATA sections in the instance instead)

  6. Unlike the above scenario, the DTD can freely use attribute value
     defaulting; the default values must be quoted

  7. As in the above scenario, attribute declared values can't be NAME[S],
     NUMBER[S], or NUTOKEN[S] (probably use NMTOKEN[S] instead, but also
     possibly CDATA)

  8. Attribute default values can't use #CURRENT (no good substitute)

  9. As in the above scenario, attribute default values can't use #CONREF
     (use #IMPLIED plus a style sheet instead)

 10. SDATA entities can't be declared or referenced

 11. CDATA entities can't be declared or referenced (use CDATA sections
     instead)

 12. Bracketed entities can't be declared or referenced

 13. SUBDOC entities can't be declared or referenced

 14. As in the above scenario, entity declarations must not have data
     attributes specified

 15. Notation declarations must not have data attribute list declarations

 16. As in the above scenario, external entity declarations must conform to
     PUBLIC/SYSTEM syntax requirements

 17. DTD marked sections must have no spaces around status keywords; the
     TEMP keyword can't be used

 18. Parameter entities must conform to whatever ends up being allowed

 19. DTD comments must be in full comment declarations, outside other
     markup declarations
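
Since this scenario rules out transformation, a quick check of an existing
DTD against part of the list above may be useful. Here is a rough sketch in
Python. It is not a parser: the patterns are approximations of my own, they
assume keywords are written in uppercase, and they can produce false
positives (for example, on a literal that happens to contain a keyword).
The names RULES and lint_dtd are made up for this example.

```python
import re

# Each rule pairs an approximate pattern with the list item it illustrates.
NAME = r"[%\w;.-]+"
RULES = [
    (r"<!ELEMENT\b[^>]*\s[-+]\(", "content model exception (item 3)"),
    (r"<!ELEMENT\b[^>]*&", "AND content model (item 4)"),
    (r"<!ELEMENT\b[^>]*\b(?:CDATA|RCDATA)\b", "declared content (item 5)"),
    (r"<!ATTLIST\b[^>]*\b(?:NAMES?|NUMBERS?|NUTOKENS?)\b",
     "declared value (item 7)"),
    (r"#CURRENT\b", "#CURRENT default (item 8)"),
    (r"#CONREF\b", "#CONREF default (item 9)"),
    (r"<!ENTITY\b[^>]*\b(?:SDATA|CDATA|SUBDOC)\b",
     "SDATA, CDATA, or SUBDOC entity (items 10, 11, 13)"),
    (r"<!\[(?:\s+" + NAME + r"\s*|" + NAME + r"\s+)\[",
     "space around status keyword (item 17)"),
    (r"<!\[[^\[]*\bTEMP\b", "TEMP keyword (item 17)"),
]


def lint_dtd(dtd_text):
    """Return sorted (line number, message) pairs for suspicious constructs."""
    findings = []
    for pattern, message in RULES:
        for match in re.finditer(pattern, dtd_text):
            # Report the line on which the matched declaration starts.
            line_no = dtd_text.count("\n", 0, match.start()) + 1
            findings.append((line_no, message))
    return sorted(findings)
```

Calling lint_dtd on the text of a DTD returns an empty list when none of the
patterns match. That is only a hint, not proof of conformance, since several
items (quoted defaults, PUBLIC/SYSTEM syntax, parameter entities, comments)
are not checked at all.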

---------------------------------------------------------------------------

Additional XML-Related DTD Design Considerations

Whether your SGML tools have support for the TC version of SGML...

Received on Tuesday, 3 June 1997 19:40:36 UTC