Re: A28: syntax of markup declarations? (LONG)

 A.28 Should XML use the markup-declaration syntax described by ISO 8879
 clauses 10-11, or should XML define a specialized document type and let
 its markup declarations use the document-instance syntax, as proposed
 by MGML?

This is a long posting in favor of using XML document-instance syntax,
rather than SGML markup-declaration syntax, for XML markup declarations.

1. Terminology and Examples
2. Motivation
3. SGML Interoperability Issues
4. HTML Interoperability Issues
5. Appropriateness, Scope, and Timing Issues

1. Terminology and Examples

If we use instance syntax for XML DTD's, then to avoid misunderstanding
we shouldn't call them DTD's.  I have been using the term "Document
Structure Declaration" (DSD) since XML syntax is hardwired, and what's being
declared is really structure.

If we are using DSDs, and DSDs are XML documents, then clearly there must
be a DSD for DSD's - I have been calling this the XML Reference DSD.

I have made several documents available to help out:

A draft of the XML reference DSD:   http://www.textuality.com/dsd/xml-ref.dsd

An SGML DTD for xml-ref.dsd:        http://www.textuality.com/dsd/xml-ref.dtd

A draft of a simple DSD for papers: http://www.textuality.com/dsd/paper.dsd

The initial draft of the paper on this subject that I submitted to SGML'96 -
accepted, but superseded by our work in this group. Only here to provide a 
reference point for paper.dsd:      http://www.textuality.com/dsd/paper.xml

I should note that xml-ref.dsd includes some significant contributions
of both thought and syntax from Michael Sperberg-McQueen.

Whether you want to read this stuff before or after the rest of this posting
is up to you.  I would advise waiting; the issue is not (supposed to be)
the quality of the DSD, but the advisability of having one.

2. Motivation

2.1 Quality of Specification

This is close to my heart as co-editor of the XML spec.  If we use DTD
syntax, the spec will have to: (a) specify the DTD syntax, (b) specify how
the declarations in the DTD syntax impose constraints on instances.  If
we use instance syntax, we lose almost all of (a); it turns out that if
you explain how recognizers for elements and attributes work, then you
can do most of the job just by listing fragments from the Reference DSD
and explaining the effect of the various elements and attributes.  Michael
Sperberg-McQueen has provided a minimalist literate-programming kind of DTD 
so that the spec can actually include the master copy of the reference DSD.

If we use instance syntax, we will have a shorter, more elegant, more
airtight, and more comprehensible spec, and something upon which the
prospective XML processor author can easily get bootstrapped.

2.2 Integrity

I for one, once we get XML done, plan to start browbeating the document
management and authoring practitioners of the world along the lines of "now
that XML exists, you have NO EXCUSE for not using descriptive markup in your
structured documents!"  If I convince them, and they then start using XML,
and the first structured document they encounter is a DTD, with its rather
ad-hoc syntax, then I've lied to them.  I find this completely unacceptable.

2.3 Ease of Implementation

If there is only one syntax for declaration and instance, then the
prospective XML processor author only needs one lexer and one parser.
Granted, the DTD language is not the hardest in the world, but two
lexer/parsers are always harder than one to build.  Once you've built one
little perl/VB/rexx/Java thingie that can pull apart an XML instance (which
we're planning to make easy), you can then pull apart XML declarations too.
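
To make the "one lexer" point concrete, here is a minimal sketch in Python.
It is not a conforming parser for anything: it assumes double-quoted
attribute values and no comments, PIs, or entity references.  The `dsd` and
`element` tags and their `name` and `content` attributes are hypothetical
stand-ins, not taken from xml-ref.dsd.  The same `tokens` function pulls
apart both an instance and a declaration set.

```python
import re

# One start or end tag; attribute values are assumed to be double-quoted.
TAG = re.compile(r'<(/?)([A-Za-z][\w.-]*)((?:\s+[\w.-]+\s*=\s*"[^"]*")*)\s*>')
ATTR = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')

def tokens(text):
    """Yield (kind, name_or_data, attrs) with kind in start/end/text."""
    pos = 0
    for m in TAG.finditer(text):
        data = text[pos:m.start()].strip()
        if data:
            yield ("text", data, {})
        is_end, name, attrs = m.groups()
        if is_end:
            yield ("end", name, {})
        else:
            yield ("start", name, dict(ATTR.findall(attrs)))
        pos = m.end()
    data = text[pos:].strip()
    if data:
        yield ("text", data, {})

# A document instance.
INSTANCE = '<paper><title>On DSDs</title></paper>'

# A hypothetical declaration set written in the same syntax.
DSD = ('<dsd><element name="paper" content="title">'
       '<element name="title" content="text"></dsd>')

for sample in (INSTANCE, DSD):
    for token in tokens(sample):
        print(token)
```

Nothing in `tokens` knows whether it is reading a paper or the declarations
for one; that is the whole argument of this section.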

2.4 Familiarity

If we can go to the HTML community and tell them "not only can you now add
your own structures to documents and still deliver them on the web, but
here's how you do it, and it's just another bunch of tags", this removes
another significant barrier to acceptance.  I cannot in good conscience
defend to these people the proposition that to do something this
straightforward, obvious, and good, they have to learn another language.

2.5 Tools

If DTDs are instances, then you can use your existing SGML editors and
document management systems and searchers and other value-adds to manage
them - thus bringing an important class of metadata into the domain of an
important class of tools.  Yes, I think that formatting directives and
search-metadata and entity catalogs should also be in SGML!

2.6 Other Improvements

The proposed reference DSD makes it impossible to declare pernicious mixed
content.  Perhaps there are other things that could be wired in as well.

3. SGML Interoperability Issues

There are some problems here.  First of all, (a) how do we deal with the 
fact that SGML documents need to have a <!DOCTYPE and a subset, and if XML
does too, then (b) how do you keep SGML parsers from stumbling over this
weird syntax, and (c) how do you deal with the fact that you have to maintain
two copies of these things?

(a) XML processors must be prepared to read (and ignore) a <!DOCTYPE
(b) I propose hiding the XML markup declarations from SGML processors inside
    a processing instruction as follows: <?XML DSD ... >.
    Similarly, I propose <?XML SSD for XML declaration subsets.
    [note: either we change PIC, or we live with a lot of &gt; in markup 
    declarations, or we figure out a better way to hide XML markup 
    declarations.]
(c) By virtue of the ERB's resolution of 2 October, XML DSD's must be
    trivially & mechanically transformable into SGML DTD's.  So to avoid
    maintaining two copies, you have an idiotic little processor (which
    dozens of people on this group could write by tomorrow) that 
    (1) reads an XML instance,
    (2) finds the XML DSD & subset, generates an equivalent SGML DTD,
    (3) generates a <!DOCTYPE with a pointer to the new DTD, and a subset
        containing the SGML versions of any declarations in the XML subset. 
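
The little processor in (c) might look roughly like the Python sketch
below.  Everything about the input is an assumption made for illustration:
the `dsd`, `element`, and `attribute` names and their attributes are
hypothetical (they are not the vocabulary of xml-ref.dsd), the DSD is
assumed to have been extracted from the instance already, and it is written
so that Python's standard xml.etree.ElementTree parser can read it.  The
`- -` tag-omission indicators in the output are the usual form when OMITTAG
is enabled; adjust to taste.

```python
import xml.etree.ElementTree as ET

# Hypothetical DSD; the vocabulary is invented for this sketch.
SAMPLE_DSD = """
<dsd>
  <element name="paper" content="(title, para+)"/>
  <element name="title" content="(#PCDATA)"/>
  <element name="para" content="(#PCDATA)">
    <attribute name="id" type="ID" default="#IMPLIED"/>
  </element>
</dsd>
"""

def dsd_to_dtd(dsd_text):
    """Step (2): turn each declared element into SGML declarations."""
    root = ET.fromstring(dsd_text)
    lines = []
    for el in root.iter("element"):
        name = el.get("name")
        lines.append(f"<!ELEMENT {name} - - {el.get('content')}>")
        attrs = el.findall("attribute")
        if attrs:
            defs = " ".join(
                f"{a.get('name')} {a.get('type')} {a.get('default')}"
                for a in attrs
            )
            lines.append(f"<!ATTLIST {name} {defs}>")
    return "\n".join(lines)

def doctype_for(root_name, dtd_system_id, subset=""):
    """Step (3): a <!DOCTYPE pointing at the generated DTD, plus any subset."""
    internal = f" [\n{subset}\n]" if subset else ""
    return f'<!DOCTYPE {root_name} SYSTEM "{dtd_system_id}"{internal}>'

print(dsd_to_dtd(SAMPLE_DSD))
print(doctype_for("paper", "paper.dtd"))
```

Step (1), finding the DSD inside the instance, depends on how the
declarations end up being hidden from SGML processors, so it is left out
here.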

4. HTML Interoperability Issues

No problem here, due to a sleazy trick.  The XML DSD language
contains no character data anywhere - all text is in attribute values, and
all elements are either EMPTY or have element content.  So you can
just drop the whole thing anywhere into an XML-masquerading-as-HTML
document, and it will all be ignored by an HTML processor.

You *may* have to remove the PIO and PIC, unless HTML decides to learn PI's.
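
A quick way to see why the trick works, sketched in Python.  The standard
html.parser module stands in for "an HTML processor" here, which is an
assumption: it is a tokenizer, not a browser, but like a browser it lets
you skip tags you don't recognize and keep only character data.  The `dsd`
and `element` tags are hypothetical, and the PIO/PIC wrapper is left off.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect character data and ignore every tag, known or unknown."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def visible_text(page):
    parser = TextOnly()
    parser.feed(page)
    parser.close()
    return parser.chunks

# An HTML-ish page with hypothetical declarations dropped into the middle.
PAGE = """<html><body>
<p>Before the declarations.</p>
<dsd>
  <element name="paper" content="title">
  <element name="title" content="text">
</dsd>
<p>After the declarations.</p>
</body></html>"""

print(visible_text(PAGE))
```

Because the declarations carry everything in attribute values, they
contribute no character data, and only the two paragraphs come out.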

5. Appropriateness, Scope, and Timing Issues

The argument has been advanced that "this is a good idea, but we don't
have enough time to do it properly, and anyhow it's a job for WG8."  This
may in fact be correct; I can only say "I disagree," because:

(a) I think that the simplicity and ease of understanding of the DSD option
    greatly increases XML's chances of acceptance.  
(b) I think that we need this done on an Internet timescale, not a WG8 
    timescale.
(c) We have promised a converter to DTD notation, so if we make design 
    errors, there is an easy escape hatch for SGML people.  
(d) We have at the moment a rare confluence of talent, focus, and
    energy, and thus a chance to make it happen, which will not be repeated.
(e) It has been done before *at least* by Exoterica and Wayne Wohler and 
    Michael Sperberg-McQueen.
(f) WG8 will do a better job if they have a large-scale working experiment 
    on the Web to learn from.
(g) We don't have the time to do a proper document design for DTD's, but
    we don't have to: Goldfarb et al did that 10 years ago.

This is the RIGHT THING TO DO.  Looking back in a few years, it will be
much easier to justify having made some errors in this effort, than it
will be to justify having let the opportunity slip away.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

Charles F. Goldfarb * Information Management Consulting * +1(408)867-5553
           13075 Paramount Drive * Saratoga CA 95070 * USA
  International Standards Editor * ISO 8879 SGML * ISO/IEC 10744 HyTime
 Prentice-Hall Series Editor * CFG Series on Open Information Management