Make DTDs optional?

An Odd Idea
===========
At first, the Good SGMLer in me said that DTDs should be absolutely required
during authoring and that every document should validate to some DTD (and I
stated that on this list a few weeks ago). But then, I thought a little more
about it, and I'm no longer convinced that that is the case.

DTDs are absolutely necessary for most applications where SGML is used
today. SGML is almost always used today to enforce conformance to a document
structure. Since you couldn't, before XML, do much useful stuff with SGML
documents without DTD-specific coding, a document without a DTD would be
basically useless. But in the age of XML/DSSSL you can deliver it online and
in print, and full-text index it. In other words, you can do all of the
things people expect to do with proprietary formats like Word for Windows
and PDF and with trivial formats like HTML and RTF. If we loosen the
requirement for a DTD,
XML could do everything these formats do without making the creation process
of these documents more expensive.

The Advantages
==============
For instance, let's say Jane Author is working in Word for Windows. The
document she is creating does not conform to any DTD I know of. When it's
done, however, she wants to deliver it as XML. Why?

 * Because XML is widely supported (we hope). 
 * Because XML is compact (we hope). 
 * Because XML will preserve whatever structure exists in the document (and
Word allows quite a bit). 
 * Because XML is more portable, more device-independent and more widely
supported than Word for Windows format.
 * Because XML is "open" and standardized.
 * Because XML is easy to full-text index.

So doesn't it make sense for her to do a "save as XML" and get a structured,
portable, device-independent XML document for delivery on the World Wide
Web? I would expect the XML document to have one element for each paragraph
and character style. This document may not conform to any existing DTD, but
it might still take advantage of all of the other benefits of XML described
above.
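
As a rough illustration of the "save as XML" idea, here is a minimal sketch
in Python using the standard xml.etree.ElementTree module. Everything named
in it (the "document" root element, the style names, the style_to_tag and
save_as_xml helpers) is an assumption made up for the example, not something
any word processor actually emits. It maps each paragraph style and
character style to an element, then re-parses the result to show that the
output is usable without any DTD.

```python
import re
import xml.etree.ElementTree as ET


def style_to_tag(style_name):
    """Turn a word-processor style name into a legal XML element name.

    Hypothetical helper: runs of characters that are not letters, digits,
    '_', '.' or '-' become a single hyphen.
    """
    tag = re.sub(r"[^A-Za-z0-9_.-]+", "-", style_name.strip()).strip("-")
    # An XML name may not start with a digit, '.' or '-'.
    if not tag or not (tag[0].isalpha() or tag[0] == "_"):
        tag = "style-" + tag
    return tag


def save_as_xml(paragraphs):
    """Serialize (paragraph_style, runs) pairs as XML with no DTD.

    Each run is a (character_style_or_None, text) pair.
    """
    root = ET.Element("document")  # assumed root element name
    for para_style, runs in paragraphs:
        para = ET.SubElement(root, style_to_tag(para_style))
        last_child = None
        for char_style, text in runs:
            if char_style is not None:
                # Styled run: one child element per character style.
                last_child = ET.SubElement(para, style_to_tag(char_style))
                last_child.text = text
            elif last_child is None:
                # Unstyled text before any styled run.
                para.text = (para.text or "") + text
            else:
                # Unstyled text following a styled run.
                last_child.tail = (last_child.tail or "") + text
    return ET.tostring(root, encoding="unicode")


sample = [
    ("Heading 1", [(None, "Make DTDs optional?")]),
    ("Body Text", [(None, "XML is "), ("Emphasis", "portable"), (None, ".")]),
]
xml_text = save_as_xml(sample)

# A non-validating parser reads the result back without any DTD.
reparsed = ET.fromstring(xml_text)
```

The element names come straight from the author's own styles, which is the
point: the structure that already exists in the document is preserved, and
nothing claims conformance to a document type.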

The Alternative
===============
If we require a DTD of this author in this situation:

a) she will decide that encoding in XML is too much work and give up (and
lose out on its non-validation-related benefits) OR

b) she (or Word) will create a trivial DTD that is of no value to anyone,
and actually ends up obscuring the fact that this document is NOT MEANT TO
CONFORM to any particular application of XML OR

c) she will throw away data by encoding the document in a very general DTD
like HTML (which is equivalent to not using a DTD at all) or a highly
non-prescriptive DTD like TEI (which is equivalent to creating a trivial DTD)
OR

d) she will shoehorn her document into a DTD that is completely wrong for it.

In other words, the cause of Good Documents will probably not be advanced at
all.
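
To make option (b) concrete, here is a small sketch, again assuming Python
and xml.etree.ElementTree, of how mechanically a "trivial DTD" could be
generated from a finished document. The trivial_dtd helper and the sample
element names are hypothetical. Every element is simply declared with
content model ANY, so the resulting DTD places almost no constraint on
structure.

```python
import xml.etree.ElementTree as ET


def trivial_dtd(xml_text):
    """Build a DTD that declares every element found in the document as ANY.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # Collect each distinct element name once, in a stable order.
    names = sorted({element.tag for element in root.iter()})
    return "\n".join(f"<!ELEMENT {name} ANY>" for name in names)


sample_doc = (
    "<document><Heading-1>Make DTDs optional?</Heading-1>"
    "<Body-Text>XML is <Emphasis>portable</Emphasis>.</Body-Text></document>"
)
dtd_text = trivial_dtd(sample_doc)
```

The document would validate against such a DTD, but the DTD tells a reader
or a program nothing that the document itself does not already show.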

What does an author lose if we do not require a DTD for every document?

 a) the document type is no longer self-describing. But how meaningful is a
document type description "<HTML>" or "<TEI.2>"? They tell you very little
about the type or structure of the document. (Contrast this with the next
point.)

 b) the document format is no longer self-describing, so code cannot know
the semantics of the markup. That is a fairly big loss. In response, I would
say that there are thousands, perhaps hundreds of thousands of documents in
the world that were never meant to be processed by anything beyond a
browser/printer and a full-text indexer. These authors will never research
and choose a correct document type. Why shouldn't we make a standard that
encompasses their needs too? What do we lose?

 c) the document no longer conforms to some external standard. But as
mentioned before, the author may not care. 
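
As a sketch of the "browser/printer and a full-text indexer" case from point
(b), the following Python fragment indexes a DTD-less document using only
the standard library. The full_text_index helper, the sample document and
its element names are assumptions made up for the example. The indexer needs
no knowledge of what the markup means; it only records which element each
word appears in.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def full_text_index(xml_text):
    """Map each lowercased word to the set of element names containing it.

    Hypothetical helper: works on any well-formed document, with no DTD.
    """
    index = defaultdict(set)

    def add_words(text, tag):
        for word in re.findall(r"\w+", (text or "").lower()):
            index[word].add(tag)

    def walk(element):
        # Text directly after the start tag belongs to this element.
        add_words(element.text, element.tag)
        for child in element:
            walk(child)
            # Text after a child's end tag still belongs to the parent.
            add_words(child.tail, element.tag)

    walk(ET.fromstring(xml_text))
    return index


sample_doc = (
    "<document><Heading-1>Make DTDs optional?</Heading-1>"
    "<Body-Text>XML is <Emphasis>portable</Emphasis>.</Body-Text></document>"
)
index = full_text_index(sample_doc)
```

A query such as index.get("portable") then reports the element the word was
found in, which is more than a full-text index over RTF or PDF could say
without format-specific code.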

But what about Rigour?
======================
Of course there are massive benefits to standardization. And in many
situations it makes a lot of sense to _require_ standardization. But merely
requiring a DTD does not enforce standardization, because there are so many
trivial and useless DTDs in the world (and an infinite number of them still
to be written). 

I believe that the structural quality of the average XML document will
actually GO DOWN under my proposal. A lot of quasi-structured
documents will be put on the web as XML. But isn't it better that they be in
XML (which allows fine-grained descriptive markup) than in HTML, RTF or PDF?
I hope that under my proposal, the structural quality of the _average
document_ will go UP because so many documents that would have had their
structural features buried in a proprietary format will be encoded in XML
instead. I think that this is what we should be aiming for. 

"Encoded in XML/SGML" was never a guarantee of quality structural markup.
This situation is neither better nor worse under my proposal.

But aren't DTDs integral to structural markup?
===============================================
My reading of Annex A.3 indicates that DTDs were primarily intended to be
used for markup minimization. Goldfarb says: "Document type definitions have
uses in addition to markup minimization." (a slight understatement =) )

But what about XML editors?
===========================
As demonstrated by my example above, I am trying to blow open the definition
of an XML editor to allow all standard word processors to become XML editors
without the expense of incorporating SGML parsers and validators. XML could
be the "trivial" ASCII export format of all of these tools.

SGML editors would of course have a major place in the production of XML.
They would be known as "DTD-validating" XML editors. In other words, when
you DID care about conforming to a particular DTD standard, you would use an
SGML editor. And many structure-serious authors would want the control over
structure that an SGML editor would allow. After all, allowing Word for
Windows to export something that has SGML-style tags does not turn Word into
a great tool for creating structural documents. There would also be a huge
market for "porting" these "trivial dumps" into DTDs for incorporation into
an SGML system (just as people currently port proprietary formats into SGML,
except that the other formats would be easier to parse).

Increasing the number of descriptive-encoded documents in the world can only
help the SGML vendors (as the HTML experience has shown).

Could you summarize?
====================
I think that we should only require DTDs when DTDs are required. Which means
that the requirement for DTDs should not be in the XML standard but in the
standards built on top of it ("XML applications").

 Paul Prescod

Received on Saturday, 28 September 1996 10:30:49 UTC