Re: XML and required DTDs

Well, I've let things pile up and have read hundreds of e-mails
now... I was going to post a summary of my views on issues that 
concerned me, but this one grabbed my attention:

> If I understand them correctly, it seems that Paul Grosso, Len
> Bullard, and Robin Cover have all raised the same concern, by
> pointing to situations in which it's necessary to have a DTD, whether
> for authoring, for contractual purposes, or for other applications.
> 
> Let's distinguish four cases:
> 
> a DTDs Required for Parsing.
> 
> Declarations are always required, and in practice all applications must
> process them on each run, since it's impossible to parse the document
> correctly without knowing which elements are EMPTY, which are CDATA,
> etc.  (It's also possible to cache the essential information in some
> other form, but that's just a standard store-for-compute tradeoff.)
> This, roughly, is the situation with 8879:1986.
> 
> b DTDs Required for Validation
> 
> Declarations are always required, but the language is so constructed
> that the document can at least be parsed correctly* without reading the
> DTD.  Declarations need not always be read.  I.e. it's possible to do
> some kinds of useful work even without reading all the declarations.
> This, roughly, is what SGML would be like if the ETAGC proposal is
> adopted (at least for Minimal SGML documents without references to
> external entities).
> 
>      * (within some limits -- element boundaries and content
>      will be correctly identified; some non-significant white
>      space may be preserved unnecessarily)

Since there is no ETAGC delimiter, this isn't SGML.

> c DTDs Optional
> 
> Declarations are always allowed, but not always required; the system
> makes certain default assumptions if no declarations are provided.  This
> is the approach taken by PSGML.  (Or C, if a programming-language
> analogy is useful.  In programming, I don't find this helpful at all,
> but Tim and others have suggested plausibly that it may be useful in
> XML.)

This is certainly acceptable.  There are some rules that could be
applied to this... Essentially, you could still deliver an SGML document
without having to change the reference concrete syntax or convert
to some "SGML-like" language.
 
> d DTDs Forbidden
> 
> Declarations are never allowed; the system makes certain assumptions
> about things, and your usage had better agree.  I don't know of any
> serious markup languages that do this; the only analogy I can think of
> is Basic, in the form that requires names of string variables to end in
> $ and so on.

Ick.  This is like saying: "You can't define your classes to your compiler,
it just *knows*!"

> Cover, Grosso, and Bullard seem to be arguing against (d), but I'm not
> sure whom they are arguing against.  My own reading of the goals
> document is that (b) or (c) should apply -- or at least, that (a) is not
> what's wanted here.  I don't think anyone is actually in favor of (d),
> and if the current phrasing of the goals statement gives readers that
> impression, then it needs to change.  Can someone who finds the current
> phrasing confusing suggest a less confusing alternative?

What I don't understand is why (a) is not wanted.  What is the big deal
about DTDs?  A document type is a contract between the application
(including a browser) and the document.  All bets are off if you don't
conform to the DTD.  It might be ok, and then again, it might not.

Having worked a great deal in financial and legal printing, I can say
that validation is quite important.  Without a DTD, you can't even assume
the document was authored correctly, let alone that it is legally compliant.
For legal-related publishing systems, DTDs are *necessary*.  I would like
to think that XML is going to be able to be used for legal documents.

Since a browser need not be a validating parser, the DTD can serve as
a rule for whether or not a particular element is empty and what 
attributes are implied, etc.  
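
To make that concrete, here is a minimal sketch in Python of a
non-validating reader.  Everything in it is an illustration of mine: the
build_tree function, the toy tag pattern, and the idea of passing the set
of EMPTY element names in by hand are assumptions standing in for
whatever a real parser would learn from the DTD.  It is not an SGML
parser; it only shows that the same markup yields a different tree
depending on whether the reader knows which elements are EMPTY.

```python
import re

# Toy tag pattern: start or end tag, attributes ignored.  Not real SGML syntax.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def build_tree(text, empty_elements=frozenset()):
    """Build nested [name, children] lists from tagged text.

    empty_elements is a hypothetical stand-in for the EMPTY declarations
    a reader would otherwise take from the DTD.
    """
    root = ["#root", []]
    stack = [root]
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Character data between tags goes to the currently open element.
            stack[-1][1].append(text[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if closing:
            # Close elements until the one named by the end tag is closed.
            while len(stack) > 1:
                if stack.pop()[0] == name:
                    break
        else:
            node = [name, []]
            stack[-1][1].append(node)
            if name not in empty_elements:
                # Only elements that can have content stay open.
                stack.append(node)
    if pos < len(text):
        stack[-1][1].append(text[pos:])
    return root

doc = "<p>one<br>two</p>"
with_dtd = build_tree(doc, empty_elements={"br"})
without_dtd = build_tree(doc)
```

With the EMPTY information, "two" is a sibling of br inside p.  Without
it, "two" ends up as the content of br, which is exactly the kind of
misreading the declarations prevent.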

I just don't see the need to say that we don't want definitions--or even
further, to say that we don't need validation.
 
It is very interesting to see HTML authors realize the benefits of
validation.  Once they know that their document is conformant, they can
proceed to verify that it looks correct.  If a valid document is
not interpreted correctly by tool X, tool X is broken or misconfigured.

Obviously, I'm in favor of (a).

> A validating editor just doesn't belong to the class of applications for
> which declarations are inessential.  When presented with an XML document
> which has no DTD, it might (a) warn about missing declarations, (b)
> silently assume <!ELEMENT foo - - ANY>, or (c) something else entirely.

I'd be happy with (a) if at all possible, and some default mode if
declarations are not possible.  The problem is that this allows applications
to never check validity.  Then we are back to square one in that
someone can just toss in some element, however they please, and expect
it to work.
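
A rough sketch of the behaviour I have in mind, again in Python and
again entirely hypothetical: the check function and the "declarations"
mapping (element name to the set of child names allowed inside it) are a
drastic simplification of my own and not real DTD content models.  The
point is only that a tool can warn when it has no declarations and can
catch the tossed-in element when it does.

```python
def check(tree, declarations=None):
    """Return a list of problems found in tree.

    tree is (name, [children]) where children are strings or nested tuples.
    declarations maps an element name to the set of child element names
    allowed inside it (an assumed, simplified stand-in for a DTD).
    """
    if declarations is None:
        # Default mode: say so loudly instead of silently accepting anything.
        return ["warning: no declarations supplied; document not validated"]

    problems = []

    def visit(node):
        name, children = node
        allowed = declarations.get(name)
        if allowed is None:
            problems.append(f"undeclared element: {name}")
        for child in children:
            if isinstance(child, tuple):
                if allowed is not None and child[0] not in allowed:
                    problems.append(f"{child[0]} not allowed inside {name}")
                visit(child)

    visit(tree)
    return problems

declarations = {"memo": {"to", "body"}, "to": set(), "body": set()}
doc = ("memo", [("to", ["Alex"]), ("blink", ["hi"])])

unchecked = check(doc)
checked = check(doc, declarations)
```

Here "checked" reports the blink element twice over (not allowed inside
memo, and undeclared), while "unchecked" carries only the warning that
nothing was validated.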

We all have to have rules to play by.  The question is who is in control
of those rules.  The point of SGML was to put that control back into the 
hands of the user and producer of the information, not in the hands of
the software vendor.  XML should have that goal as well.

DTDs are a necessary component.

BTW, why call XML XML?  Why not "conventions for SGML on the Web", etc.?  Do
we really want to create "Yet Another Markup Language" (YAML) and go against
ISO standards?  I realize that there are problems with *some* parts of SGML.

We should really try to define a set of conventions that work within
ISO standard SGML and work with SGML Open and the ISO working group to
allow SGML to adjust to fulfill needs that fall outside of ISO 8879.

Let's get on to the more important issues:

   * Use of HyTime for hyperlinking.
   * Formal System Identifiers and Entity Management
   * Use of DSSSL for active applications  (e.g. DSSSL transformations
     can specify active transformation of documents in the client's
     browser.  Imagine CGIs becoming self-transforming documents).
   * Transportation of SGML, entities, catalogs, DSSSL in a sane fashion.

And if you really want something to think about:

   * DSSSL defines a standard post-parsed form for a document--a grove.
     Since this is standard, it should be possible to define a linearization
     of a grove into a content type such that an application has to make
     no decisions upon reading the data.  It just re-formulates the 
     grove into a tree-like construct (or whatever).  No DTD here!
     No Parsing Here!   ...hmmmm, sounds like BSGML!
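
For the curious, the linearization idea can be sketched in a few lines
of Python.  This is only a toy under my own assumptions: the record
shapes ("E", name, child-count) and ("T", text) and the linearize and
rebuild functions are invented for illustration and are not the grove
model DSSSL defines.  What it shows is that once every element record
carries its own child count, the reader rebuilds the tree mechanically,
with no DTD and no markup decisions.

```python
def linearize(node, out=None):
    """Flatten a (name, [children]) tree into a list of records.

    Element records are ("E", name, number_of_children); text records are
    ("T", text).  This record format is an assumption made for the sketch.
    """
    out = [] if out is None else out
    if isinstance(node, str):
        out.append(("T", node))
    else:
        name, children = node
        out.append(("E", name, len(children)))
        for child in children:
            linearize(child, out)
    return out

def rebuild(records):
    """Rebuild the tree from the records; no declarations are consulted."""
    it = iter(records)

    def read():
        rec = next(it)
        if rec[0] == "T":
            return rec[1]
        _, name, count = rec
        # The count in the record says exactly how many children follow.
        return (name, [read() for _ in range(count)])

    return read()

tree = ("p", ["one", ("br", []), "two"])
records = linearize(tree)
round_trip = rebuild(records)
```

Note that the empty br element needs no declaration here: its record
says it has zero children, so the question never comes up.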

A wise man once said:  A hard problem is always a hard problem.

Generic and extendible markup for use in distributed systems is a hard
problem.

==============================================================================
R. Alexander Milowski     http://www.copsol.com/   alex@copsol.com
Copernican Solutions Incorporated                  (612) 379 - 3608

Received on Tuesday, 17 September 1996 23:45:16 UTC