simplifying comments in SGML '97

At SGML '96, there was a fair amount of discussion of aspects of 8879
which had forced XML into uncomfortable (or in some views unnatural)
positions on this issue or that, and some members of WG8 have
indicated an interest in issuing a technical corrigendum to the
standard, to change those aspects of SGML, at least in the cases
where it's clear and widely agreed what should change and how.

In discussing this possibility with members of WG8, the ERB has
encountered a question on which we need the guidance of the WG,
namely, what to do about the syntax of comments in SGML.

The XML spec of 14 November defines as a 'comment' what in 8879 is
called a 'comment declaration'.  To reduce confusion, in what follows
I am going to use the 8879 terminology, not the XML terminology.

The current version of the XML spec differs from 8879 in various ways:

  (a) XML allows comments *only* in comment declarations; 8879
      allows them in other markup declarations as well (though not
      in all locations)
  (b) XML allows exactly one comment in a comment declaration;
      8879 allows zero or more
  (c) XML defines two delimiters '<!--' and '-->' which bound the
      construct in question; 8879 sees three delimiter roles involved
      here, and allows white space in some locations:  XML '<!--'
      corresponds to 'mdo, com' in 8879 (this is split across
      productions 91 and 92, p. 391 of the Handbook) and XML
      '-->' corresponds to 'com, s*, mdc' (again, productions 91-92)

N.B. I believe this list is complete, but I may be wrong.

The main perceived problem with the XML spec is that the comment itself
is barred not only from containing '-->' (its closing delimiter) but
also from containing '--'.  Changing the SGML com delimiter from '--'
to something else (e.g. ';;') was considered; it would change the
forbidden string to a less frequently encountered one, but it would not
eliminate the apparent irrationality and was felt in any case to be
unwise.

A technical corrigendum to 8879 would ideally eliminate this problem
and make it possible for '--' to appear within XML-style comment
declarations; it might also make it possible for SGML parsers to
enforce XML's rules on (a) number of comments within the comment
declaration and (b) absence of white space before the closing mdc.

So far so good.

The problem is that there seem to be at least two ways of approaching
the problem, and it's not clear which is preferable.  Your opinions,
please.


A.  The Simple Comment

This proposal adds a SIMPLEC (simple-comment) optional feature to
the SGML declaration.  If SIMPLEC NO is declared, comments behave
as they do now.  If SIMPLEC YES is declared, comments have an
alternative definition.  The relevant clause might read like this
(thanks to Dave Peterson for the draftsmanship).  I have marked
additions in <add>...</add> and substitutions in <sub>...</sub>:

    10.3 Comment declaration

    <add>[91] comment declaration = normal comment declaration
                                  | simple comment declaration
    </add>

    <sub>[91a]</sub> <add>normal</add> comment declaration =
                     mdo, (comment, (s | comment)*)?, mdc

    [92] comment = com, SGML character*, com

    <add>
    [92a] simple comment declaration =
                     mdo, com, SGML character*, com, mdc
    </add>

    No markup is recognized in a comment, other than the com
    delimiter that terminates it.  <add>No markup is recognized in a
    simple comment declaration other than the com delimiter
    immediately followed by an mdc delimiter that terminates
    it.</add>

    <add>
    NOTES

    1. A com delimiter not followed by an mdc delimiter will be
    recognized in a comment (in a comment declaration or other
    declaration) but not in a simple comment declaration.

    2. The SGML declaration specifies whether normal or simple
    comment declarations are used in a document.  No document may
    use both.
    </add>

Advantages:  captures all of XML's rules except the prohibition on
comments in other markup declarations.  Disadvantages:  not clear
whether the precise mix of simplifications undertaken by XML is of
general enough interest / use to warrant this approach:  would other
application profiles prefer to impose different rules on comments?
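
To make the difference concrete, here is a rough Python sketch of the
two productions in proposal A, written as regular expressions.  It
assumes the delimiters of the reference concrete syntax (mdo '<!',
com '--', mdc '>'), approximates 's' as ordinary white space, and takes
production 92 in its 8879 form (com, SGML character*, com).  The names
NORMAL, SIMPLE, is_normal and is_simple are invented for illustration;
this is not a conforming parser.

```python
import re

# Assumed reference concrete syntax: mdo = '<!', com = '--', mdc = '>'.
# 's' (separator) is approximated as plain white space.
_S = r"[ \t\r\n]"

# [92] comment = com, SGML character*, com
# The first com after the opening com always closes the comment.
_COMMENT = r"--(?:(?!--).)*--"

# [91a] normal comment declaration = mdo, (comment, (s | comment)*)?, mdc
NORMAL = re.compile(rf"<!(?:{_COMMENT}(?:{_S}|{_COMMENT})*)?>", re.DOTALL)

# [92a] simple comment declaration = mdo, com, SGML character*, com, mdc
# Only com immediately followed by mdc ('-->') terminates the declaration.
SIMPLE = re.compile(r"<!--((?:(?!-->).)*)-->", re.DOTALL)


def is_normal(decl: str) -> bool:
    """True if decl is a whole normal comment declaration (SIMPLEC NO)."""
    return NORMAL.fullmatch(decl) is not None


def is_simple(decl: str) -> bool:
    """True if decl is a whole simple comment declaration (SIMPLEC YES)."""
    return SIMPLE.fullmatch(decl) is not None


if __name__ == "__main__":
    for decl in ["<!-- a -->", "<!-- a -- b -->", "<!-- a -- -- b -->",
                 "<!-- a -- >", "<!>"]:
        print(f"{decl!r}: normal={is_normal(decl)} simple={is_simple(decl)}")
```

Under these assumptions '<!-- a -- b -->' is rejected as a normal
comment declaration but accepted as a simple one, while '<!-- a -- >'
and '<!>' go the other way.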


B.  Splitting the com delimiter.

This proposal simply replaces the 8879 com delimiter with a pair of
delimiters, como and comc (comment open and comment close).  Documents
using 8879:1986 syntax have como = comc = com; in the reference
concrete syntax (RCS) that's como = comc = '--'.  XML could allow '--'
within comments by setting como to '--*' and comc to '*--', to retain
the general look and feel of current comments, and still allow '--' in
the comment itself.
Nested comments might also become possible, in SGML (and then in XML),
or not.

Production 91 of 8879 could remain the same as it now is; 92 would
change.  We might have:

  [91] comment declaration = mdo, (comment, (s | comment)*)?, mdc
  [92] comment = como, SGML character*, comc

or (to allow nested comments)

  [92] comment = como, (SGML character | comment)*, comc

Note:  If this is what we propose to WG8, the XML spec should probably
change *now* to use these delimiters, replacing production 21 with

  [21] Comment ::= '<!--*' [^-]* ('-' [^-]+)* '*-->'

Advantages: this seems relatively simple and relatively compatible
with the look and feel of 8879 as a whole.  Disadvantages:  it
doesn't allow the SGML parser to enforce XML's rules.
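
As a rough illustration of proposal B, the following Python sketch
scans a single comment with configurable como and comc delimiters,
optionally allowing nesting.  The function name parse_comment and its
parameters are invented here, and the defaults assume the RCS value
'--' for both delimiters; it is a sketch of the idea, not proposed
wording or a real parser.

```python
def parse_comment(text, como="--", comc="--", nested=False):
    """Scan one comment that starts at text[0].

    Returns (content, end), where end is the index just past comc.
    Raises ValueError if text does not start with como or if the
    comment is never closed.
    """
    if not text.startswith(como):
        raise ValueError("text does not start with como")
    pos = len(como)
    depth = 1
    while pos < len(text):
        # Test comc first, so that como == comc (8879:1986) still closes.
        if text.startswith(comc, pos):
            depth -= 1
            if depth == 0:
                return text[len(como):pos], pos + len(comc)
            pos += len(comc)
        elif nested and text.startswith(como, pos):
            # Nested comment: only meaningful when como != comc.
            depth += 1
            pos += len(como)
        else:
            pos += 1
    raise ValueError("unterminated comment")


if __name__ == "__main__":
    print(parse_comment("-- a --"))
    print(parse_comment("--* a -- b *--", "--*", "*--"))
    print(parse_comment("--* a --* b *-- c *--", "--*", "*--", nested=True))
```

With como = '--*' and comc = '*--', the string '--' inside the comment
is ordinary data.  Nesting can only be distinguished when como and comc
differ, which is why the sketch tests for comc first.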

What does the SGML Work Group think about this problem?

-C. M. Sperberg-McQueen

Received on Wednesday, 11 December 1996 14:55:44 UTC