Re: Production 21 (and others)

On Wed, 29 Jan 1997 10:26:52 -0500 Gavin Nicol said:
>Michael has the rule for comments:
>
>   "<!--*"([^-*]|("-"("*""-"?)*[^-*])|(("*""-"?)+[^-*]))*("*"|"-*")+"-->"
>
>using flex as a test lexer, I tried the following successfully.
>
>   "<!--*"([^*]|("*"[^-])|("*-"[^-])|("*--"[^>]))*"*-->"
>
>I used the following as test data:

Some of which should pass through OK, and some of which should raise
errors.

>  <!--**-->
>  <!--* I am a comment *-->

OK.

>  <!--* * *- *-- *-->
>  <!--* * -* --* --*> *-->
>  <!--*
>     -- This is a big ugly comment --
>     *-- This is a big ugly comment --*
>     --* This is a big ugly comment *--
>     <--* This is a big ugly comment *--->
>     <--* This is a big ugly comment *->
>  *-->

All errors:  '--' is prohibited within XML comments, for compatibility
with SGML, and even after the TC passes, which will (in its current
form) allow for distinct como and comc delimiters (--* and *--), the
string *-- will, as Chris Maden points out, remain illegal for
compatibility reasons.

Terry Allen asks for a rationale.  What follows is mine; it's not
inconsistent with what others on the ERB think, that I know of, but it
goes well beyond what I know them to believe.

I think it's fair to say the arguments against the change were primarily
of the "why change it?  it's not broken" variety:  any change away from
the reference concrete syntax may confuse people who are used to the
RCS.  No arguments were made in favor of the '--' toggle on its merits.

The arguments in favor of changing the comment syntax were that a
prohibition on '--' in comments which are *not* ended by '--' makes no
sense and *is* clearly broken as designed.  And since it's broken solely
for reasons of compatibility, the question arises "What changes to SGML
might remove this problem?"  The discussion of this question seemed to
make clear that the preferred solution would be to distinguish
comment-open (como) from comment-close (comc) delimiters; this would
enable XML to have distinct como and comc delimiters, with comc
something other than '--'.  Lee Quin's suggestion to use --* and *-- as
como and comc seemed particularly good, since it can be
(mis)interpreted in terms of the RCS without catastrophic error.  So the
TC which is expected to go to the appropriate ISO body calls for the
introduction of como and comc delimiters.

When the TC passes, we want to be able to remove the prohibition on '--'
in comments.  This is possible only if XML defines strings which can be
used, in Full SGML systems, as the como and comc delimiters.  So we need
to do that, and we need to do it now, not after the TC passes.  Of the
various candidates for como and comc, --* and *-- won hands down:  they
are symmetrical, usable now, and not radically different from current
practice.

After the TC, XML can remove the prohibition on --, though not the
prohibition on *--.  And there will still be a minor inconvenience in
that comments will still be unable to nest, though there is no reason,
given distinct open and close delimiters, to forbid nesting of comments,
and in practice it would be very convenient to be able to comment out a
block of text without having to check first to see if it already
contained any comments.  (An IGNORE marked section can be used this way,
but I persist in the belief that marked sections are not comments, and
using them this way constitutes markup abuse.)

Finally, it should be pointed out that although the deterministic
regular expression for comments is complicated, the comment structure
itself is very simple and easy to describe:

  Comments begin with '<!--*' and end with '*-->'.
  Within a comment, anything is legal except '--'.

After the TC, these rules can change (and will, unless the ERB rescinds
its decision of earlier this month) to:

  Comments begin with '<!--*' and end with '*-->'.
  Within a comment, anything is legal except '*--'.
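
As a rough illustration of how little the two prose rule sets demand, they
can be checked directly without any regular expression.  This is only a
sketch: the name is_comment and its forbidden parameter are invented here,
not taken from the draft, and it assumes the opening and closing delimiters
may not share their '*' (so '<!--**-->' is the shortest comment).

```python
OPEN, CLOSE = "<!--*", "*-->"

def is_comment(text, forbidden="--"):
    """Return True if text is one whole comment under the prose rules.

    forbidden="--"  models the rule before the TC;
    forbidden="*--" models the rule after the TC.
    """
    # Assumption: the delimiters may not overlap, so '<!--*-->' is too short.
    if len(text) < len(OPEN) + len(CLOSE):
        return False
    if not (text.startswith(OPEN) and text.endswith(CLOSE)):
        return False
    # Everything between the delimiters is the comment body.
    body = text[len(OPEN):-len(CLOSE)]
    return forbidden not in body

print(is_comment("<!--* I am a comment *-->"))
print(is_comment("<!--* * -* --* --*> *-->"))
print(is_comment("<!--* * -* --* --*> *-->", forbidden="*--"))
```

Under these assumptions the second line of Gavin's error cases is rejected
by the current rule but accepted by the post-TC rule, since it contains
'--' but never '*--'.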

And at that point, the regular expression in the document can
presumably change to something more like Gavin's proposal above (though
not *exactly* the same, since Gavin allows '*--' if not followed by '>',
which XML will have to continue to forbid, for compatibility reasons).
I make the new rule out to be:

    "<!--*"([^*]|("*"[^-])|("*-"[^-]))*"*-->"
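
For anyone who wants to experiment with the expression above outside flex,
here is a minimal sketch of a literal transcription into Python's re
syntax.  The names POST_TC and matches are invented for the example, and
the whole-string match via fullmatch is an assumption about how the rule
would be applied to an isolated comment.

```python
import re

# Literal transcription of the flex pattern: quoted strings become plain
# text, and '*' is escaped wherever it appears outside a bracket expression.
POST_TC = re.compile(r"<!--\*(?:[^*]|\*[^-]|\*-[^-])*\*-->")

def matches(text):
    """Return True if the whole of text matches the transcribed pattern."""
    return POST_TC.fullmatch(text) is not None

samples = [
    "<!--**-->",
    "<!--* I am a comment *-->",
    "<!--* an em-dash -- like this *-->",
    "<!--* * *- *-- *-->",
    "<!--* a **-- b *-->",
    "<!--* x **-->",
]
for sample in samples:
    print(matches(sample), sample)
```

In this transcription the first three samples match and the fourth is
rejected, as intended.  The last two are worth a look: a doubled '*' is
consumed as a single '*'-plus-one-character step, so '**--' inside a
comment is accepted and a comment ending in '**-->' is rejected.  The
expression may therefore need a little more work before it agrees exactly
with the prose rule.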

Since *-- is much rarer in normal text than --, which in many people's
ingrained habit is the expression of an em-dash, XML will be made
perceptibly more user-friendly when the change is made.  To make the
change, however, it's necessary to change the XML comment delimiters.
This was enough to persuade the ERB, including myself, that the change
was a necessary and good idea.  I hope it persuades you, too.

-Michael

Received on Wednesday, 29 January 1997 12:08:33 UTC