Re: Production 21 (and others)

On Wed, 29 Jan 1997 22:34:33 -0500 Liam Quin said:
>Quoth Gavin:
>>    "<!--*"([^*]|("*"[^-])|("*-"[^-])|("*--"[^>]))*"*-->"
>
>But I think there is a flaw, as this will not match
><!--**-*-->
> ...

As Tim Bray said a few days ago in private mail,

   Oooooooh!  My brain hurts... make it stop.

Liam Quin's note made me look more carefully at the revised regular
expression that Gavin Nicol and I had both independently come up with,
and it too has problems.  After a few more minutes (an hour, actually,
all told) of struggle, I now have the following two regular expressions.
Those with the Itch, please check these out!  If it's so easy to get
this wrong, we need as many checkers as possible.

1 an expression for XML as of now, forbidding '--' within comments:

 [21] Comment :=
 "<!--*"([µ-*]|("-"("*""-"?)*[µ-*])|(("*""-"?)+[µ-*]))*("*"|"-*")+"-->"

2 an expression for the XML rule of the future, forbidding *-- but
not -- within comments:

 [21] Comment-of-the-future :=
 "<!--*"([µ*]|(("*""-"?)+[µ-*]))*("*"|"-*")+"-->"

Since I'm not sure what the EBCDIC/ASCII translation is going to do to
these, here's another version; I hope the circumflex is right in at
least one of them:

 [21] Comment :=
 "<!--*"([^-*]|("-"("*""-"?)*[^-*])|(("*""-"?)+[^-*]))*("*"|"-*")+"-->"
 [21] Comment-of-the-future :=
 "<!--*"([^*]|(("*""-"?)+[^-*]))*("*"|"-*")+"-->"

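For anyone who wants to check these mechanically, here is a minimal
sketch that transcribes Gavin's expression (quoted at the top) and the
circumflex version of rule 1 into Python's `re` syntax.  Python is an
assumption here, not the notation of the draft, and the names GAVIN,
COMMENT and is_match are made up for the illustration.  Quoted literals
such as "*" become escaped characters, and fullmatch stands in for
recognizing one whole token.

```python
import re

# Gavin's expression as quoted above, transcribed into Python `re` syntax.
GAVIN = re.compile(r"<!--\*(?:[^*]|\*[^-]|\*-[^-]|\*--[^>])*\*-->")

# Rule 1 (circumflex version): no '--' inside the comment body.
COMMENT = re.compile(
    r"<!--\*(?:[^-*]|-(?:\*-?)*[^-*]|(?:\*-?)+[^-*])*(?:\*|-\*)+-->"
)

def is_match(pattern, text):
    """Return True if the whole of `text` is matched by `pattern`."""
    return pattern.fullmatch(text) is not None

liam = "<!--**-*-->"  # the string Liam Quin pointed out

print(is_match(GAVIN, liam))    # Gavin's expression
print(is_match(COMMENT, liam))  # rule 1
```

If the transcription is faithful, this should print False and then
True: Gavin's expression rejects Liam's string, and rule 1 accepts it.
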
In case it's helpful, here is my derivation for the second rule,
following the same logic as my derivation of the first rule, posted
a day or so ago.

First, define Misc as either a single character other than a star, or
a run of stars, each optionally followed by one hyphen, followed by a
character that is neither a hyphen nor a star; either way, it contains
no '*--':

Misc     [^*]|(("*""-"?)+[^-*])

And Star as any string of hyphens and stars ending in a star and
containing no two adjacent hyphens (and hence no '*--'):

Star     ("*"|"-*")+

Then a comment is a sequence of:
  - the start delimiter '<!--*'
  - any number of Misc strings
  - a Star string to start the final delimiter
  - '-->'

or
 [21] Comment := "<!--*"({Misc})*{Star}"-->"
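
The composition from Misc and Star can be sketched in Python's `re`
syntax as follows.  This is only an illustration: Python is an
assumption, and the names MISC, STAR, FUTURE_COMMENT and EXPANDED are
hypothetical.  EXPANDED is the circumflex version of
Comment-of-the-future given earlier, transcribed by hand, so the first
check only confirms that the pieces concatenate to the same pattern.

```python
import re

# Named pieces from the derivation, in Python `re` syntax.
MISC = r"(?:[^*]|(?:\*-?)+[^-*])"   # Misc
STAR = r"(?:\*|-\*)+"               # Star

# Comment := "<!--*" ({Misc})* {Star} "-->"
FUTURE_COMMENT = re.compile(r"<!--\*" + MISC + "*" + STAR + "-->")

# The same rule written out in full, for comparison.
EXPANDED = r"<!--\*(?:[^*]|(?:\*-?)+[^-*])*(?:\*|-\*)+-->"

print(FUTURE_COMMENT.pattern == EXPANDED)
print(bool(FUTURE_COMMENT.fullmatch("<!--* a -- b *-->")))   # '--' in body
print(bool(FUTURE_COMMENT.fullmatch("<!--* a *-- b *-->")))  # '*--' in body
```

The second and third lines exercise the intent of the rule: a bare
'--' in the body should be accepted, and '*--' should be rejected.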

Expanding out, we get

 [21] Comment :=
 "<!--*"([^*]|(("*""-"?)+[^-*]))*("*"|"-*")+"-->"


Test cases:  all these comments are legal in both rules.

<!--* this is a comment *-->
<!--* this comment - how odd - has two single hyphens        *-->
<!--* the next has a whole series of hyphens and blanks      *-->
<!--* - - - - - - - - - - - - - - - - - - - - - - - - - - -  *-->
<!--* *- *- *- *- *- *- *- *- *- *- *- *- *- *- *- *- *- *-  *-->
<!--* ********************************************************-->
<!--* -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-->
<!--* <p>This is a commented <q>quote</q> in a paragraph.</p> *-->
<!--* This is a comment with a PI <?MyApp bg:RED fg:black    *-->
<!--* date > today                                           *-->
<!--* This comment has ?&gt; a pseudo-close in it.           *-->

These are illegal under both rules:

<!--* Comments cannot nest, so this comment
does not succeed in commenting out the entire
GREETING element:
<greeting>
Hello, world!
<!--* Comments can contain single hyphens - like this *-->
</greeting>
*-->

<! --* This is a bad comment. *-->
<!--* This is bad XML, though legal SGML. *-- >

And this should be legal under the future rule, but illegal for now.

<!--* Comments cannot contain double hyphens -- this is illegal --
      for now *-->
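
For checkers who would rather run the cases than read them, here is a
small harness, again assuming Python's `re` as a stand-in for the
lex-style notation.  The names NOW, FUTURE, verdict and CASES are
hypothetical, and several of the long test lines are shortened, so
this is a sample of the cases above rather than the full list.

```python
import re

# Both rules (circumflex versions) in Python `re` syntax.
NOW = re.compile(
    r"<!--\*(?:[^-*]|-(?:\*-?)*[^-*]|(?:\*-?)+[^-*])*(?:\*|-\*)+-->"
)
FUTURE = re.compile(r"<!--\*(?:[^*]|(?:\*-?)+[^-*])*(?:\*|-\*)+-->")

def verdict(text):
    """Return (legal now, legal in the future) for a whole candidate."""
    return (NOW.fullmatch(text) is not None,
            FUTURE.fullmatch(text) is not None)

NESTED = (
    "<!--* Comments cannot nest, so this comment\n"
    "does not succeed in commenting out the entire\n"
    "GREETING element:\n"
    "<greeting>\n"
    "Hello, world!\n"
    "<!--* Comments can contain single hyphens - like this *-->\n"
    "</greeting>\n"
    "*-->"
)

# (candidate, expected verdict) pairs; the expectations are the ones
# stated in the message, and some candidates are shortened.
CASES = [
    ("<!--* this is a comment *-->", (True, True)),
    ("<!--* - - - - *-->", (True, True)),
    ("<!--* *- *- *-  *-->", (True, True)),
    ("<!--* ****-->", (True, True)),
    ("<!--* -*-*-*-->", (True, True)),
    (NESTED, (False, False)),
    ("<! --* This is a bad comment. *-->", (False, False)),
    ("<!--* This is bad XML, though legal SGML. *-- >", (False, False)),
    ("<!--* Comments cannot contain double hyphens -- this is illegal --\n"
     "      for now *-->", (False, True)),
]

for text, expected in CASES:
    status = "ok  " if verdict(text) == expected else "FAIL"
    print(status, repr(text[:40]))
```

Each candidate is tested as a whole string, so the nested example
counts as illegal because the text from the first '<!--*' to the last
'*-->' is not a single comment.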


-C. M. Sperberg-McQueen

Received on Thursday, 30 January 1997 12:17:59 UTC