- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Wed, 29 Jan 97 10:22:18 CST
- To: W3C SGML Working Group <w3c-sgml-wg@www10.w3.org>
On Wed, 29 Jan 1997 10:26:52 -0500 Gavin Nicol said:
>Michael has the rule for comments:
>
>  "<!--*"([-*]|("-"("*""-"?)*[-*])|(("*""-"?)+[-*]))*("*"|"-*")+"-->"
>
>using flex as a test lexer, I tried the following successfully.
>
>  "<!--*"([^*]|("*"[^-])|("*-"[^-])|("*--"[^>]))*"*-->"
>
>I used the following as test data:

Some of which should pass through OK, and some of which should raise errors.

>    <!--**-->
>    <!--* I am a comment *-->

OK.

>    <!--* * *- *-- *-->
>    <!--* * -* --* --*> *-->
>    <!--*
>      -- This is a big ugly comment --
>      *-- This is a big ugly comment --*
>      --* This is a big ugly comment *--
>      <--* This is a big ugly comment *--->
>      <--* This is a big ugly comment *->
>     *-->

All errors: '--' is prohibited within XML comments, for compatibility with SGML. Even after the TC passes, which will (in its current form) allow for distinct como and comc delimiters (--* and *--), the string *-- will, as Chris Maden points out, remain illegal for compatibility reasons.

Terry Allen asks for a rationale. What follows is mine; it's not inconsistent with what others on the ERB think, that I know of, but it goes well beyond what I know them to believe.

I think it's fair to say the arguments against the change were primarily of the "why change it? it's not broken" variety: any change away from the reference concrete syntax may confuse people who are used to the RCS. No arguments were made in favor of the '--' toggle on its merits.

The arguments in favor of changing the comment syntax were that a prohibition on '--' in comments which are *not* ended by '--' makes no sense and *is* clearly broken as designed. And since it's broken solely for reasons of compatibility, the question arises "What changes to SGML might remove this problem?"
The discussion of this question seemed to make clear that the preferred solution would be to distinguish comment-open (como) from comment-close (comc) delimiters; this would enable XML to have distinct como and comc delimiters, with comc something other than '--'. Lee Quin's suggestion to use --* and *-- as como and comc seemed particularly good, since it can be (mis)interpreted in terms of the RCS without catastrophic error.

So the TC which is expected to go to the appropriate ISO body calls for the introduction of como and comc delimiters.

When the TC passes, we want to be able to remove the prohibition on '--' in comments. This is possible only if XML defines strings which can be used, in Full SGML systems, as the como and comc delimiters. So we need to do that, and we need to do it now, not after the TC passes.

Of the various candidates for como and comc, --* and *-- won hands down: they are symmetrical, usable now, and not radically different from current practice.

After the TC, XML can remove the prohibition on --, though not the prohibition on *--. And there will still be a minor inconvenience in that comments will still be unable to nest, though there is no reason, given distinct open and close delimiters, to forbid nesting of comments, and in practice it would be very convenient to be able to comment out a block of text without having to check first to see if it already contained any comments. (An IGNORE marked section can be used this way, but I persist in the belief that marked sections are not comments, and using them this way constitutes markup abuse.)

Finally, it should be pointed out that although the deterministic regular expression for comments is complicated, the comment structure itself is very simple and easy to describe:

  Comments begin with '<!--*' and end with '*-->'.
  Within a comment, anything is legal except '--'.
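The two-sentence rule just stated can be sketched as a plain string check. This is only an illustration in Python, not the deterministic expression from the draft and not a lexer; the names OPEN, CLOSE and is_legal_comment are hypothetical and appear nowhere in the XML draft.

```python
# Illustrative sketch of the prose rule: a comment begins with '<!--*',
# ends with '*-->', and contains no '--' in between.
# OPEN, CLOSE and is_legal_comment are hypothetical names.
OPEN = "<!--*"
CLOSE = "*-->"


def is_legal_comment(text: str) -> bool:
    """Return True if text is a whole comment under the rule above."""
    # The two delimiters must not overlap, so '<!--*-->' is rejected.
    if len(text) < len(OPEN) + len(CLOSE):
        return False
    if not (text.startswith(OPEN) and text.endswith(CLOSE)):
        return False
    # Everything strictly between the delimiters is the comment body.
    body = text[len(OPEN):-len(CLOSE)]
    return "--" not in body


if __name__ == "__main__":
    # Samples taken from Gavin's test data quoted earlier in this message.
    for sample in ("<!--**-->",
                   "<!--* I am a comment *-->",
                   "<!--* * *- *-- *-->",
                   "<!--* * -* --* --*> *-->"):
        print(is_legal_comment(sample), sample)
```

Because the open delimiter ends in '*' and the close delimiter begins with '*', a '--' can never straddle a delimiter and the body, which is why checking the body alone is enough in this sketch.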
After the TC, these rules can change (and will, unless the ERB rescinds its decision of earlier this month) to:

  Comments begin with '<!--*' and end with '*-->'.
  Within a comment, anything is legal except '*--'.

And at that point, the regular expression in the document can presumably change to something more like Gavin's proposal above (though not *exactly* the same, since Gavin allows '*--' if not followed by '>', which XML will have to continue to forbid, for compatibility reasons). I make the new rule out to be:

  "<!--*"([^*]|("*"[^-])|("*-"[^-]))*"*-->"

Since *-- is much rarer in normal text than --, which in many people's ingrained habit is the expression of an em-dash, XML will be made perceptibly more user-friendly when the change is made. To make the change, however, it's necessary to change the XML comment delimiters.

This was enough to persuade the ERB, including myself, that the change was a necessary and good idea. I hope it persuades you, too.

-Michael
Received on Wednesday, 29 January 1997 12:08:33 UTC