Re: marked sections from Steven J. DeRose on 1996-09-16 (w3c-sgml-wg@w3.org from September 1996)

From: Steven J. DeRose <sjd@ebt.com>
Date: Mon, 16 Sep 1996 14:24:29 -0400
To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <2.2.32.19960916182429.00977258@kirk.ebt.com>

At 02:51 PM 09/13/96 CDT, Michael Sperberg-McQueen wrote:
...
>It seems essential that we have ways of representing delimiter
>strings in XML without having them parsed as delimiters.  Off the top

<summary source=cmsmcq resp=sjd>
  1 CDATA marked sections for examples: <![ CDATA [ &hi; <p> ]]>
  2 Entity references: &amp;hi; &lt;p>
  3 Empty comments: <<!>P> or <<?>P> or <&nil;P> or <<![IGNORE[]]>P>
  4 CDATA or RCDATA elements:  <XMP> <P> </XMP>
  5 External CDATA, SDATA, or NDATA entities: &my-escaped-phrase;
  6 Shortrefs:  \&hi; \<P>
</summary>

>Are there other techniques people have seen?

I haven't seen 6 actually used, though it has some nice features. Doing this
of course does not mean XML needs to support SHORTREF; XML merely needs to
define "\<" and "\&" in the XML grammar; SGML could then achieve
compatibility by using SHORTREF in its domain.

Another option is to set MSSCHAR in your SGML declaration, say, to '\'. It
works basically like backslash in C: the following character is not
recognized as starting any delimiter. Unfortunately, I have been given to
understand that it differs in one crucial respect: the backslash (or other
MSSCHAR character) remains in the parsed result as a data character. This is
not so bad as long as you can configure your downstream processor
(browser/indexer/etc) to get rid of it later. 
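
A minimal sketch, in Python, of the kind of downstream cleanup meant here.
The function name strip_suppress_char and the choice of backslash are
illustrative assumptions, not part of any SGML tool; the only thing assumed
about the parser is that it hands the MSSCHAR character through as data.

```python
def strip_suppress_char(data: str, msschar: str = "\\") -> str:
    """Drop each suppress character and keep the character it protected.

    Hypothetical downstream filter: the parser is assumed to have left
    the MSSCHAR character in the data stream.
    """
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == msschar and i + 1 < len(data):
            # Keep the protected character, discard the suppress character.
            out.append(data[i + 1])
            i += 2
        else:
            # Ordinary data (or a trailing suppress character with nothing
            # after it, which is passed through unchanged).
            out.append(ch)
            i += 1
    return "".join(out)


print(strip_suppress_char(r"\<P> and AT\&T"))  # <P> and AT&T
```

A doubled backslash comes out as a single literal backslash, which matches
the C-like behavior described above.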

On the other hand, the fact that < and </ are not recognized before a
whitespace character (due to contextual constraints) means that you can
just insert a blank space after them to escape them, and have a formatting
option (or requirement) that a single space is always discarded after < or
</. This would live outside the domain of the parser/formal language
entirely, but would be pretty trivial, and pretty easy to explain.
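
The formatting rule could be sketched roughly as follows in Python. The
name discard_escape_space is made up for illustration, and the sketch
assumes the rule is applied to character data after parsing, by the
formatter rather than the parser.

```python
import re

# One space directly after "<" or "</" is treated as an escape marker.
_ESCAPE_SPACE = re.compile(r"(</?) ")


def discard_escape_space(data: str) -> str:
    """Remove exactly one space following each "<" or "</" in the data."""
    return _ESCAPE_SPACE.sub(r"\1", data)


print(discard_escape_space("Use < P> to open and </ P> to close."))
# Use <P> to open and </P> to close.
```

Only one space is removed, so an author who really wants "<" followed by
a space in the output would type two.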

>
>C.  Possible Approaches
>
>What are our options for XML?
>
>We could keep marked sections on the grounds that they are useful and
>not too hard to parse.  We could simplify the parsing a bit, e.g. by
>insisting that CDATA and RCDATA be literals, not entity references,
>-- or do people switch between CDATA and IGNORE?!

I hope not: switching between CDATA and IGNORE invites disaster, because
IGNORE marked sections can nest but CDATA sections can't. Thus, if
you set %flag to IGNORE, this works:

   <![ %flag; [ 
      ... 
      <![ IGNORE [ hello ]]>
      there AT&T
   ]]>

but if you change %flag to CDATA, the company name will produce a syntax
error (or reference the "T" entity, or the #DEFAULT entity...). If no entity
reference happens to be around, then SGML recognizes the (unmatched) ]]> on
the last line, but it is not considered an error that it was found with no
marked section still open for it to close, so it is effectively ignored.
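
The difference can be seen with a rough Python sketch of how the end of a
marked section is found. The function marked_section_end is hypothetical,
and it assumes that only <![ and ]]> are recognized inside IGNORE and only
]]> inside CDATA; it is not a full SGML parser and does no entity handling.

```python
def marked_section_end(text: str, body_start: int, keyword: str) -> int:
    """Return the index just past the ]]> that closes the section, or -1.

    body_start is the index of the first character after the "[" that
    follows the status keyword.
    """
    depth = 1
    i = body_start
    while i < len(text):
        if keyword == "IGNORE" and text.startswith("<![", i):
            # Nested marked sections are counted only when ignoring.
            depth += 1
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    return -1


SAMPLE = "<![ %flag; [ ... <![ IGNORE [ hello ]]> there AT&T ]]>"
BODY = SAMPLE.index("[", 3) + 1  # just past the "[" after the keyword

for flag in ("IGNORE", "CDATA"):
    end = marked_section_end(SAMPLE, BODY, flag)
    print(flag, repr(SAMPLE[end:]))
# IGNORE ''
# CDATA ' there AT&T ]]>'
```

With CDATA the section ends at the first ]]>, so " there AT&T ]]>" falls
outside it and is parsed as ordinary content.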

Such behaviors also affect the "5-line Perl script writer" Jon has mentioned....

>
>If we want to lose marked sections, we need to say how to get the
>required function.  For CDATA sections, I think either of techniques
>2 or 3 above would work, though I suspect someone will want to drop
>the empty comment from the language, leaving us with technique 2 (sigh).

I think I vote for the space convention or the shortrefs. But I'm not sure.

Steve

Received on Monday, 16 September 1996 14:26:32 UTC