- From: Steven J. DeRose <sjd@ebt.com>
- Date: Mon, 16 Sep 1996 14:24:29 -0400
- To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
At 02:51 PM 09/13/96 CDT, Michael Sperberg-McQueen wrote:
...
>It seems essential that we have ways of representing delimiter
>strings in XML without having them parsed as delimiters. Off the top

<summary source=cmsmcq resp=sjd>
1 CDATA marked sections for examples: <![ CDATA [ &hi; <p> ]]>
2 Entity references: &hi; <p>
3 Empty comments: <<!>P> or <<?>P> or <&nil;P> or <<![IGNORE[]]>P>
4 CDATA or RCDATA elements: <XMP> <P> </XMP>
5 External CDATA, SDATA, or NDATA entities: &my-escaped-phrase;
6 Shortrefs: \&ht; \<P>
</summary>

>Are there other techniques people have seen?

I haven't seen 6 actually used, though it has some nice features. Doing
this of course does not mean XML needs to support SHORTREF; XML merely
needs to define "\<" and "\&" in the XML grammar; SGML could then
achieve compatibility by using SHORTREF in its domain.

Another option is to set MSSCHAR in your SGML declaration, say, to '\'.
It works basically like backslash in C: the following character is not
recognized as starting any delimiter. Unfortunately, I have been given
to understand that it differs in one crucial respect: the backslash (or
other MSSCHAR character) remains in the parsed result as a data
character. This is not so bad as long as you can configure your
downstream processor (browser/indexer/etc) to get rid of it later.

On the other hand, the fact that < and </ are not recognized before a
whitespace character (due to contextual constraints) means that you can
just insert a blank space after them to escape them, and have a
formatting option (or requirement) that a single space is always
discarded after < or </. This would live outside the domain of the
parser/formal language entirely, but would be pretty trivial, and
pretty easy to explain.

>
>C. Possible Approaches
>
>What are our options for XML?
>
>We could keep marked sections on the grounds that they are useful and
>not too hard to parse. We could simplify the parsing a bit, e.g. by
>insisting that CDATA and RCDATA be literals, not entity references,
>-- or do people switch between CDATA and IGNORE?!

I hope not: switching between CDATA and IGNORE invites disaster,
because IGNORE marked sections can nest but CDATA sections can't. Thus,
if you set %flag to IGNORE, this works:

<![ %flag; [
...
<![ IGNORE [ hello ]]>
there AT&T
]]>

but if you change %flag to CDATA, the company name will produce a
syntax error (or reference the "T" entity, or the #DEFAULT entity...).
If no entity reference happens to be around, then SGML recognizes the
(unmatched) ]]> on the last line, but it is not considered an error
that it was found with no marked section still open for it to close,
so it is effectively ignored. Such behaviors also affect the "5-line
Perl script writer" Jon has mentioned....

>
>If we want to lose marked sections, we need to say how to get the
>required function. For CDATA sections, I think either of techniques
>2 or 3 above would work, though I suspect someone will want to drop
>the empty comment from the language, leaving us with technique 2 (sigh).

I think I vote for the space convention or the shortrefs. But I'm not
sure.

Steve
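
The two downstream clean-up steps described above (removing a leftover MSSCHAR backslash, and discarding the single escape space after < or </) can be sketched roughly as follows. This is only an illustration, not part of any SGML or XML tool: the function names drop_msschar and drop_escape_space are hypothetical, and both functions assume they receive character data that the parser has already separated from real markup.

```python
import re

# Hypothetical helper: assumes MSSCHAR was declared as '\' and that the
# parser left each escaping backslash in the data. Removes the backslash
# and keeps the character it protected; a doubled backslash yields one.
def drop_msschar(data: str) -> str:
    return re.sub(r"\\(.)", r"\1", data, flags=re.DOTALL)

# Hypothetical helper for the space convention: exactly one space is
# discarded after "<" or "</"; any further spaces are kept as data.
def drop_escape_space(data: str) -> str:
    return re.sub(r"(</?) ", r"\1", data)
```

Under the space convention as sketched here, an author who really wants a visible space after < has to type two, since the first one is always consumed.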
Received on Monday, 16 September 1996 14:26:32 UTC