Re: A7: CDATA, RCDATA, TEMP marked sections? from Paul Prescod on 1996-10-09 (w3c-sgml-wg@w3.org from October 1996)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Tue, 8 Oct 1996 21:13:21 -0400 (EDT)
To: peter@sqwest.bc.ca (Peter Sharpe)
Cc: w3c-sgml-wg@w3.org
Message-Id: <199610090113.VAA27042@calum.csclub.uwaterloo.ca>
> On Oct 4,  5:36pm, Paul Prescod wrote:
> > >A.7 Should XML have CDATA, RCDATA, and TEMP marked sections or not?
> >
> > It would be really handy to have some mechanism to allow arbitrary non-SGML
> > data (in the same character encoding).
> >
> There are several requirements for the mechanism by which the markup is
> escaped:
> 1. It has to be simple and intuitive.
>    I strongly believe that CDATA marked sections violate this requirement.

I have to disagree. The hardwired CDATA MS syntax somebody (Michael?) proposed 
a few days ago is not that unintuitive. I originally had an idea like yours,
but have come to embrace that convention.

>    HTML authors are used to having both syntax and semantics for their
>    markup. If they use SCRIPT, I believe they would naturally expect the
>    "parser" to understand that it should ignore everything until it sees
>    "</SCRIPT>". To have to add additional markup would neither be intuitive
>    nor welcomed.

No, but they are already going to have to learn to "shape up" to move into
the XML world. If you look at it in a certain way, you might consider it
_easier_ for both the author and the parser to have a SINGLE syntax for
turning on and off CDATA content instead of a potentially infinite list
(<SCRIPT>, <CODE>, <STYLE>, ... ). I don't think that hard-coded CDATA
marked sections are harder to understand or to parse.

And what if a user wanted to include some "SGML code" in the middle of 
a paragraph or somewhere else in an element that is not CDATA? I think that 
a modeless, always-available mechanism for marking CDATA content is 
preferable to a DTD-specific one.
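
To illustrate why a single hard-coded syntax is easy on the parser, here is a
minimal sketch in Python. It assumes the delimiters are the literal strings
"<![CDATA[" and "]]>"; the function name split_cdata is hypothetical and is
not taken from any proposal on this list. Note that it needs no DTD and no
list of element names.

```python
CDATA_OPEN = "<![CDATA["   # assumed hard-coded opening delimiter
CDATA_CLOSE = "]]>"        # assumed hard-coded closing delimiter


def split_cdata(text):
    """Split text into (kind, chunk) pairs; kind is 'markup' or 'cdata'."""
    parts = []
    pos = 0
    while True:
        start = text.find(CDATA_OPEN, pos)
        if start == -1:
            # No further marked sections: the rest is ordinary markup.
            if pos < len(text):
                parts.append(("markup", text[pos:]))
            return parts
        if start > pos:
            parts.append(("markup", text[pos:start]))
        body_start = start + len(CDATA_OPEN)
        end = text.find(CDATA_CLOSE, body_start)
        if end == -1:
            raise ValueError("unterminated CDATA marked section")
        # Everything between the delimiters is passed through untouched.
        parts.append(("cdata", text[body_start:end]))
        pos = end + len(CDATA_CLOSE)
```

The same scan works in the middle of a paragraph or inside any element,
which is the "modeless, always-available" property argued for above.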

I also think that documents encoded in this manner are more robust than 
those that depend on CDATA declarations in a DTD that may or may not be
available and may or may not change. I am strongly in favour of 
de-emphasizing DTDs and application conventions in the reliable parsing of 
XML documents, which is why I strongly oppose RE proposals that "leave
it up to the application".

> I do not believe that there is an acceptable solution to these requirements
> using SGML. The choices are very few: CDATA elements, CDATA marked sections
> and "structured comments". CDATA elements fail to hide markup which looks
> like end-tags. CDATA marked sections are too much of a burden. And
> "structured comments"...well, that's the worst kind of hack, in my opinion.
> 
> I do believe that there is a fairly simple solution that would cover almost
> all cases, and the cases it doesn't cover would be obvious to the author:
> Proposal: The only markup which terminates the content of a CDATA element
> is an end-tag that matches the element's start-tag. For example, the only
> markup that would end a SCRIPT element would be "</SCRIPT>".
> 
> In the case where there is no DTD, there either would be no possibility of
> CDATA elements or else there would be some alternate way to indicate the
> content type.
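
The quoted proposal can be sketched roughly as follows, in Python. This is
only an illustration under simplifying assumptions: start-tags carry no
attributes, tag names are matched case-sensitively, and the function name
cdata_element_content is hypothetical. The parser must be told which
element name is CDATA, which is exactly the information a DTD would carry.

```python
def cdata_element_content(text, name):
    """Return the raw content of the first <name> element in text.

    Only the matching end-tag </name> terminates the content; any other
    markup-like text inside is returned untouched.
    """
    open_tag = "<" + name + ">"
    close_tag = "</" + name + ">"
    start = text.find(open_tag)
    if start == -1:
        return None  # element not present
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        raise ValueError("missing " + close_tag)
    return text[body_start:end]
```

Other end-tags inside the content are harmless, but content that itself
contains the matching end-tag is cut short at that point, which is the case
the proposal leaves to the author.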

I think that our design should presume the non-availability of a DTD as the 
"norm". So we should specify that "alternate way", and I suspect it will
turn out to be hard-coded CDATA marked sections.

I had the original non-SGML compatible proposal in this area (and it was
similar to yours). I now feel that we should settle for a reasonable, workable
SGML-compatible compromise: CDATA marked sections. Perhaps for SGML 97 we
can get something more flexible (i.e. something that would allow us to
embed ]]).
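
To make the limitation concrete: content that contains the closing delimiter
cannot sit inside a single marked section. One workaround, sketched below in
Python, is to split the content across two adjacent sections so that the
delimiter never appears whole. This assumes the delimiters "<![CDATA[" and
"]]>"; the function name wrap_cdata is hypothetical, and the technique is
offered only as an illustration, not as part of any proposal here.

```python
def wrap_cdata(text):
    """Wrap text in CDATA marked sections, splitting at each ']]>'.

    Each ']]>' in the input is broken into ']]' at the end of one section
    and '>' at the start of the next, so no section contains the closing
    delimiter in its content.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"
```

A reader that concatenates the contents of adjacent sections recovers the
original text, but the author pays for it with noticeably uglier markup.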

 Paul Prescod
Received on Tuesday, 8 October 1996 21:14:29 UTC