Against CDATA sections in the Infoset from Francois Yergeau on 2001-01-26 (www-xml-infoset-comments@w3.org from January to March 2001)

From: Francois Yergeau <FYergeau@alis.com>
Date: Fri, 26 Jan 2001 11:09:43 -0500
To: www-xml-infoset-comments@w3.org
Message-ID: <8F23CC69DF9ED411BF6E00010267B0F808F9E3@VOYAGER>
This is to recapitulate the arguments against having CDATA section markers
in the Infoset, an issue on which the WG has not reached consensus in the
Last Call WD.

Before starting with the actual arguments, it is good to remember that the
Infoset spec does not enumerate *all* of the information that can be gleaned
from parsing an XML document.  Appendix D lists 17 kinds of "things" that
are not in the Infoset.  The spec therefore takes a stand: it distinguishes
relevant information from information that is purely an artefact of the
encoding (in XML) of the relevant information.  This is OK: nobody cares
whether single or double quotes surround an attribute value; it is the
value itself that matters.  My position is that
CDATA sections are just as irrelevant and that calling them otherwise in the
Infoset spec would not only be inconsistent but would also be harmful to
some internationalization concerns (details below).

The big picture: CDATA sections are defined in section 2 of the XML 1.0
spec.  Section 3 defines the logical structure (elements and attributes),
whereas section 4 defines the physical structure (entities).  This supports
the interpretation that CDATA sections are part neither of the logical
structure nor of the physical structure; they are just syntactic devices.
This is also supported by the absence of CDATA sections in the second
paragraph of section 2: "Physically, the document is composed of units
called entities. [...] Logically, the document is composed of declarations,
elements, comments, character references, and processing instructions, all
of which..."  Very clearly, CDATA sections do *not* define structure like
elements and entities do.

More specifically: The definition of CDATA sections (in 2.7 of XML 1.0) does
not give any meaning to CDATA sections, other than providing them as a
syntactic construct "to escape blocks of text containing characters which
would otherwise be recognized as markup".  Nothing else in the spec gives
them any other meaning or status, as can be ascertained by searching for
'CDATA' and reading the context of every occurrence.  CDATA sections are
certainly convenient in certain situations, but since the spec does not
give them any meaning in themselves, they are at the same level as, say,
white space within start tags: pure syntactic sugar.

Some specs "make use" of CDATA sections by recommending their use in certain
situations: XHTML recommends them for scripts and style sheets (see [1] for
a discussion) and the SVG CR requires them (wrongly, see [2]) for style
sheets.  It is noteworthy that these "uses", whether required, recommended
or merely optional, do not create any requirement for CDATA sections to be
in the infoset.  This can be understood by comparing

<html>
 <script><![CDATA[Document.write("<p>Hello!</p>");]]></script>
</html>

with

<html>
 <script>Document.write("&lt;p>Hello!&lt;/p>");</script>
</html>

[note the escaping of < in the second one] Any conforming XML parser will
report that the content of the <script> element is
'Document.write("<p>Hello!</p>");', irrespective of whether it reports that
it saw a CDATA section or not; the content of <script> is of course what
matters, i.e. what gets passed to the script processor.  In short, whether
the script (or style sheet) is packaged in a CDATA section is irrelevant to
further processing; it is no more relevant to the infoset than the amount of
white space in a tag such as <br />.
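
The comparison above can be checked mechanically.  What follows is a minimal
sketch in Python, assuming the standard-library xml.etree.ElementTree parser
(chosen here only for illustration; it is not mentioned in the XML or Infoset
specs).  The names with_cdata, with_escapes and script_content are
hypothetical.

```python
import xml.etree.ElementTree as ET

# The two documents from the text: one uses a CDATA section,
# the other escapes each '<' as &lt;.
with_cdata = (
    '<html>\n'
    ' <script><![CDATA[Document.write("<p>Hello!</p>");]]></script>\n'
    '</html>'
)
with_escapes = (
    '<html>\n'
    ' <script>Document.write("&lt;p>Hello!&lt;/p>");</script>\n'
    '</html>'
)


def script_content(document):
    """Return the character content of the <script> child of the root."""
    return ET.fromstring(document).find("script").text


# Both spellings yield the same character data for <script>.
assert script_content(with_cdata) == script_content(with_escapes)
print(script_content(with_cdata))
```

Both calls return the string quoted in the text; nothing in the parsed result
records which of the two spellings was used.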

It has been argued that having CDATA sections in the Infoset is necessary
for round-tripping, i.e. reading in a document, producing its infoset and
serializing that infoset to recover the original document. The fact is that
the Infoset does not support round-tripping; numerous pieces of syntactic
"info" are left out and it is impossible to exactly reconstruct the document
from its infoset, whether CDATA sections are represented or not.
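
The loss of syntactic detail can be illustrated with a short sketch.  It
assumes Python's xml.etree.ElementTree, whose data model is not the Infoset
but which discards a similar set of details (quote style, white space inside
tags).  The input document is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: single quotes around the attribute value, extra
# white space inside the start tag, and an empty-element tag.
source = "<doc   lang='fr'><br/></doc>"

tree = ET.fromstring(source)
rewritten = ET.tostring(tree, encoding="unicode")

# The elements and attributes survive, but the exact characters do not.
print(rewritten)
assert rewritten != source
assert ET.fromstring(rewritten).attrib == tree.attrib
```

The serializer has to pick its own quotes and spacing because the parsed
form no longer says what the original used.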

After all this argumentation about the absence of reasons for including
CDATA sections in the infoset, there comes a reason for *not* having them:
they are harmful to i18n. The problem is that one cannot have character or
entity references in a CDATA section, while those references may be
necessary when serializing a document in a character encoding not containing
some of the characters in the document (see [3] for more details).  In those
situations it is important that the "serializer" (whatever that is) not be
constrained to preserve an existing CDATA section, that it be free to not
use one or to split one into two and place a character reference in between,
just as it is free to choose what kind of quotes to put around attribute
values.  This implies that CDATA sections must not be given a meaning which
they don't have.
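
A sketch of the splitting described above, in Python.  The helper
cdata_for_encoding is hypothetical and is only one possible strategy; it
assumes the text does not itself contain the sequence "]]>", and it uses
xml.etree.ElementTree only to confirm that the content is unchanged.

```python
import xml.etree.ElementTree as ET


def cdata_for_encoding(text, encoding):
    """Write text as CDATA sections, splitting around unencodable characters.

    Hypothetical helper; assumes text does not contain the sequence "]]>".
    """
    pieces = []
    run = []  # characters that can stay inside the current CDATA section

    def close_section():
        if run:
            pieces.append("<![CDATA[" + "".join(run) + "]]>")
            run.clear()

    for ch in text:
        try:
            ch.encode(encoding)
            run.append(ch)
        except UnicodeEncodeError:
            # A character reference is not recognized inside a CDATA
            # section, so the section is closed, the reference written,
            # and a new section opened for whatever follows.
            close_section()
            pieces.append("&#%d;" % ord(ch))
    close_section()
    return "".join(pieces)


script = 'alert("<b>café</b>");'
markup = "<script>" + cdata_for_encoding(script, "ascii") + "</script>"
markup.encode("ascii")  # succeeds: every character is now representable
assert ET.fromstring(markup).text == script
print(markup)
```

The single CDATA section becomes two, with a character reference between
them, and a parser still reports the same content.  A serializer obliged to
preserve the original section boundaries could not do this.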

Many people have an "intuitive" feeling that CDATA sections are more than
syntactic sugar, perhaps because of their having start and end "tags" much
like elements.  This is not supported by the XML spec in any way, yet the
belief itself seems to be pretty widespread.  As shown above, however,
ascribing *any* meaning to CDATA sections that would make them worthy of
being preserved is harmful to i18n, so it is important for the Infoset spec
not to encourage it, and even to discourage it, by being consistent in not
containing them, just as it doesn't contain the other purely syntactic
details.

In summary:
- There is no support for any meaning for CDATA sections; they don't
participate in either the logical or physical structure of a document and
their presence does not carry any relevant information.
- Even the specs that "use" them have no need for them in the Infoset. It
suffices that they are usable when writing out (serializing) a document,
which is of course the case.
- Ascribing them a meaning would be harmful for i18n.
- Having them in the Infoset would be inconsistent with the other choices
made about which information is relevant and would encourage the misguided
belief that they are anything other than a syntactic device.

--
François Yergeau 

[1] http://lists.w3.org/Archives/Member/w3c-xml-core-wg/2001JanMar/0075.html
[2] http://lists.w3.org/Archives/Member/svg-comments/2001JanMar/0009.html
[3] http://www.w3.org/International/Group/1998/12/NOTE-i18n-rev-xml-19981221#CDATA
Received on Friday, 26 January 2001 11:16:22 UTC