- From: Francois Yergeau <FYergeau@alis.com>
- Date: Fri, 26 Jan 2001 11:09:43 -0500
- To: www-xml-infoset-comments@w3.org
This is to recapitulate the arguments against having CDATA section markers in the Infoset, an issue on which the WG has not reached consensus in the Last Call WD.

Before starting with the actual arguments, it is good to remember that the Infoset spec does not enumerate *all* of the information that can be gleaned from parsing an XML document. Appendix D lists 17 kinds of "things" that are not in the Infoset. The spec therefore takes a stand, makes a choice between information that is relevant and irrelevant information that is purely an artefact of the encoding (in XML) of the relevant information. This is OK: nobody cares whether single or double quotes surround an attribute value, it is the value itself that matters. My position is that CDATA sections are just as irrelevant and that calling them otherwise in the Infoset spec would not only be inconsistent but would also be harmful to some internationalization concerns (details below).

The big picture: CDATA sections are defined in section 2 of the XML 1.0 spec. Section 3 defines the logical structure (elements and attributes), whereas section 4 defines the physical structure (entities). This supports the interpretation that CDATA sections are part neither of the logical structure nor of the physical structure; they are just syntactic devices. This is also supported by the absence of CDATA sections in the second paragraph of section 2:

  "Physically, the document is composed of units called entities. [...] Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which..."

Very clearly, CDATA sections do *not* define structure like elements and entities do.

More specifically: The definition of CDATA sections (in 2.7 of XML 1.0) does not give any meaning to CDATA sections, other than providing them as a syntactic construct "to escape blocks of text containing characters which would otherwise be recognized as markup".
And nothing else in the spec gives them any other meaning or status, as can be ascertained by searching for 'CDATA' and reading around every occurrence. CDATA sections are certainly convenient in certain situations, but since the spec does not give them any meaning in themselves, they are at the same level as, say, white space within start tags: pure syntactic sugar.

Some specs "make use" of CDATA sections by recommending their use in certain situations: XHTML recommends them for scripts and style sheets (see [1] for a discussion) and the SVG CR requires them (wrongly, see [2]) for style sheets. It is noteworthy that these "uses", whether required, recommended or merely optional, do not create any requirement for CDATA sections to be in the infoset. This can be understood by comparing

  <html>
  <script><![CDATA[Document.write("<p>Hello!</p>");]]></script>
  </html>

with

  <html>
  <script>Document.write("&lt;p>Hello!&lt;/p>");</script>
  </html>

[note the escaping of < in the second one]

Any conforming XML parser will report that the content of the <script> element is 'Document.write("<p>Hello!</p>");', irrespective of whether it reports that it saw a CDATA section or not; the content of <script> is of course what matters, i.e. what gets passed to the script processor. In short, whether the script (or style sheet) is packaged in a CDATA section is irrelevant to further processing; it is no more relevant to the infoset than the amount of white space in a tag such as <br />.

It has been argued that having CDATA sections in the Infoset is necessary for round-tripping, i.e. reading in a document, producing its infoset and serializing that infoset to recover the original document. The fact is that the Infoset does not support round-tripping; numerous pieces of syntactic "info" are left out and it is impossible to exactly reconstruct the document from its infoset, whether CDATA sections are represented or not.
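
The <script> comparison above can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module as a stand-in for "any conforming XML parser"; the names WITH_CDATA, WITH_ESCAPES and script_text are illustrative and come from no spec. It parses both forms of the document, compares the content reported for <script>, and then serializes the parsed tree again to show that the output is a different but equivalent document.

```python
import xml.etree.ElementTree as ET

# The same script, once inside a CDATA section and once with "<" escaped.
WITH_CDATA = (
    '<html><script><![CDATA[Document.write("<p>Hello!</p>");]]></script></html>'
)
WITH_ESCAPES = (
    '<html><script>Document.write("&lt;p>Hello!&lt;/p>");</script></html>'
)


def script_text(document: str) -> str:
    """Parse the document and return the character content of <script>."""
    return ET.fromstring(document).find("script").text


cdata_text = script_text(WITH_CDATA)
escaped_text = script_text(WITH_ESCAPES)

# Both forms yield the same character content for <script>.
print(cdata_text == escaped_text)  # True
print(cdata_text)

# Serializing the tree built from the CDATA form: this serializer chooses
# its own escaping and emits no CDATA section, so the original bytes are
# not reproduced, yet the content of <script> is unchanged.
reserialized = ET.tostring(ET.fromstring(WITH_CDATA), encoding="unicode")
print(reserialized)
print(script_text(reserialized) == cdata_text)  # True
```

This particular parser does not report CDATA section boundaries at all, and the script processor downstream would have no way to notice.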
After all this argumentation about the absence of reasons for including CDATA sections in the infoset, there comes a reason for *not* having them: they are harmful to i18n. The problem is that one cannot have character or entity references in a CDATA section, while those references may be necessary when serializing a document in a character encoding not containing some of the characters in the document (see [3] for more details). In those situations it is important that the "serializer" (whatever that is) not be constrained to preserve an existing CDATA section, that it be free to not use one or to split one into two and insert a character reference in between, just as it is free to choose what kind of quotes to put around attribute values.

This implies that CDATA sections must not be given a meaning which they don't have. Many people have an "intuitive" feeling that CDATA sections are more than syntactic sugar, perhaps because of their having start and end "tags" much like elements. This is not supported by the XML spec in any way, yet the belief itself seems to be pretty widespread. As shown above, however, ascribing *any* meaning to CDATA sections that would make them worthy of being preserved is harmful to i18n, so it is important for the Infoset spec to not encourage it, even discourage it, by being consistent in not containing them, just like it doesn't contain the other purely syntactic details.

In summary:

- There is no support for any meaning for CDATA sections; they don't participate in either the logical or physical structure of a document and their presence does not carry any relevant information.
- Even the specs that "use" them have no need for them in the Infoset. It suffices that they are usable when writing out (serializing) a document, which is of course the case.
- Ascribing them a meaning would be harmful for i18n.
- Having them in the Infoset would be inconsistent with the other choices made of relevant information and would encourage the misguided belief that they are anything other than a syntactic device.

-- 
François Yergeau

[1] http://lists.w3.org/Archives/Member/w3c-xml-core-wg/2001JanMar/0075.html
[2] http://lists.w3.org/Archives/Member/svg-comments/2001JanMar/0009.html
[3] http://www.w3.org/International/Group/1998/12/NOTE-i18n-rev-xml-19981221#CDATA
Received on Friday, 26 January 2001 11:16:22 UTC