Reasons for retaining CDATA section boundaries

As requested by the XML Core WG, I am recording some arguments in
favour of retaining CDATA section boundaries in the Infoset.

An application that receives its input from an XML parser will see no
difference between text escaped with character references and text
escaped with CDATA sections.  However, not every consumer of a
document reads it through a parser, so it may be desirable to
preserve CDATA sections in output.  For example:

 - Text escaped with CDATA sections may be more readable by humans.
   This is especially true for "quoted XML".

 - It may also improve interoperability with non-XML tools.  For
   example, it is entirely reasonable to run "grep" on an XML file, or
   on a directory containing both XML and non-XML files, and a search
   for "AT&T" will not match "AT&amp;T".

 - There may even be applications that extract text (such as scripts)
   from XML documents without parsing, on the assumption that the
   relevant text is contained in a CDATA section.
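
The grep point can be illustrated with a small sketch in Python, using
only the standard library.  The two sample documents and the element
name "company" are made up for illustration.  Both forms give a parser
the same text, but only the CDATA form matches a plain substring
search of the kind grep performs:

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents carrying the same character data.
escaped = "<company>AT&amp;T</company>"
cdata = "<company><![CDATA[AT&T]]></company>"

# A parser reports identical text for both forms.
text_escaped = ET.fromstring(escaped).text
text_cdata = ET.fromstring(cdata).text

# A plain substring search over the raw file, as grep would do,
# finds the literal string only in the CDATA form.
found_in_escaped = "AT&T" in escaped
found_in_cdata = "AT&T" in cdata
```

Here text_escaped and text_cdata are equal, while found_in_escaped is
false and found_in_cdata is true.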

The presence of CDATA section boundaries in the Infoset will encourage
this preservation (though it is not, of course, required for it).

The argument that it may be impossible to output the text in a CDATA
section in some encodings may be irrelevant to the users in question,
since their editors and other tools may well only work with a
particular encoding anyway.  Certainly many users would find it
unacceptable if running XInclude or XSLT on their documents changed
the encoding.
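
For reference, the encoding objection arises because character
references are not recognised inside a CDATA section.  A rough sketch
in Python, where the sample string and the choice of US-ASCII as the
output encoding are assumptions made purely for illustration:

```python
text = "caf\u00e9 & bar"  # U+00E9 is not representable in US-ASCII

# Outside a CDATA section, a character reference can stand in for it.
as_reference = text.replace("&", "&amp;").encode("ascii", "xmlcharrefreplace")

# Inside a CDATA section, "&#233;" would be read as six literal
# characters, so the character itself must be encoded, which fails here.
try:
    ("<![CDATA[" + text + "]]>").encode("ascii")
    cdata_ok = True
except UnicodeEncodeError:
    cdata_ok = False
```

A serialiser in this position has to leave the CDATA section, change
the output encoding, or fail.  For users whose tools all share one
encoding, the case need not arise.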

-- Richard

Received on Wednesday, 31 January 2001 12:08:53 UTC