RE: The <!CDATA issue from Martin Duerst on 2005-03-12 (public-i18n-its@w3.org from January to March 2005)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 12 Mar 2005 11:02:57 +0900
To: "Yves Savourel" <ysavourel@translate.com>, <public-i18n-its@w3.org>
Message-Id: <6.0.0.20.2.20050312105229.09610d20@localhost>

Hello Tim, Yves,

This is all good advice that ultimately should go into our
'guidelines' document. But somehow, the most fundamental
point got missed: CDATA sections are just another way of
dealing with escaping. In other words,
    &lt;
and
    <![CDATA[<]]>
are exactly equivalent. More to the point,
    &#x41;
and
    &#65;
and
    <![CDATA[A]]>
and
    A
are all equivalent ways of expressing the character "A".
In other words, CDATA sections are 'syntactic sugar'.
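
The equivalence can be checked with any conforming parser. Below is a
minimal sketch using Python's standard xml.etree.ElementTree; the
wrapper element name "r" and the helper text_of are made up for the
illustration only.

```python
import xml.etree.ElementTree as ET


def text_of(fragment: str) -> str:
    """Parse a one-element document and return its character content."""
    return ET.fromstring(fragment).text


# Four spellings of the character "A" (hypothetical wrapper element <r>).
forms_of_a = [
    "<r>&#x41;</r>",          # hexadecimal numeric character reference
    "<r>&#65;</r>",           # decimal numeric character reference
    "<r><![CDATA[A]]></r>",   # CDATA section
    "<r>A</r>",               # the literal character
]

# Two spellings of the character "<".
forms_of_lt = [
    "<r>&lt;</r>",
    "<r><![CDATA[<]]></r>",
]

# After parsing, each group collapses to a single value.
print({text_of(f) for f in forms_of_a})
print({text_of(f) for f in forms_of_lt})
```

Once parsed, the application sees only the characters and has no way of
telling which spelling the author used.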

So rather than saying "don't use CDATA sections", we should
say "don't expect CDATA sections to be preserved, they
are on the same level as numeric character references" or
something similar.
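
As a rough illustration of "don't expect them to be preserved", here
is a sketch of a parse-and-reserialize round trip, again assuming
Python's standard xml.etree.ElementTree (other toolkits may offer
options to keep CDATA sections; this one does not by default). The
helper name round_trip and the element name "r" are invented for the
example.

```python
import xml.etree.ElementTree as ET


def round_trip(source: str) -> str:
    """Parse a document and write it straight back out as a string."""
    return ET.tostring(ET.fromstring(source), encoding="unicode")


# The CDATA section is gone after the round trip; the same characters
# come back escaped with entity references instead.
with_cdata = round_trip("<r><![CDATA[<b>]]></r>")
print(with_cdata)

# A numeric character reference suffers the same fate: only the
# character itself survives.
with_ncr = round_trip("<r>&#65;</r>")
print(with_ncr)
```

In both cases the content is unchanged and only the spelling differs,
which is why a tool chain cannot be relied on to keep either form.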

This is not crystal clear in the XML Recommendation itself,
but very clear from the Infoset spec, see
http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character.

Regards,    Martin.


At 06:59 05/03/12, Yves Savourel wrote:
 >
 >Notes at the bottom.
 >
 >> ------------------
 >> Description:
 >>
 >> CDATA sections in XML pose problems to translators and
 >> tools authors that are similar to the problems posed to
 >> other consumers of XML documents: namely, that it is
 >> impossible to know the intended use of the contents of
 >> a CDATA section. The use of CDATA sections in translatable
 >> XML files is strongly discouraged, as they prevent
 >> elements in the XML ITS from being used to mark up the
 >> localisable components of that section of text.
 >>
 >> Background:
 >>
 >> There is a temptation to use CDATA sections in XML files
 >> to escape sections of text that contain characters which
 >> would otherwise be interpreted as XML markup.
 >>
 >> A commonly employed example of this has been seen where
 >> document authors attempt to easily produce an "XML version"
 >> of an input file by inserting CDATA sections around text
 >> which contains HTML markup. Since these escaped sections
 >> cannot be marked up using the XML ITS, they must be
 >> examined manually to determine which sections contain
 >> translatable text, non-translatable text, etc. This can
 >> result in bottle-necks in translation processes while
 >> these manual steps are performed.
 >
 >
 >Looks good to me. Maybe I would add a bit more in the background section.
 >Something about the lack of NCR support in CDATA section.
 >Like:
 >
 >---
 >Another issue is that numeric character references cannot be used within
 >CDATA sections. This may lead to a loss of
 >data if the document is converted from one encoding to another in which
 >some of the characters in the CDATA sections are not
 >supported. While there are very few reasons to use an encoding other than
 >UTF-8 for XML documents, localization tasks sometimes
 >require working temporarily in encodings that do not cover the whole
 >range of Unicode.
 >---
 >
 >Cheers,
 >-yves
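
The encoding point in the quoted note can be sketched as follows,
assuming Python's standard xml.etree.ElementTree and an ASCII target
encoding chosen purely for illustration. The sample text and the
element name "r" are made up.

```python
import xml.etree.ElementTree as ET

# Ordinary element content: a serializer can fall back to a numeric
# character reference for a character the target encoding lacks.
elem = ET.Element("r")
elem.text = "caf\u00e9"
ascii_bytes = ET.tostring(elem, encoding="us-ascii")
print(ascii_bytes)

# Inside a CDATA section the same fallback is not available, because
# "&#233;" there is read back as six literal characters, not as one.
cdata_text = ET.fromstring("<r><![CDATA[caf&#233;]]></r>").text
print(cdata_text)

# A converter that has to keep the CDATA section is therefore left
# with lossy options such as substituting a replacement character.
lossy = "caf\u00e9".encode("ascii", errors="replace")
print(lossy)
```

The first case survives a round trip through the narrower encoding;
the CDATA case does not.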

Received on Sunday, 13 March 2005 02:42:29 UTC