CDATA is syntactic sugar (was: Re: [ESW Wiki] Update of "its0503ReqCDATA" by TimFoster) from Martin Duerst on 2005-09-15 (public-i18n-its@w3.org from July to September 2005)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 15 Sep 2005 11:57:58 +0900
To: public-i18n-its@w3.org
Message-Id: <6.0.0.20.2.20050915111641.07732dc0@localhost>
Hello Tim, others,

I think the changes made by Tim are very good. However, they
still seem to ignore some issues. In particular:

CDATA marked sections are syntactic sugar. They are just a way of
escaping, same as numeric character references (NCRs). Any downstream
tool can remove the CDATA marked section if it converts the contents
accordingly, without changing the meaning of the document at all.
As an example, Canonical XML just does away with CDATA sections
(see http://www.w3.org/TR/2001/REC-xml-c14n-20010315.html#Terminology).
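
The equivalence described above can be checked with any conforming parser. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a made-up <example> element; the variable names are illustrative only. It parses the same content written as a CDATA section, with predefined entities, and with NCRs, then re-serializes the CDATA form.

```python
import xml.etree.ElementTree as ET

# The same character data, escaped three different ways.
cdata_form = "<example><![CDATA[<img src='a.png' alt='A & B' />]]></example>"
entity_form = "<example>&lt;img src='a.png' alt='A &amp; B' /&gt;</example>"
ncr_form = "<example>&#60;img src='a.png' alt='A &#38; B' /&#62;</example>"

# After parsing, nothing records which form of escaping was used.
texts = [ET.fromstring(doc).text for doc in (cdata_form, entity_form, ncr_form)]
all_same = len(set(texts)) == 1

# Re-serializing the CDATA form: this serializer writes entities instead,
# so the CDATA section disappears while the content stays the same.
reserialized = ET.tostring(ET.fromstring(cdata_form), encoding="unicode")

print(texts[0])
print(all_same)
print(reserialized)
```

The parser hands the application identical text in all three cases, so a tool working on the parsed document cannot even tell that a CDATA section was there.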

Given this, phrases such as "impossible to know the intended use of
the contents of a CDATA section" don't make sense. It's as impossible
as knowing the intended use of a non-marked-up sentence within a paragraph.
Turned around, for translators and tools, it is always easy to know the
intended use of a CDATA section: no specific intent (except maybe for
readability when compared with other forms of escaping).

Regarding the use of CDATA sections for including other formats,
there are very much the same issues when using NCRs. Two basic uses
have to be distinguished:
1) "citing", i.e. use of the source form e.g. for educational purposes.
    In that case, use of escaping (be it NCRs, character entities, or
    CDATA sections) is the right thing to do. If additional i18n/localization
    markup is needed, that can be added easily (if necessary after
    converting from a CDATA section to another form of escaping), at least
    as easily as for any not yet marked-up text.

    An example:
    Assume we look at a book teaching XHTML, an example for <img>:
       <p>And here is how to include an image in XHTML:
       <example><![CDATA[<img
         src='the URI of the image'
         alt='an alternative text for people not
              able to view the image' />]]></example></p>
    If we want to add translatability info, we can easily do this as follows:
       <p>And here is how to include an image in XHTML:
       <example translate='no'>&lt;img
         src='<span translate='yes'>the URI of the image</span>'
         alt='<span translate='yes'>an alternative text for people not
              able to view the image</span>' /></example></p>

    Except for adding translatability info, we haven't changed the document!
    If the original author claims that we have changed the document by changing
    the CDATA section to &lt;,..., then (s)he doesn't understand XML.

2) "embedding": This happens when the format in question (HTML, RTF) is
    made logically part of the 'hosting' document. A very typical case is
    the use of HTML in blog formats such as (the various variants of) RSS.
    In that case, the right solution is to do the embedding via namespaces,
    not via escaping. Newer blog formats (e.g. Atom) at least provide a way
    to do that, although they still allow escaping, unfortunately.
    For non-XML formats, embedding via namespaces of course doesn't work.
    One solution is to create an XML version of the format, but that may
    not be easy, and may not help interoperability. Otherwise, there is
    only escaping.
    There may be specific issues for i18n/l10n tagging for embedded
    formats (e.g. do things such as translatability inherit across
    embedding boundaries,...), but they are all unrelated to CDATA sections.
    And again when a format has to be escaped (e.g. RTF), the issues
    there (how to ideally mark up semantic units of that format rather
    than pieces of its syntax) are unrelated to CDATA sections.
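
Both uses listed above can be sketched in a few lines. This is only an illustration, assuming Python's standard xml.etree.ElementTree module. The <example> and <span> elements and the translate attribute are taken from the example in this message; the Atom namespace URI and the type='html' / type='xhtml' values reflect the Atom format as I understand it; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Case 1, "citing": the CDATA original and an escaped, annotated variant.
cited_cdata = """<p>How to include an image:
<example><![CDATA[<img src='the URI of the image' />]]></example></p>"""

cited_annotated = """<p>How to include an image:
<example translate='no'>&lt;img src='<span translate='yes'>the URI of the image</span>' /></example></p>"""

def character_data(doc):
    """Concatenate all character data inside <example>, ignoring markup."""
    return "".join(ET.fromstring(doc).find("example").itertext())

def translatable_spans(doc):
    """Collect the text of every <span> flagged translate='yes'."""
    return [s.text for s in ET.fromstring(doc).iter("span")
            if s.get("translate") == "yes"]

# Case 2, "embedding": HTML escaped into a string vs. XHTML in its own namespace.
ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

escaped_entry = f"""<entry xmlns="{ATOM}">
  <content type="html">&lt;p&gt;Hello &lt;em&gt;world&lt;/em&gt;&lt;/p&gt;</content>
</entry>"""

embedded_entry = f"""<entry xmlns="{ATOM}">
  <content type="xhtml">
    <div xmlns="{XHTML}"><p>Hello <em>world</em></p></div>
  </content>
</entry>"""

def embedded_tags(doc):
    """List the element names a generic XML tool can see below <content>."""
    content = ET.fromstring(doc).find(f"{{{ATOM}}}content")
    return [el.tag for el in content.iter() if el is not content]

print(character_data(cited_cdata) == character_data(cited_annotated))
print(translatable_spans(cited_annotated))
print(embedded_tags(escaped_entry))
print(embedded_tags(embedded_entry))
```

In the citing case the character data is unchanged by the added markup, and the translatable parts can be pulled out directly. In the embedding case only the namespace variant exposes the <p> and <em> elements to a generic XML tool; the escaped variant is an opaque string, whichever form of escaping was used.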

Please feel free to integrate part or all of the above into the Wiki.

Regards,    Martin.

 >Date: Tue, 13 Sep 2005 09:58:34 -0000
 >X-W3C-Hub-Spam-Status: No, score=-1.9
 >Subject: [ESW Wiki] Update of "its0503ReqCDATA" by TimFoster
 >X-Archived-At:
 >http://www.w3.org/mid/20050913095834.31256.93844@localhost.localdomain
 >Resent-From: public-i18n-its@w3.org

 >Dear Wiki user,
 >
 >You have subscribed to a wiki page or wiki category on "ESW Wiki" for
 >change notification.
 >
 >The following page has been changed by TimFoster:
 >http://esw.w3.org/topic/its0503ReqCDATA
 >
 >The comment on the change is:
 >changed a lot of the content, to fit the same model used in other pages
 >
 >------------------------------------------------------------------------------
 >  = CDATA Section =
 >
 >
 >- == Description ==
 >+ == Summary ==
 >
 >- CDATA sections in XML pose problems to translators and tools authors
 >- that are similar to the problems posed to other consumers of XML
 >- documents: that is, that it is impossible to know the intended use of
 >- the contents of a CDATA section. The use of CDATA sections in
 >- translatable XML files is strongly discouraged, as they prevent elements
 >- in the XML ITS from being used to mark up the localisable components of
 >- that section of text.
 >+ CDATA sections are discouraged, as their contents cannot easily be marked up
 >+ by using elements from any proposed internationalization tag set.
 >+
 >+ == Challenges ==
 >+
 >+ For translators, and other document consumers, given any section of CDATA,
 >+ it's difficult to know the intended use of the contents of a CDATA section.
 >+
 >+ The use of CDATA sections in translatable XML files is discouraged, as they
 >+ prevent any elements in a proposed XML internationalization tag set from being
 >+ used to mark up the localisable components of that section of text, although
 >+ the entire CDATA section could be wrapped in additional tags.
 >
 >  In addition, numeric character references and entity references are not supported
 >  within CDATA sections, which could lead to a possible loss of data if the document
 >- is converted from one encoding to another where some characters in the CDATA sections
 >+ is converted from one encoding to another where some characters in
 >- are not supported.
 >+ the CDATA sections are not supported.
 >
 >- == Background ==
 >+
 >+ == Notes ==
 >
 >  There is a temptation to use CDATA sections in XML files to escape
 >  sections of text that contain characters which would otherwise be
 >@@ -37, +43 @@
 >
 >
 >  A commonly employed example of this has been seen where document authors
 >  attempt to easily produce an "XML version" of an input file by inserting
 >- CDATA sections around text which contains HTML markup.
 >+ CDATA sections around text which contains HTML markup.
 >- Since these escaped sections cannot be marked up using the XML ITS, they must be
 >- examined manually to determine which sections contain translatable text,
 >- non-translatable text, etc. For tools authors, there is often no way to determine
 >- the original format of the text inside the CDATA section (eg. was it HTML, RTF,
 >- a base64-encoded OpenOffice.org document etc.)
 >- These considerations can result in bottle-necks in
 >- translation processes while these manual steps are performed.
 >
 >+ Since the contents of these escaped sections cannot be marked up using the
 >+ XML ITS, they must be examined manually to determine which parts of the content
 >+ contain translatable text, non-translatable text, etc. For tools authors, there is often
 >+ no way to determine the original format of the text inside the CDATA section
 >+ (eg. was it HTML, RTF, a base64-encoded OpenOffice.org document etc.)
 >+
 >+ These considerations can result in bottle-necks in translation processes while these
 >+ manual steps are performed.
 >+
Received on Thursday, 15 September 2005 02:59:20 UTC