- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 15 Sep 2005 11:57:58 +0900
- To: public-i18n-its@w3.org
Hello Tim, others,

I think the changes made by Tim are very good. However, they still seem to ignore some issues. In particular:

CDATA marked sections are syntactic sugar. They are just a way of escaping, same as numeric character references (NCRs). Any downstream tool can remove the CDATA marked section if it converts the contents accordingly, without changing the meaning of the document at all. As an example, Canonical XML just does away with CDATA sections (see http://www.w3.org/TR/2001/REC-xml-c14n-20010315.html#Terminology).

Given this, phrases such as "impossible to know the intended use of the contents of a CDATA section" don't make sense. It's as impossible as knowing the intended use of a non-marked-up sentence within a paragraph. Turned around, for translators and tools, it is always easy to know the intended use of a CDATA section: no specific intent (except maybe for readability when compared with other forms of escaping).

Regarding the use of CDATA sections for including other formats, there are very much the same issues when using NCRs. Two basic uses have to be distinguished:

1) "citing", i.e. use of the source form, e.g. for educational purposes. In that case, use of escaping (be it NCRs, character entities, or CDATA sections) is the right thing to do. If additional i18n/localization markup is needed, that can be added easily (if necessary after converting from a CDATA section to another form of escaping), at least as easily as for any not yet marked-up text.
An example: Assume we look at a book teaching XHTML, an example for <img>:

<p>And here is how to include an image in XHTML:
<example><![CDATA[<img src='the URI of the image' alt='an alternative text for people not able to view the image' />]]></example>

If we want to add translatability info, we can easily do this as follows:

<p>And here is how to include an image in XHTML:
<example translate='no'>&lt;img src='<span translate='yes'>the URI of the image</span>' alt='<span translate='yes'>an alternative text for people not able to view the image</span>' /&gt;</example>

Except for adding translatability info, we haven't changed the document! If the original author claims that we have changed the document by changing the CDATA section to &lt;,..., then (s)he doesn't understand XML.

2) "embedding": This happens when the format in question (HTML, RTF) is made logically part of the 'hosting' document. A very typical case is the use of HTML in blog formats such as (the various variants of) RSS. In that case, the right solution is to do the embedding via namespaces, not via escaping. Newer blog formats (e.g. Atom) at least provide a way to do that, although they still allow escaping, unfortunately.

For non-XML formats, embedding via namespaces of course doesn't work. One solution is to create an XML version of the format, but that may not be easy, and may not help interoperability. Otherwise, there is only escaping.

There may be specific issues for i18n/l10n tagging for embedded formats (e.g. do things such as translatability inherit across embedding boundaries,...), but they are all unrelated to CDATA sections. And again, when a format has to be escaped (e.g. RTF), the issues there (how to ideally mark up semantic units of that format rather than pieces of its syntax) are unrelated to CDATA sections.

Please feel free to integrate part or all of the above into the Wiki.

Regards, Martin.
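The claim above, that a CDATA section and the same text escaped with character references are the same document, can be checked mechanically. The following is only a minimal sketch in Python, not part of the original proposal. The sample strings and the helper names text_content and canonical are made up for illustration. Note also that Python's xml.etree.ElementTree.canonicalize (Python 3.8+) implements C14N 2.0 rather than the 2001 Canonical XML Recommendation cited above; both replace CDATA sections by their character content.

```python
import xml.etree.ElementTree as ET

# The same <example> element written with two different escaping mechanisms.
# Both strings are illustrative samples, not taken from any real document.
WITH_CDATA = "<example><![CDATA[<img src='a.png' alt='A & B' />]]></example>"
WITH_REFS = "<example>&lt;img src='a.png' alt='A &amp; B' /&gt;</example>"


def text_content(xml_source: str) -> str:
    """Return the character data of the root element after parsing."""
    return ET.fromstring(xml_source).text or ""


def canonical(xml_source: str) -> str:
    """Serialize with ElementTree's canonicalizer (C14N 2.0)."""
    return ET.canonicalize(xml_source)


if __name__ == "__main__":
    # After parsing, the application sees identical character data.
    print(text_content(WITH_CDATA) == text_content(WITH_REFS))
    # After canonicalization, the two serializations are identical too.
    print(canonical(WITH_CDATA) == canonical(WITH_REFS))
```

Under these assumptions both comparisons should hold, which is the sense in which a tool may replace a CDATA section by another form of escaping before adding markup.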
>Date: Tue, 13 Sep 2005 09:58:34 -0000
>X-W3C-Hub-Spam-Status: No, score=-1.9
>Subject: [ESW Wiki] Update of "its0503ReqCDATA" by TimFoster
>X-Archived-At:
>http://www.w3.org/mid/20050913095834.31256.93844@localhost.localdomain
>Resent-From: public-i18n-its@w3.org
>Dear Wiki user,
>
>You have subscribed to a wiki page or wiki category on "ESW Wiki" for
>change notification.
>
>The following page has been changed by TimFoster:
>http://esw.w3.org/topic/its0503ReqCDATA
>
>The comment on the change is:
>changed a lot of the contet, to fit the same model used in other pages
>
>------------------------------------------------------------------------------
> = CDATA Section =
>
>
>- == Description ==
>+ == Summary ==
>
>- CDATA sections in XML pose problems to translators and tools authors
>- that are similar to the problems posed to other consumers of XML
>- documents: that is, that it is impossible to know the intended use of
>- the contents of a CDATA section. The use of CDATA sections in
>- translatable XML files is strongly discouraged, as they prevent elements
>- in the XML ITS from being used to mark up the localisable components of
>- that section of text.
>+ CDATA sections are discouraged, as their contents cannot easily be marked up
>+ by using elements from any proposed internationalization tag set.
>+
>+ == Challenges ==
>+
>+ For translators, and other document consumers, given any section of CDATA,
>+ it's difficult to know the intended use of the contents of a CDATA section.
>+
>+ The use of CDATA sections in translatable XML files is discouraged, as they
>+ prevent any elements in a proposed XML internationalization tag set from being
>+ used to mark up the localisable components of that section of text, although
>+ the entire CDATA section could be wrapped in additional tags.
>
> In addition, numeric character references and entity references are not supported
> within CDATA sections, which could lead to a possible loss of data if the document
>- is converted from one encoding to another where some characters in the CDATA sections
>+ is converted from one encoding to another where some characters in
>- are not supported.
>+ the CDATA sections are not supported.
>
>- == Background ==
>+
>+ == Notes ==
>
> There is a temptation to use CDATA sections in XML files to escape
> sections of text that contain characters which would otherwise be
>@@ -37, +43 @@
>
>
> A commonly employed example of this has been seen where document authors
> attempt to easily produce an "XML version" of an input file by inserting
>- CDATA sections around text which contains HTML markup.
>+ CDATA sections around text which contains HTML markup.
>- Since these escaped sections cannot be marked up using the XML ITS, they must be
>- examined manually to determine which sections contain translatable text,
>- non-translatable text, etc. For tools authors, there is often no way to determine
>- the original format of the text inside the CDATA section (eg. was it HTML, RTF,
>- a base64-encoded OpenOffice.org document etc.)
>- These considerations can result in bottle-necks in
>- translation processes while these manual steps are performed.
>
>+ Since the contents of these escaped sections cannot be marked up using the
>+ XML ITS, they must be examined manually to determine which parts of the content
>+ contain translatable text, non-translatable text, etc. For tools authors, there is often
>+ no way to determine the original format of the text inside the CDATA section
>+ (eg. was it HTML, RTF, a base64-encoded OpenOffice.org document etc.)
>+
>+ These considerations can result in bottle-necks in translation processes while these
>+ manual steps are performed.
>+
Received on Thursday, 15 September 2005 02:59:20 UTC