[ESW Wiki] Update of "its0503ReqCDATA" by TimFoster from w3t-archive+esw-wiki@w3.org on 2005-06-29 (public-i18n-its@w3.org from April to June 2005)

From: <w3t-archive+esw-wiki@w3.org>
Date: Wed, 29 Jun 2005 11:36:19 -0000
To: w3t-archive+esw-wiki@w3.org
Message-ID: <20050629113619.23844.81598@localhost.localdomain>
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "ESW Wiki" for change notification.

The following page has been changed by TimFoster:
http://esw.w3.org/topic/its0503ReqCDATA


The comment on the change is:
Removed the comments, as I think they're all resolved now - comments welcome tho

------------------------------------------------------------------------------
  in the XML ITS from being used to mark up the localisable components of
  that section of text.
  
- '''[TF] additional text:'''
  In addition, numeric character references and entity references are not supported
  within CDATA sections, which could lead to a possible loss of data if the document
  is converted from one encoding to another where some characters in the CDATA sections
  are not supported.
- '''[TF] end additional text '''
- 
- '''[CL] Norman Walsh's [http://norman.walsh.name/2003/09/16/escmarkup] contains pointers to much of the discussion around escaping and CDATA sections. It would be great if we could get him to have a look at the requirement.'''
  
  == Background ==
  
@@ -44, +40 @@

  CDATA sections around text which contains HTML markup. 
  Since these escaped sections cannot be marked up using the XML ITS, they must be
  examined manually to determine which sections contain translatable text,
- non-translatable text, etc. '''[TF] Additional text ''' For tools authors, there is
- often no way to determine the original format of the text inside the CDATA section (eg. was it HTML, RTF, a base64-encoded OpenOffice.org document etc.) 
- These considerations '''[TF] end additional text, removing "This "''' can result in bottle-necks in
+ non-translatable text, etc. For tools authors, there is often no way to determine 
+ the original format of the text inside the CDATA section (eg. was it HTML, RTF,
+ a base64-encoded OpenOffice.org document etc.)
+ These considerations can result in bottle-necks in
  translation processes while these manual steps are performed. 
  
- 
- '''[YS] Maybe we could also mention that NCRs are not supported in CDATA sections. Something like: ''Numeric character references (NCRs) cannot be used within CDATA sections. This may lead to a possible loss of data if the document is converted from one encoding to another where some of the characters in the CDATA sections are not supported. While there are very few reasons to use an encoding other than UTF-8 for XML documents, localization tasks sometimes require temporarily working with encodings that do not encompass the whole range of Unicode.'' (The third sentence may be too much info).'''
- 
- '''[MD] This is all good advice that ultimately should go into our 'guidelines' document. But somehow, the most fundamental point got missed: CDATA sections are just another way of
- dealing with escaping. In other words, &lt; and <![CDATA[<]]> are exactly equivalent. More to the point, &#x41; and &#65; and <![CDATA[A]]> and A are all equivalent ways of expressing the character "A". In other words, CDATA sections are 'syntactic sugar'.
- So rather than saying "don't use CDATA sections", we should say "don't expect CDATA sections to be preserved, they are on the same level as numeric character references" or
- something similar. This is not crystal clear in the XML Recommendation itself,
- but very clear from the Infoset spec, see
- [http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character].'''
- 
- '''[FS]I think this is also related to the use of general entities - they are not preserved in the infoset either; see also Martin's comments on CDATA in the Wiki and [http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#omitted]'''
- 
- '''[CL] I see a relationship with XLIFF: XLIFF allows you for example to extract/convert sth. like Java properties into XML (namely XLIFF). What you can do in order to avoid CDATA sections is the following: Use the XLIFF representation of your original format. The beauty of this is that XLIFF is on the list of localization tools vendors anyhow.'''
- 
- '''[YS] Thinking more about this: it seems that we have 2 different requirements: one is to not rely on CDATA because of the problems it causes (no support for NCRs, etc.) and the other has to do with codes that are written as text. They often occur in CDATA, but, as Martin noted, one could also remove the CDATA syntax and still have the problem.'''
- 
- '''[YS] Actually, after some more thinking (and a recent real-life case), I wonder if using CDATA is not wiser if you have HTML codes in the text that are 'seen' as text (like: '<![CDATA[This is <b>bold</b>]]>'), because it lets you--in some cases--work around the problem more efficiently. If you replace all the '<![CDATA[' by '<cdata>' (or whatever tag) and ']]>' by '</cdata>', the content becomes parsable by XML filters (if the codes are XHTML) or by HTML parsers (if the codes are HTML) as most HTML filters can process XML input too. After translation we just put back the CDATA notation as it was. We had such a case this week, and we were able to prepare the file in minutes, something that we would not have been able to do if the content had been 'This is &lt;b>bold&lt;/b>'. Obviously it's only one case, but maybe when we come up with a recommendation for CDATA we should be careful about our guidelines. Maybe something to keep in a note for this requirement?'''
- 
- '''[MI] Even with such an approach, still the issue of ''...these escaped sections cannot be marked up using the XML ITS...'' is there. I still think that this requirement is purely for a guideline, not for a solution. If that's the case, we should just leave this requirement as it states issues. Then we build detail guidelines (''Don't use'', ''Don't expect'', whatever...) in the recommendation.'''
- 
- '''[TF] My main problem with CDATA sections is that text within the CDATA section can't be separated out as translatable or non-translatable : authors must mark entire sections as <translate><![CDATA[ askdjhaskjdh ]]></translate> or <donttranslate><![CDATA[ askdjhaskjdh ]]></donttranslate> - there's nothing in between, so I've expanded on that point a little. While I understand Martin's point about the syntactic sugar'ness of CDATA, I'm not sure it's relevant to us here, is it (in terms of explaining XML syntax to our audience, there's probably better places for people to learn, right ? I've added text here explaining Yves' point about character set issues : is this okay ?'''
- 
- '''[YS] Yes. The new text looks just fine for me.'''
-
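
The points made in the removed [MD] and [YS] comments can be checked with a small sketch. It is not part of the wiki page; it assumes Python's standard xml.etree.ElementTree parser, and the element name "t" and the helper text_of are made up for illustration. It shows that escaped and CDATA forms yield the same character data, that a numeric character reference inside a CDATA section stays literal text, and that the CDATA syntax is not preserved on a parse/serialize round trip.

```python
import xml.etree.ElementTree as ET


def text_of(xml_source: str) -> str:
    """Hypothetical helper: parse a one-element document, return its text."""
    return ET.fromstring(xml_source).text


# Four spellings of the same character "A" (the element name "t" is arbitrary).
forms_of_a = [
    "<t>A</t>",
    "<t>&#65;</t>",
    "<t>&#x41;</t>",
    "<t><![CDATA[A]]></t>",
]
texts = [text_of(doc) for doc in forms_of_a]

# "&lt;" and a CDATA-wrapped "<" give the same character data.
lt_escaped = text_of("<t>&lt;</t>")
lt_cdata = text_of("<t><![CDATA[<]]></t>")

# Inside a CDATA section an NCR is not expanded; it is just five characters.
ncr_in_cdata = text_of("<t><![CDATA[&#65;]]></t>")

# This parser does not record that the text came from a CDATA section,
# so serializing the tree writes ordinary escapes instead.
round_trip = ET.tostring(
    ET.fromstring("<t><![CDATA[<b>bold</b>]]></t>"), encoding="unicode"
)

print(texts, lt_escaped == lt_cdata, ncr_in_cdata, round_trip)
```

Other parsers may offer an option to keep CDATA sections; the sketch only illustrates why a tool chain cannot rely on them being preserved.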
Received on Wednesday, 29 June 2005 13:22:06 UTC