Re: "<" and ">" within text nodes

A text node can and does contain arbitrary text, limited only to the
characters the XML standard allows (character 0, for example, is not allowed).

During output, the text of a text node is written with <, >, and & encoded
as &lt;, &gt;, and &amp;.  This is not ad hoc.  This is how it is done.
Likewise, a CDATASection may contain arbitrary text, and may encode "]]>" as
"]]]]><![CDATA[>", which ends one section after the "]]" and starts a new
one for the ">".
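
As a rough illustration of the two encodings above, here is a minimal
Python sketch using only the standard library.  The helper names
(escape_text, cdata_section, text_of) and the sample string are made up for
this example, and this is one way to do it, not the only one.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


def escape_text(s):
    # escape() replaces & first, then < and >, giving &amp;, &lt;, &gt;.
    return escape(s)


def cdata_section(s):
    # Split every "]]>" so that "]]" ends one section and ">" starts the next.
    return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def text_of(xml_string):
    # Parse a one-element document and join the data of all of its child
    # nodes (Text and CDATASection nodes both carry a .data attribute).
    root = minidom.parseString(xml_string).documentElement
    return "".join(child.data for child in root.childNodes)


raw = "a < b && x[y[0]]> c"

as_text = "<t>" + escape_text(raw) + "</t>"
as_cdata = "<t>" + cdata_section(raw) + "</t>"

print(as_text)
print(as_cdata)
print(text_of(as_text) == raw, text_of(as_cdata) == raw)
```

Both forms should parse back to the same characters that went in, with no
special decoding routine on the reading side.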

There is more than one way to encode this, but I think the above is normal.
Additionally, within an attribute value quoted with double quotes, you
escape the quote as &quot;.  An attribute value may not contain naked syntax
that looks like tags, so < and & are still escaped there, too.  Actually, the
reason for escaping > as &gt; seems to be to avoid placing "]]>" anywhere in
content, since there it is legal ONLY as the termination of a CDATASection.

In an entity declaration you generally quote the value with single quotes so
that attributes appearing inside embedded elements can be quoted with double
quotes.  That means a single quote needs to be escaped as &apos; there, and
you also seem to need to escape percent as a character reference (&#37;) so
that it is not confused with parameter entity references, which may occur in
the value of an entity declaration.
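
For the attribute case only, a small Python sketch might look like the
following.  The function name attr_value and the element/attribute names are
invented for the example; it leaves out entity declarations entirely.  It
passes an extra mapping to xml.sax.saxutils.escape so that the double quote
is handled along with &, <, and >.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


def attr_value(s):
    # Escape &, <, > and additionally the double quote, then wrap the
    # result in double quotes so it can be dropped into a start tag.
    return '"' + escape(s, {'"': "&quot;"}) + '"'


raw = 'say "hi" & <go>'
tag = "<t a=" + attr_value(raw) + "/>"
print(tag)

# Reading it back through a DOM parser should return the original string.
parsed = minidom.parseString(tag).documentElement.getAttribute("a")
print(parsed == raw)
```

Note that parsers normalize tabs and newlines inside attribute values to
spaces, so those would need character references as well if they have to
survive a round trip.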

I could have missed something here, but I think this is how it goes.  I
cannot tell you for certain that SAX will interpret it the same way, but
DOM expects these entity and character references to be converted
into characters, so that is how you round-trip in DOM.
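
To make the round trip concrete, here is a hedged Python sketch that lets a
DOM implementation (xml.dom.minidom, chosen only because it ships with
Python) do the escaping on output, then reads the result back with both a
DOM parser and a SAX parser.  The TextCollector class and the sample string
are assumptions made for the example.

```python
import xml.sax
from xml.dom import minidom


class TextCollector(xml.sax.ContentHandler):
    """Collects character data; SAX may deliver it in several chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


raw = "x < y && y > z, then ]]> and more"

# Build <t>raw</t> through the DOM and let the serializer do the escaping.
doc = minidom.Document()
root = doc.createElement("t")
root.appendChild(doc.createTextNode(raw))
doc.appendChild(root)
serialized = doc.toxml()
print(serialized)

# Read it back with a DOM parser ...
dom_root = minidom.parseString(serialized).documentElement
dom_text = "".join(node.data for node in dom_root.childNodes)

# ... and with a SAX parser.
collector = TextCollector()
xml.sax.parseString(serialized.encode("utf-8"), collector)
sax_text = "".join(collector.chunks)

print(dom_text == raw, sax_text == raw)
```

With this particular parser pair, both readers should hand back the original
characters; other SAX implementations would need to be checked separately.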

Speaking for myself only,

Ray Whitmer
ray@imall.com

On Fri, 16 Apr 1999, Larry Watanabe wrote:

>
>Text nodes cannot contain arbitrary text; in particular "<" and ">" will
>cause SAX parse errors when the node is read back in. It is possible to
>enclose this text within a CDATA section, but then there is the equivalent
>problem with the CDATA terminator. In addition, CDATA may be undesirable for
>other reasons (e.g. external requirements).
>
>These characters can be encoded as "&lt" and "&gt", which also requires that
>"&" be encoded as "&amp". However, this seems like a) an ad hoc solution,
>and b) something which has probably already been solved. 
>
>Q: Does anyone know of a general encoding routine for encoding the text
>within a Text node that 
>
>	a) preserves information; the same text read in by a SAX parser will
>be converted to the correct characters without the use of a special decoding
>routine?
>	b) handles all other cases besides "<" and ">" if there are any?
>
>Thank you.
>
>-Larry Watanabe  lwatanab@jetform.com
>
>

Received on Friday, 16 April 1999 13:15:00 UTC