- From: Ray D. Whitmer <rayw@imall.com>
- Date: Fri, 16 Apr 1999 11:14:30 -0600 (MDT)
- To: Larry Watanabe <LWatanab@JetForm.com>
- cc: www-dom@w3.org
A text node can and does contain arbitrary text, within the valid characters allowed by the XML standard (character 0, for example, is not allowed). During output, a text node may encode <, >, and & as &lt;, &gt;, and &amp;. This is not ad hoc. This is how it is done.

Likewise, a CDATASection may contain arbitrary text, and may encode "]]>" as "]]]]><![CDATA[>", which ends one section after the "]]" and opens a new one for the ">". There is more than one way to encode this, but I think the above is normal.

Additionally, within an attribute value quoted with double quotes, you escape the quote as &quot;. An attribute may not contain naked syntax that looks like tags, so these are still escaped, too. Actually, the reason for escaping > seems to be to avoid placing "]]>" anywhere in the document, since that is legal ONLY as the termination of a CDATASection.

And in an entity declaration you generally quote using single quotes so that attributes appearing inside embedded elements can be quoted with double quotes, which means that the single quote needs to be escaped as &apos; here, too. You also seem to need to escape percent as a character reference (such as &#37;) so that it is not confused with parameter entity references, which may occur in the value of an entity declaration.

I could have missed something here, but I think this is how it goes.

I cannot tell you for certain that SAX will interpret it the same, but DOM expects these entity and character references to be converted into characters, so that is how you round-trip in DOM.

Speaking for myself only,

Ray Whitmer
ray@imall.com

On Fri, 16 Apr 1999, Larry Watanabe wrote:

>
>Text nodes cannot contain arbitrary text; in particular "<" and ">" will
>cause SAX parse errors when the node is read back in. It is possible to
>enclose this text within a CDATA section, but then there is the equivalent
>problem with the CDATA terminator. In addition, CDATA may be undesirable for
>other reasons (e.g. external requirements).
>
>These characters can be encoded as "&lt;" and "&gt;", which also requires that
>"&" be encoded as "&amp;". However, this seems like a) an ad hoc solution,
>and b) something which has probably already been solved.
>
>Q: Does anyone know of a general encoding routine for encoding the text
>within a Text node that
>
>   a) preserves information; the same text read in by a SAX parser will
>be converted to the correct characters without the use of a special decoding
>routine?
>   b) handles all other cases besides "<" and ">" if there are any?
>
>Thank you.
>
>-Larry Watanabe lwatanab@jetform.com
>
>
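
For illustration only, the escaping rules discussed in this thread can be sketched in a few lines of Python. This is not code from either poster. The function names (escape_text, escape_attr, cdata_section, round_trip) are hypothetical, and Python's xml.dom.minidom is assumed here as a stand-in for whatever DOM or SAX parser reads the document back in.

```python
import xml.dom
import xml.dom.minidom


def escape_text(s: str) -> str:
    # "&" must be handled first so the ampersands added below are not re-escaped.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s: str) -> str:
    # For an attribute value that will be wrapped in double quotes.
    return escape_text(s).replace('"', "&quot;")


def cdata_section(s: str) -> str:
    # Split each "]]>" so that "]]" closes one section and ">" opens the next.
    return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def _character_data(node) -> str:
    # Join Text and CDATASection children; a split CDATA section parses
    # into more than one node.
    kinds = (xml.dom.Node.TEXT_NODE, xml.dom.Node.CDATA_SECTION_NODE)
    return "".join(c.data for c in node.childNodes if c.nodeType in kinds)


def round_trip(text: str) -> tuple:
    # Write the same string three ways, parse it, and read it back.
    source = '<r a="{}">{}<c>{}</c></r>'.format(
        escape_attr(text), escape_text(text), cdata_section(text)
    )
    root = xml.dom.minidom.parseString(source).documentElement
    cdata_elem = root.getElementsByTagName("c")[0]
    return (root.getAttribute("a"), _character_data(root), _character_data(cdata_elem))


if __name__ == "__main__":
    sample = 'a < b && c > d "quoted" ]]> end'
    print(round_trip(sample) == (sample, sample, sample))
```

The sketch does not cover entity declarations (the &apos; and percent cases), and it avoids tabs and newlines in the attribute value, since a parser normalizes those to spaces.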
Received on Friday, 16 April 1999 13:15:00 UTC