Re: "<" and ">" within text nodes from John Cowan on 1999-04-16 (www-dom@w3.org from April to June 1999)

From: John Cowan <cowan@locke.ccil.org>
Date: Fri, 16 Apr 1999 13:00:05 -0400 (EDT)
To: LWatanab@JetForm.com (Larry Watanabe)
Cc: www-dom@w3.org
Message-Id: <199904161700.NAA13389@locke.ccil.org>

Larry Watanabe scripsit:

> These characters can be encoded as "&lt;" and "&gt;", which also requires that
> "&" be encoded as "&amp;". However, this seems like a) an ad hoc solution,
> and b) something which has probably already been solved.

That *is* the solution.  You cannot just blindly write out a Text node;
you must check for & and < and ]]> (a lone > needs no escaping) and make
the correct substitutions, just as you must watch for characters
unrepresentable in the output charset and write character references
for them (unless you are writing UTF-8 or UTF-16).
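
The substitutions described above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions:
escape_text is a hypothetical name, not part of any DOM or SAX API, and
the output encoding is assumed to be given as a Python codec name.  It
does not check for characters that XML forbids outright (most control
characters, for example).

```python
def escape_text(text: str, encoding: str = "utf-8") -> bytes:
    """Serialize the character data of a Text node for XML output.

    escape_text is a hypothetical helper, not part of any DOM API.
    """
    # "&" must be replaced first so that the ampersands introduced by
    # the later substitutions are not escaped a second time.
    escaped = text.replace("&", "&amp;").replace("<", "&lt;")
    # ">" only needs escaping where it would complete "]]>".
    escaped = escaped.replace("]]>", "]]&gt;")
    # Characters outside the output charset become numeric character
    # references; with UTF-8 or UTF-16 nothing needs replacing.
    return escaped.encode(encoding, errors="xmlcharrefreplace")


print(escape_text("a < b & c ]]> d"))
print(escape_text("caf\u00e9", "ascii"))
```

Any conforming parser, SAX or otherwise, turns the references back into
the original characters, so no special decoding routine is needed on
the reading side.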

> Q: Does anyone know of a general encoding routine for encoding the text
> within a Text node that 
> 
> 	a) preserves information; the same text read in by a SAX parser will
> be converted to the correct characters without the use of a special decoding
> routine?
> 	b) handles all other cases besides "<" and ">" if there are any?

No, but it's easy to concoct one along the lines I mention above.
The hardest part is probably finding out what the current character
repertoire (= set of representable characters) for the output is.
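
Where a codec library is available, one way around enumerating the
repertoire is to ask the encoder itself.  The following Python sketch
assumes the output charset is known by its codec name; is_representable
is a hypothetical helper name.  An unknown codec name raises
LookupError rather than returning False.

```python
def is_representable(ch: str, encoding: str) -> bool:
    """Return True if ch can be written directly in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        # The codec has no byte sequence for this character, so the
        # writer must emit a character reference instead.
        return False


for ch in ("e", "\u00e9", "\u20ac"):
    print(repr(ch), is_representable(ch, "latin-1"))
```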

-- 
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.

Received on Friday, 16 April 1999 12:57:28 UTC