Resolving Entities

From: Michael Goldberg <MGoldberg@yet2.com>
Date: Fri, 19 Jan 2001 08:30:53 -0800
Message-ID: <D090FE9586C9D4119E4E00A02493157A052626@ferris.tahoe.yet2.com>
To: "'html-tidy@w3.org'" <html-tidy@w3.org>

Here is my simple input file:

		<span style="font-family:Symbol;">&#212;</span>

When I run this through JTidy, I don't get the results I expect.  Looking at
the Document returned by JTidy, it appears that JTidy is resolving the
&#212; entity reference.  I thought it was supposed to be the trademark
symbol (superscript TM).  However, the following code:

		 Document doc = tidy.parseDOM( inStreamTidy, outStreamTidy );

		 Node span =
		 System.out.println( "Name: " + span.getNodeName() );
		 System.out.println( "Value: *" + span.getNodeValue() + "*" );

Prints the following output:

		Name: #text
		Value: **

Note that the value is not the trademark symbol.  Instead, it is some
strange character.  I thought I could get around the problem by slightly
changing the input.  Rather than "&#212;", I thought I would outsmart the
parser by changing the input to "&amp;#212;".  When I run this input through
JTidy, I get the following output:

		Name: #text
		Value: *&#212;*
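
A minimal sketch of why the first result is not the trademark symbol, using only the JDK and no JTidy calls.  It assumes the parser treats &#212; as a Unicode code point, independent of the font named in the style attribute.  The class name and decodeNumericRef are hypothetical, written only for this illustration:

```java
public class NumericRefDemo {
    // Hypothetical helper (not a JTidy API): decodes one decimal
    // character reference such as "&#212;" into the character it names.
    static String decodeNumericRef(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a decimal character reference: " + ref);
        }
        // The digits between "&#" and ";" are a Unicode code point.
        int codePoint = Integer.parseInt(ref.substring(2, ref.length() - 1));
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String decoded = decodeNumericRef("&#212;");
        // Prints "U+00D4", LATIN CAPITAL LETTER O WITH CIRCUMFLEX.
        System.out.printf("U+%04X%n", decoded.codePointAt(0));
        // Prints "false": the trademark sign is U+2122, not U+00D4.
        System.out.println(decoded.equals("\u2122"));
    }
}
```

So the "strange character" is most likely U+00D4.  The Symbol font presumably draws its own trademark glyph at position 212, which a parser working in Unicode knows nothing about.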

I thought I was getting closer.  However, when I "serialize" the Jtidied
document to a file with the following code:

		OutputFormat format = new OutputFormat( doc );
		format.setIndenting( true );
		FileOutputStream outStream = new FileOutputStream(
			"C:\\WINNT\\Profiles\\mgoldberg\\Personal\\Technology Listing Form\\John\\212\\serialout.html" );
		XMLSerializer serial = new XMLSerializer( outStream, format );
		serial.serialize( doc.getDocumentElement() );

The output looks like the following:

		<?xml version="1.0"?>
		<meta name="generator" content="HTML Tidy, see www.w3.org"
		<span style="font-family:Symbol;">&amp;#212;</span> 

Serializing this document seems to have put the entity reference for the
ampersand back!  My desired result is to have the output of the
serializer match the input going in to JTidy.  I don't want JTidy to resolve
the entity reference, nor do I want the serializer to put entity references
back in.  I can't seem to figure out how to accomplish this.  I've tried
setting various JTidy encodings via setCharEncoding(), but that has not
helped.

Does anyone know how I can prevent the entity references from resolving?
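
One possible workaround, sketched under assumptions rather than offered as a JTidy or serializer feature: let the parser resolve the reference, serialize as usual, and then re-encode every non-ASCII character in the serialized text as a decimal reference.  The class name and escapeNonAscii are hypothetical, and the sample string only stands in for serializer output:

```java
public class ReencodeDemo {
    // Hypothetical helper (not part of JTidy or any serializer): replaces
    // each non-ASCII code point with a decimal numeric character reference.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed stand-in for serialized output containing the resolved character.
        String serialized = "<span style=\"font-family:Symbol;\">\u00D4</span>";
        // Prints: <span style="font-family:Symbol;">&#212;</span>
        System.out.println(escapeNonAscii(serialized));
    }
}
```

This has to run on the serialized text, not on DOM text nodes, since a serializer would escape the ampersand again.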

Received on Friday, 19 January 2001 11:31:41 UTC
