
Resolving Entities

From: Michael Goldberg <MGoldberg@yet2.com>
Date: Fri, 19 Jan 2001 08:30:53 -0800
Message-ID: <D090FE9586C9D4119E4E00A02493157A052626@ferris.tahoe.yet2.com>
To: "'html-tidy@w3.org'" <html-tidy@w3.org>

All,

Here is my simple input file:

		<html>
		<title>212</title>
		<span style="font-family:Symbol;">&#212;</span>
		</html>

When I run this through JTidy, I don't get the results I expect.  Looking at
the Document returned by JTidy, it appears that JTidy is resolving the
&#212; entity reference.  I expected it to be the trademark
symbol (superscript TM).  However, the following code:

		Document doc = tidy.parseDOM( inStreamTidy, outStreamTidy );

		Node span = doc.getDocumentElement().getLastChild().getFirstChild().getFirstChild();
		System.out.println( "Name: " + span.getNodeName() );
		System.out.println( "Value: *" + span.getNodeValue() + "*" );

Prints the following output:

		Name: #text
		Value: **

Note that the value is not the trademark symbol.  Instead, it is some
strange character.  I thought I could get around the problem by slightly
changing the input.  Rather than "&#212;", I thought I would outsmart the
parser by changing the input to "&amp;#212;".  When I run this input through
JTidy, I get the following output:

		Name: #text
		Value: *&#212;*
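
For what it is worth, both outputs are consistent with how numeric character references are defined: &#212; names Unicode code point 212 (U+00D4), not slot 212 of the Symbol font, and the trademark sign itself is U+2122 (&#8482;). In the Symbol font's own encoding, slot 212 appears to be a trademark glyph, which would explain the expectation. A minimal Java sketch to check the code points, using only the JDK (the class name CodePoint212Demo is made up for illustration, and Character.getName needs Java 7 or later):

```java
final class CodePoint212Demo {
    // Formats a code point as "U+XXXX <Unicode character name>".
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // What a parser produces when it resolves &#212;
        System.out.println(describe(212));     // U+00D4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX
        // The character the trademark symbol actually lives at
        System.out.println(describe(0x2122));  // U+2122 TRADE MARK SIGN
    }
}
```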

I thought I was getting closer.  However, when I "serialize" the Jtidied
document to a file with the following code:

		OutputFormat format = new OutputFormat( doc );
		format.setIndenting( true );
		FileOutputStream outStream = new FileOutputStream( "C:\\WINNT\\Profiles\\mgoldberg\\Personal\\Technology Listing Form\\John\\212\\serialout.html" );
		XMLSerializer serial = new XMLSerializer( outStream, format );
		serial.asDOMSerializer();
		serial.serialize( doc.getDocumentElement() );

The output looks like the following:

		<?xml version="1.0"?>
		<html>
		<head>
		<meta name="generator" content="HTML Tidy, see www.w3.org" />
		<title>212</title>
		</head>
		<body>
		<span style="font-family:Symbol;">&amp;#212;</span> 
		</body>
		</html>

Serializing this document seems to have put the entity reference for
the ampersand back in!  My desired result is to have the output of the
serializer match the input going into JTidy.  I don't want JTidy to resolve
the entity reference, nor do I want the serializer to put entity references
back in.  I can't seem to figure out how to accomplish this.  I've tried
setting various JTidy encodings via setCharEncoding(), but that has not
worked.
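
The escaping itself looks like ordinary DOM behaviour rather than something specific to JTidy or to this serializer: a text node whose value contains a literal "&" has to be written out as "&amp;" for the result to stay well-formed. A minimal sketch of that, using only the JDK's own DOM and Transformer classes instead of the XMLSerializer above (the class and method names are made up for illustration):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AmpersandEscapeDemo {
    // Builds <span>text</span> in a fresh DOM and serializes it to a string.
    static String serializeSpan(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element span = doc.createElement("span");
            span.appendChild(doc.createTextNode(text));
            doc.appendChild(span);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // The text node holds the six literal characters & # 2 1 2 ;
        // so the serializer must escape the ampersand.
        System.out.println(serializeSpan("&#212;"));
    }
}
```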

Does anyone know how I can prevent the entity references from resolving?
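
One possible workaround, sketched under the assumption that it is acceptable to let the parser resolve the reference and to re-escape on the way out: post-process the serialized markup and turn every non-ASCII character back into a decimal character reference. The helper below is hypothetical and is not part of JTidy or Xerces:

```java
final class NumericReferenceEscaper {
    // Replaces every character outside 7-bit ASCII with a decimal
    // numeric character reference such as &#212;. ASCII is copied as-is.
    static String escapeNonAscii(String markup) {
        StringBuilder out = new StringBuilder(markup.length());
        for (int i = 0; i < markup.length(); ) {
            int codePoint = markup.codePointAt(i);
            if (codePoint > 0x7F) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String serialized = "<span style=\"font-family:Symbol;\">\u00D4</span>";
        System.out.println(escapeNonAscii(serialized));
        // <span style="font-family:Symbol;">&#212;</span>
    }
}
```

Applied to the serialized output rather than to the DOM text nodes, this avoids having the serializer escape the ampersand a second time.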

Thanks,
Michael
Received on Friday, 19 January 2001 11:31:41 GMT
