- From: Michael Goldberg <MGoldberg@yet2.com>
- Date: Fri, 19 Jan 2001 08:30:53 -0800
- To: "'html-tidy@w3.org'" <html-tidy@w3.org>
All,

Here is my simple input file:

    <html>
    <title>212</title>
    <span style="font-family:Symbol;">Ô</span>
    </html>

When I run this through JTidy, I don't get the results I expect. Looking at the Document returned by JTidy, it appears that JTidy is resolving the Ô entity reference. I thought it was supposed to be the trademark symbol (superscript TM). However, the following code:

    Document doc = tidy.parseDOM( inStreamTidy, outStreamTidy );
    Node span = doc.getDocumentElement().getLastChild().getFirstChild().getFirstChild();
    System.out.println( "Name: " + span.getNodeName() );
    System.out.println( "Value: *" + span.getNodeValue() + "*" );

prints the following output:

    Name: #text
    Value: *Ô*

Note that the value is not the trademark symbol. Instead, it is some strange character.

I thought I could get around the problem by slightly changing the input. Rather than "Ô", I thought I would outsmart the parser by changing the input to "&#212;". When I run this input through JTidy, I get the following output:

    Name: #text
    Value: *Ô*

I thought I was getting closer. However, when I "serialize" the JTidied document to a file with the following code:

    OutputFormat format = new OutputFormat( doc );
    format.setIndenting( true );
    FileOutputStream outStream = new FileOutputStream( "C:\\WINNT\\Profiles\\mgoldberg\\Personal\\Technology Listing Form\\John\\212\\serialout.html" );
    XMLSerializer serial = new XMLSerializer( outStream, format );
    serial.asDOMSerializer();
    serial.serialize( doc.getDocumentElement() );

the output looks like the following:

    <?xml version="1.0"?>
    <html>
    <head>
    <meta name="generator" content="HTML Tidy, see www.w3.org" />
    <title>212</title>
    </head>
    <body>
    <span style="font-family:Symbol;">&#212;</span>
    </body>
    </html>

Serializing this document seems to have put the entity reference for the ampersand back!

My desired result is to have the output of the serializer match the input going into JTidy. I don't want JTidy to resolve the entity reference, nor do I want the serializer to put entity references back in. I can't seem to figure out how to accomplish this. I've tried setting various JTidy encodings via setCharEncoding(), but that has not worked.

Does anyone know how I can prevent the entity references from resolving?

Thanks,
Michael
Received on Friday, 19 January 2001 11:31:41 UTC
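
The round trip described in the message can be reproduced in isolation. The sketch below is only an illustration and makes several assumptions: it uses the JDK's built-in JAXP parser and identity Transformer rather than JTidy and the Xerces XMLSerializer, and the class name EntityRoundTrip and its helper methods are hypothetical. It shows that a DOM text node stores the resolved character, not the reference, and that a serializer escapes a literal ampersand again on output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; stands in for the JTidy + XMLSerializer pipeline.
public class EntityRoundTrip {

    // Parse a small XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Text content of the root element, as the DOM stores it after parsing.
    static String textOf(String xml) throws Exception {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Serialize with the JDK identity transformer, omitting the XML declaration.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // A numeric character reference is resolved by the parser:
        // the text node holds one character with code point 212.
        String resolved = textOf("<span>&#212;</span>");
        System.out.println(resolved.length() + " char, code point " + (int) resolved.charAt(0));

        // An escaped ampersand survives as the literal six characters "&#212;"
        // in the DOM, and the serializer escapes the ampersand again on output.
        String escapedInput = "<span>&amp;#212;</span>";
        System.out.println(textOf(escapedInput));
        System.out.println(serialize(parse(escapedInput)));
    }
}
```

Under these assumptions, the text a DOM hands back never contains the original reference, so any attempt to keep references intact has to happen before parsing or during serialization, not by inspecting node values.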