- From: Khanna, Anuj (IT) <Anuj.Khanna@morganstanley.com>
- Date: Thu, 03 Mar 2005 07:00:14 +0000
- To: <html-tidy@w3.org>
- Message-ID: <061702674E832746AA5CBC50E6DE92A301BE382D@NYWEXMB27.msad.ms.com>
Hi everybody,

I am using JTidy. I want to replace non-ASCII characters (such as £) with their corresponding entity references, because I insert the cleaned HTML string into XML. This works on Windows: JTidy replaces all the special characters with their entity references. However, when I run the same code on Unix, I get ? characters in place of all the special characters. I don't know why this is happening. Can anybody suggest why it behaves this way?

I am using the following code:

    import org.w3c.tidy.*;

    class TestJTidy {

        public static void main(String[] args) {
            System.out.println(cleanData("!\"#$%&'()*+,-./-0123456789:;<=>?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_`abcdefghijklmno-pqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ-‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯-°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîï-ðñòóôõö÷øùúûüýþÿ"));
            //System.out.println(encodeHTMLEntitiesUTF8("Testing £this is space test"));
        }

        public static String cleanData(String fname) {
            java.io.ByteArrayOutputStream bos = new java.io.ByteArrayOutputStream();
            try {
                java.io.FileOutputStream fos = new java.io.FileOutputStream("C:\\Out.xml");
                Tidy td = new Tidy();
                td.setXmlTags(false);
                td.setDocType("omit");
                td.setTidyMark(false);
                td.setNumEntities(true);   // ask for numeric entity references
                td.setEncloseText(true);
                /*td.setWord2000(true);
                td.setDropEmptyParas(true);
                td.setDropFontTags(true);*/
                td.setXmlOut(true);
                // Parse once into the in-memory buffer and once into the file.
                td.parse(new java.io.ByteArrayInputStream(fname.getBytes()), bos);
                td.parse(new java.io.ByteArrayInputStream(fname.getBytes()), fos);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return bos.toString();
        }
    }

On the Windows platform I get the following output:

<html>
<head>
<title/>
</head>
<body>
<p>
!"#$%&'()*+,-./-0123456789:;<=>?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\]^_`abcdefghijklmno-pqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ-‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯-°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîï-ðñòóôõö÷øùúûüýþÿ</p>
</body>
</html>

However, on the Unix platform I get the following output:

<html>
<head>
<title/>
</head>
<body>
<p>
!\"#$%&<>'()*+,-./-0123456789:;=?    
test<br/>
@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_<br/>
`abcdefghijklmno-pqrstuvwxyz{|}~<br/>
?????????????-??????????????<br/>
????????????­??-????????????????<br/>
????????????????-????????????????<br/>
????????????????-????????????????</p>
</body>
</html>

Any help on this would be appreciated.
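
One possible explanation for the ? characters, offered as an assumption rather than a confirmed diagnosis: `fname.getBytes()` and `bos.toString()` in the code above take no charset argument, so both use the platform default charset. If the Unix JVM runs with an ASCII default (for example under a C/POSIX locale), every non-ASCII character is replaced by ? before JTidy ever sees it. The sketch below reproduces that effect in plain Java without JTidy. `CharsetSketch`, `roundTrip` and `escapeNonAscii` are hypothetical names introduced only for illustration, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of JTidy.
class CharsetSketch {

    // Encode the text with an explicit charset, then decode it with the same one.
    // With US-ASCII, String.getBytes substitutes '?' for unmappable characters.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // Replace every code point above 127 with a decimal numeric character
    // reference. Markup characters such as & and < are left untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the sample independent of the source file encoding.
        String sample = "\u00A3 and \u20AC";

        // What an ASCII default charset would do to the input.
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));

        // An explicit UTF-8 round trip preserves the characters; print them
        // as numeric references so the console encoding does not matter.
        System.out.println(escapeNonAscii(roundTrip(sample, StandardCharsets.UTF_8)));
    }
}
```

If this is indeed the cause, passing an explicit charset to `getBytes` and `toString` in `cleanData`, and making sure JTidy's own input and output encoding settings agree with it, would be the place to start. Checking how the source file itself is compiled on Unix may also matter, since the test string contains literal non-ASCII characters.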
Received on Wednesday, 23 March 2005 04:21:22 UTC