- From: Khanna, Anuj (IT) <Anuj.Khanna@morganstanley.com>
- Date: Thu, 03 Mar 2005 07:00:14 +0000
- To: <html-tidy@w3.org>
- Message-ID: <061702674E832746AA5CBC50E6DE92A301BE382D@NYWEXMB27.msad.ms.com>
Hi everybody,
I am using JTidy. I want to replace non-ASCII characters (£ etc.) with their corresponding entity references, since I insert the cleaned HTML string into XML. This works on Windows: JTidy replaces all the special characters with their entity references. However, when I run the same code on Unix, the result is wrong: I get ? characters in place of all the special characters. I don't know why this is happening. Can anybody suggest why it behaves this way?
I am using the following code:
import org.w3c.tidy.*;
class TestJTidy
{
public static void main(String[] args)
{
System.out.println(cleanData("!\"#$%&'()*+,-./-0123456789:;<=>?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_`abcdefghijklmno-pqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ-‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯-°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîï-ðñòóôõö÷øùúûüýþÿ"));
//System.out.println(encodeHTMLEntitiesUTF8("Testing £this is space test"));
}
public static String cleanData(String fname) {
java.io.ByteArrayOutputStream bos = new java.io.ByteArrayOutputStream();
try{
java.io.FileOutputStream fos = new java.io.FileOutputStream("C:\\Out.xml");
Tidy td = new Tidy();
td.setXmlTags(false);
td.setDocType("omit");
td.setTidyMark(false);
td.setNumEntities(true);
td.setEncloseText(true);
/*td.setWord2000(true);
td.setDropEmptyParas(true);
td.setDropFontTags(true);*/
td.setXmlOut(true);
td.parse(new java.io.ByteArrayInputStream(fname.getBytes()),bos).toString();
td.parse(new java.io.ByteArrayInputStream(fname.getBytes()),fos);
}catch(Exception e){
e.printStackTrace();
}
return bos.toString();
}
}
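One thing I suspect, but have not confirmed, is the platform default charset: both fname.getBytes() and bos.toString() use it implicitly. If the Unix JVM runs in a C/POSIX locale, the default may be US-ASCII, and any character US-ASCII cannot represent is encoded as ? before JTidy ever sees it. The sketch below simulates that by naming the charsets explicitly; the class and method names are made up for illustration and are not part of JTidy.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Encode and then decode with the same charset, mimicking
    // fname.getBytes() followed by bos.toString() under a given default.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String input = "price: \u00a3 5"; // \u00a3 is the pound sign

        // Shows which charset the no-argument getBytes()/toString() would use here.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent the pound sign, so it is replaced with '?'.
        System.out.println(roundTrip(input, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the text survives the round trip.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8));
    }
}
```

If the first line printed on the Unix machine is US-ASCII (or ANSI_X3.4-1968), that would be consistent with the output I am seeing.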
On Windows I get the following output:
<html>
<head>
<title/>
</head>
<body>
<p>
!"#$%&'()*+,-./-0123456789:;<=>?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\]^_`abcdefghijklmno-pqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ-‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯-°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîï-ðñòóôõö÷øùúûüýþÿ</p>
</body>
</html>
However, on Unix I get the following output:
<html>
<head>
<title/>
</head>
<body>
<p>
!\"#$%&<>'()*+,-./-0123456789:;=?    
test<br/>
@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_<br/>
`abcdefghijklmno-pqrstuvwxyz{|}~<br/>
?????????????-??????????????<br/>
????????????­??-????????????????<br/>
????????????????-????????????????<br/>
????????????????-????????????????</p>
</body>
</html>
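To make the goal concrete, this is roughly the transformation I am after, written as a stand-alone helper that does not use JTidy at all. It is only a sketch: the class and method names are hypothetical, and it leaves &, < and > untouched, so it is not a complete XML escaper.

```java
public class NumericEntities {
    // Replace every non-ASCII code point with a decimal numeric
    // character reference; ASCII characters are copied unchanged.
    static String toNumericEntities(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by 1 or 2 chars so surrogate pairs are handled as one code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign, code point 163.
        System.out.println(toNumericEntities("Testing \u00a3 this")); // prints: Testing &#163; this
    }
}
```

Because the result contains only ASCII, it should not depend on the platform default charset when it is later converted to bytes.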
Any help on this would be appreciated.
Received on Wednesday, 23 March 2005 04:21:22 UTC