Special Characters issue with JTidy on Unix

From: Khanna, Anuj (IT) <Anuj.Khanna@morganstanley.com>
Date: Thu, 03 Mar 2005 07:00:14 +0000
Message-ID: <061702674E832746AA5CBC50E6DE92A301BE382D@NYWEXMB27.msad.ms.com>
To: <html-tidy@w3.org>
Hi everybody,
    I am using JTidy. I want to replace non-ASCII characters (such as £) with their corresponding entity references, because I insert the cleaned HTML string that JTidy returns into XML. This works on Windows: JTidy replaces all the special characters with their entity references. However, when I run the same code on a Unix platform, I get ? characters in place of all the special characters. I don't know why this is happening. Can anybody suggest why it behaves this way?
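For reference, the transformation being asked for can be sketched in plain Java, independent of JTidy. This is only an illustration of the desired result (every non-ASCII character becomes a decimal numeric character reference); the class and method names are hypothetical, and it does not escape &, < or > the way JTidy does.

```java
// Hypothetical helper, not part of JTidy: replaces every character outside
// the ASCII range with a decimal numeric character reference.
final class NumericEntities {
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);      // reads a full code point, including surrogate pairs
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);       // ASCII is copied unchanged
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, \u20AC the euro sign; the escapes keep
        // the source file itself pure ASCII.
        System.out.println(escape("Price: \u00A3 5 / \u20AC 6"));
    }
}
```

For the pound and euro signs this yields &#163; and &#8364;, the same references that appear in the Windows output.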
 
I am using the following code:
 
import org.w3c.tidy.*;
class TestJTidy 
{
 public static void main(String[] args) 
 {
  System.out.println(cleanData("!\"#$%&'()*+,-./-0123456789:;<=>?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_`abcdefghijklmno-pqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ-‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯-°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîï-ðñòóôõö÷øùúûüýþÿ"));
  //System.out.println(encodeHTMLEntitiesUTF8("Testing &nbsp; £this is space test"));
  
 
 }
 
 // Runs the input through JTidy and returns the cleaned markup as a string.
 public static String cleanData(String fname) {
  java.io.ByteArrayOutputStream bos = new java.io.ByteArrayOutputStream();
  try {
   java.io.FileOutputStream fos = new java.io.FileOutputStream("C:\\Out.xml");

   Tidy td = new Tidy();
   td.setXmlTags(false);
   td.setDocType("omit");
   td.setTidyMark(false);
   td.setNumEntities(true);
   td.setEncloseText(true);
   /*td.setWord2000(true);
   td.setDropEmptyParas(true);
   td.setDropFontTags(true);*/
   td.setXmlOut(true);
   // getBytes() with no argument encodes with the platform default charset
   td.parse(new java.io.ByteArrayInputStream(fname.getBytes()), bos);
   td.parse(new java.io.ByteArrayInputStream(fname.getBytes()), fos);
   fos.close();
  } catch (Exception e) {
   e.printStackTrace();
  }
  // toString() with no argument decodes with the platform default charset
  return bos.toString();
 }
}

 
On the Windows platform I get the following output:
 
<html>

<head>

<title/>

</head>

<body>

<p>

!"#$%&amp;'()*+,-./-0123456789:;&lt;=&gt;?@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\]^_`abcdefghijklmno-pqrstuvwxyz{|}~&#8364;&#8218;&#402;&#8222;&#8230;&#8224;&#8225;&#710;&#8240;&#352;&#8249;&#338;&#381;-&#8216;&#8217;&#8220;&#8221;&#8226;&#8211;&#8212;&#732;&#8482;&#353;&#8250;&#339;&#382;&#376;&#161;&#162;&#163;&#164;&#165;&#166;&#167;&#168;&#169;&#170;&#171;&#172;&#173;&#174;&#175;-&#176;&#177;&#178;&#179;&#180;&#181;&#182;&#183;&#184;&#185;&#186;&#187;&#188;&#189;&#190;&#191;&#192;&#193;&#194;&#195;&#196;&#197;&#198;&#199;&#200;&#201;&#202;&#203;&#204;&#205;&#206;&#207;-&#208;&#209;&#210;&#211;&#212;&#213;&#214;&#215;&#216;&#217;&#218;&#219;&#220;&#221;&#222;&#223;&#224;&#225;&#226;&#227;&#228;&#229;&#230;&#231;&#232;&#233;&#234;&#235;&#236;&#237;&#238;&#239;-&#240;&#241;&#242;&#243;&#244;&#245;&#246;&#247;&#248;&#249;&#250;&#251;&#252;&#253;&#254;&#255;</p>

</body>

</html>

 
 
 
 
 
However, on the Unix platform I get the following output:
 
<html>
<head>
<title/>
</head>
<body>
<p>
!\"#$%&amp;&lt;&gt;'()*+,-./-0123456789:;=?&#160;&#160;&#160;&#160;
test<br/>
@ABCDEFGHIJKLMNO-PQRSTUVWXYZ[\\]^_<br/>
`abcdefghijklmno-pqrstuvwxyz{|}~<br/>
?????????????-??????????????<br/>
????????????&#173;??-????????????????<br/>
????????????????-????????????????<br/>
????????????????-????????????????</p>
</body>
</html>
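One possible explanation, offered as an assumption rather than a confirmed diagnosis: `fname.getBytes()` encodes the string with the platform default charset. On some Unix setups (for example a JVM started under the C or POSIX locale) that default may be an ASCII charset, and Java replaces every character that charset cannot represent with `?` before JTidy ever sees the input. The sketch below reproduces that effect without JTidy; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what String.getBytes does to characters
// that the chosen charset cannot represent.
final class CharsetDemo {
    // Encodes the text with the given charset, then decodes it with the same one.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // pound sign, written as an escape to keep the source ASCII

        // The charset that a bare getBytes() call would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII has no pound sign, so the encoder substitutes '?'.
        System.out.println(roundTrip(pound, StandardCharsets.US_ASCII)); // prints "?"

        // UTF-8 can represent it, so the round trip is lossless.
        System.out.println(roundTrip(pound, StandardCharsets.UTF_8).equals(pound)); // prints "true"
    }
}
```

If the first line printed on the Unix machine names an ASCII charset, passing an explicit charset to `getBytes` and setting JTidy's input and output encoding to match would be the thing to try; the exact encoding setters differ between JTidy releases, so check the API of the version in use.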

 

 

Any help on this would be appreciated.
Received on Wednesday, 23 March 2005 04:21:22 GMT
