Re: Special Characters issue with JTidy on Unix

From: Fred Bone <Fred.Bone@dial.pipex.com>
Date: Wed, 23 Mar 2005 09:48:39 -0000
To: html-tidy@w3.org
Message-ID: <171DC460B5@Fred.BritishLibrary.net>

On 3 Mar 2005 at 7:00, Khanna, Anuj (IT) said:

>     I am using JTidy. I want to replace the non-ASCII characters
> (like £ etc.) with their corresponding entity references, since I
> insert all the returned cleaned HTML strings into XML. I am
> able to do this successfully on Windows platforms: JTidy replaces all
> the special characters with their entity references. However, when I
> try the same code on a Unix platform the result is not what I want: I
> get ??????? characters in place of all the special characters. I don't
> know why this is happening. Can anybody suggest why it is behaving
> this way?

If it's the same code, then the difference has to be in the Java or OS 
environment. Clearly a different assumption is being made about the 
encoding of your source data. Windows is assuming an 8-bit encoding, 
probably your default Windows codepage, and Unix is assuming something 
different where the bytes don't decode properly.
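
To illustrate the point, here is a minimal sketch in plain Java (no
JTidy calls), assuming the source data is ISO-8859-1. Your actual code
isn't shown, so the class and method names here are hypothetical. It
names the charset explicitly instead of relying on the platform
default, and shows how a character that an ASCII-only encoding cannot
represent ends up as "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JTidy or of the original poster's code.
class EncodingDemo {

    // Decode raw bytes with an explicitly named charset,
    // not whatever the platform default happens to be.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Replace every non-ASCII code point with a numeric character reference.
    static String toEntities(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    // Encode to US-ASCII and decode again: unmappable characters become '?'.
    static String lossyAsciiRoundTrip(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 0xA3 is the pound sign in ISO-8859-1 (assumed source encoding).
        byte[] raw = { (byte) 0xA3, '5' };

        String text = decode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(toEntities(text));          // prints "&#163;5"
        System.out.println(lossyAsciiRoundTrip(text)); // prints "?5"

        // The value that differs between the Windows and Unix machines.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

If the last line prints different values on the two machines, any
stream or String conversion that doesn't name its charset will behave
differently on each.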

But are you quite sure it's the same code? The "Windows" output seems 
consistent with the code, but the "Unix" output has a significantly 
different sequence of characters:

Windows starts  !"#$%&amp;'()*+,-./
Unix starts     !\"#$%&amp;&lt;&gt;'()

and there are more differences further on.
Received on Wednesday, 23 March 2005 09:49:36 GMT
