converting foreign characters

From: Bert Van Kets <bert@vankets.com>
Date: Sat, 25 May 2002 09:04:56 +0200
To: html-tidy@w3.org
Hi all,
I am using JTidy to convert a block of HTML to XHTML in Apache Cocoon, and I 
am having two problems with it.

1. When the string to be parsed contains incorrectly escaped (" ') or 
non-ASCII (>127) characters, they are not converted to their escaped HTML form.
Is there a Tidy setting for this, or do I have to build a Dictionary myself?  I 
suppose JTidy has some correction built in for this, since it must be 
a very common mistake.
I'm using a browser based html editor that's very simple to use, but does 
not convert the non-ascii characters correctly.
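
For illustration, the "Dictionary" fallback from point 1 could be as small as the sketch below. It is plain Java and does not use JTidy at all; EntityEscaper and escapeNonAscii are hypothetical names. It only rewrites characters above 127 as numeric character references; quotes and ampersands would need separate handling.

```java
public final class EntityEscaper {
    private EntityEscaper() {}

    /**
     * Replaces every code point above 127 with a numeric character
     * reference such as &#233; and leaves ASCII characters untouched.
     */
    public static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: caf&#233; cr&#232;me
        System.out.println(escapeNonAscii("caf\u00e9 cr\u00e8me"));
    }
}
```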

2. JTidy adds html, head, title and body tags. (I can remove them with 
XSLT, but that's messy.)
Does JTidy *always* create full (X)HTML pages?
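
If JTidy does always produce a full page, one alternative to XSLT might be to serialise only the children of the body element from the DOM that parseDOM returns. The following is a minimal sketch using only the JDK's own XML classes; BodyFragment and bodyContent are hypothetical names. It assumes the Document behaves like a standard DOM, which I have not verified for JTidy's own DOM implementation.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public final class BodyFragment {
    private BodyFragment() {}

    /**
     * Serialises the child nodes of the first body element, without the
     * surrounding html/head/body tags and without an XML declaration.
     * Returns an empty string if the document has no body element.
     */
    public static String bodyContent(Document doc) throws TransformerException {
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return "";
        }
        // Identity transformer: copies each node to the output unchanged.
        Transformer copier = TransformerFactory.newInstance().newTransformer();
        copier.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        NodeList children = bodies.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            copier.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }
}
```

With the code from my XSP page this would be called as BodyFragment.bodyContent(doc) after parseDOM has returned.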

Here's the code from my XSP page:

       String strContent = request.getParameter("content");
       ByteArrayInputStream in = new ByteArrayInputStream( strContent.getBytes() );
       String strOut = "";
       org.w3c.dom.Document doc = null;
       org.w3c.tidy.Configuration conf = new org.w3c.tidy.Configuration();
       try {
         Tidy tidy = new Tidy();

         //create output as XML

         //output should be XHTML conforming

         tidy.setCharEncoding( conf.UTF8 );

         //do not output 'non-breaking space' as entity.

         //output naked ampersand as &amp;

         //drop presentation tags

         //parse the stream to a DOM document
         doc =  tidy.parseDOM(in, null);
       } catch (Exception e) {
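
One thing worth checking in the code above: strContent.getBytes() encodes with the platform default charset, while Tidy is told to treat the input as UTF-8. If the two differ, non-ASCII characters would be misread before Tidy ever processes them. Below is a sketch of building the input stream with an explicit charset. Utf8Input and its methods are hypothetical names, and whether this mismatch is the actual cause of problem 1 is only an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public final class Utf8Input {
    private Utf8Input() {}

    /** Encodes the string as UTF-8, independent of the platform default charset. */
    public static byte[] utf8Bytes(String content) {
        return content.getBytes(StandardCharsets.UTF_8);
    }

    /** Wraps the UTF-8 bytes in a stream suitable for a parser that expects UTF-8. */
    public static InputStream toUtf8Stream(String content) {
        return new ByteArrayInputStream(utf8Bytes(content));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) takes two bytes in UTF-8: 0xC3 0xA9.
        byte[] bytes = utf8Bytes("\u00e9");
        System.out.println(bytes.length); // prints: 2
    }
}
```

In the XSP code this would replace the ByteArrayInputStream line with InputStream in = Utf8Input.toUtf8Stream(strContent);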

I may have more settings than I need; the code grew as 
I was trying to get the output right.
Any help is welcome.
Received on Saturday, 25 May 2002 03:10:00 UTC
