
converting foreign characters (fwd)

From: Dave Raggett <dsr@w3.org>
Date: Sat, 25 May 2002 14:22:45 +0100 (BST)
To: html-tidy@w3.org
cc: Bert Van Kets <bert@vankets.com>
Message-ID: <Pine.LNX.4.44.0205251422010.2035-100000@hazel>
Forwarded for group comment.

---------- Forwarded message ----------
Date: Fri, 24 May 2002 22:32:18 +0200
From: Bert Van Kets <bert@vankets.com>
To: Dave Raggett <dsr@w3.org>
Subject: converting foreign characters

Hi all,
I am using JTidy to convert a block of html to xhtml in Apache Cocoon.  I 
am having two problems with this.

1. When the string to be parsed contains incorrectly escaped (" ') or 
non-ASCII (>127) characters, they are not converted to their escaped HTML form.
Is there a Tidy setting for this, or do I have to build a Dictionary for it?  I 
suppose JTidy must have some correction built in for this, since it must be 
a very common mistake.
I'm using a browser based html editor that's very simple to use, but does 
not convert the non-ascii characters correctly.
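
As a fallback for problem 1, if no Tidy setting turns out to do this, the non-ASCII characters can be rewritten as numeric character references without a lookup table. The sketch below is an illustration only: the class and method names (CharRefEscaper, escapeNonAscii) are hypothetical, it is not part of JTidy, and it deliberately leaves quotes and other ASCII characters alone.

```java
final class CharRefEscaper {

    /**
     * Replaces every code point above 127 with a decimal numeric
     * character reference (for example U+00E9 becomes "&#233;").
     * ASCII characters, including quotes, are copied unchanged.
     */
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt handles surrogate pairs as a single character
            int cp = input.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Numeric references need no entity dictionary and are valid in both HTML and XHTML, which is why this sketch uses them rather than named entities.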

2. JTidy adds html, head, title and body tags (I can remove them with 
XSLT, but that's messy).
Does JTidy *always* create full (X)HTML pages?
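
One possible alternative to the XSLT step for problem 2 is to serialize only the children of the body element from the DOM document. The sketch below assumes the Document behaves like a standard org.w3c.dom tree; that has not been verified against the DOM that JTidy returns. The names BodyFragment, bodyContent and parseStandIn are hypothetical, and parseStandIn only stands in for tidy.parseDOM so the example runs without JTidy.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BodyFragment {

    /** Serializes the children of the first body element, without the wrapper tags. */
    static String bodyContent(Document doc) {
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return "";
        }
        try {
            // Identity transform, used here only as a DOM serializer
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (Node child = bodies.item(0).getFirstChild();
                 child != null;
                 child = child.getNextSibling()) {
                t.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize body content", e);
        }
    }

    /** Stand-in for tidy.parseDOM(in, null): parses well-formed XHTML with the JDK parser. */
    static Document parseStandIn(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseStandIn(
            "<html><head><title>t</title></head><body><p>Hi</p><p>there</p></body></html>");
        System.out.println(bodyContent(doc));
    }
}
```

In the XSP page, the document returned by tidy.parseDOM would be passed to bodyContent in place of the stand-in parser.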

Here's the code from my XSP page:

       String strContent = request.getParameter("content");
       ByteArrayInputStream in = new ByteArrayInputStream( strContent.getBytes() );
       String strOut = "";
       org.w3c.dom.Document doc = null;
       org.w3c.tidy.Configuration conf = new org.w3c.tidy.Configuration();
       try {
         Tidy tidy = new Tidy();

         //create output as XML

         //output should be XHTML conforming

         tidy.setCharEncoding( conf.UTF8 );

         //do not output 'non-breaking space' as entity.

         //output naked ampersand as &amp;

         //drop presentation tags

         //parse the stream to a DOM document
         doc =  tidy.parseDOM(in, null);
       } catch (Exception e) {

I may have too many settings, but the code grew as 
I was trying to get the output right.
Any help is welcome.
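
One detail in the code above may be contributing to problem 1: strContent.getBytes() encodes with the platform default charset, while Tidy is told to read UTF-8. If the two differ, non-ASCII characters can be misread before Tidy gets a chance to escape them. A minimal sketch of encoding the input explicitly follows; the class and method names (Utf8Input, utf8Stream) are hypothetical, and StandardCharsets requires Java 7 or later (on older JVMs, getBytes("UTF-8") is the equivalent).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

final class Utf8Input {

    /** Wraps the string's UTF-8 bytes in a stream, independent of the platform default charset. */
    static ByteArrayInputStream utf8Stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }
}
```

The returned stream would replace the ByteArrayInputStream built from strContent.getBytes() before it is handed to tidy.parseDOM. Whether this alone resolves the problem also depends on how the request parameter itself was decoded.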
Received on Saturday, 25 May 2002 09:22:53 UTC
