
converting foreign characters

From: Bert Van Kets <bert@vankets.com>
Date: Sat, 25 May 2002 09:04:56 +0200
Message-Id: <5.1.0.14.0.20020525090419.038f88a0@mail.visitronics.be>
To: html-tidy@w3.org

Hi all,
I am using JTidy to convert a block of html to xhtml in Apache Cocoon.  I 
am having two problems with this.

1. When the string to be parsed contains improperly escaped (" ') or 
non-ASCII (>127) characters, they are not converted to their escaped HTML form.
Is there a Tidy setting for this, or do I have to build a Dictionary myself?  I 
suppose JTidy must have some correction built in, since this must be 
a very common mistake.
I'm using a browser-based HTML editor that's very simple to use, but it does 
not convert non-ASCII characters correctly.

2. JTidy adds html, head, title and body tags (I can remove them with 
XSLT, but that's messy).
Does JTidy *always* create full (X)HTML pages?
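
To make problem 1 concrete, the conversion I am after is roughly the sketch below. The class and method names (NonAsciiEscaper, escape) are made up for illustration and are not part of JTidy. The sketch only rewrites characters above 127 as numeric character references; it leaves quotes alone, since in a block of markup they also delimit attribute values.

```java
// Hypothetical helper, not a JTidy API: rewrites every character above
// 127 as a numeric character reference and passes ASCII through as-is.
final class NonAsciiEscaper {
    private NonAsciiEscaper() {}

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            // Read whole code points so surrogate pairs become one reference
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }
}
```

Running the editor output through something like NonAsciiEscaper.escape(strContent) before parsing would make the input pure ASCII, but I would rather have Tidy do this itself if it can.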

Here's the code from my XSP page:

       String strContent = request.getParameter("content");
       String strOut = "";
       org.w3c.dom.Document doc = null;
       try {
         //encode the input as UTF-8 so it matches the encoding set on Tidy below
         ByteArrayInputStream in =
             new ByteArrayInputStream( strContent.getBytes("UTF-8") );
         Tidy tidy = new Tidy();

         //create output as XML
         tidy.setXmlOut(true);

         //output should be XHTML conforming
         tidy.setXHTML(true);

         tidy.setBreakBeforeBR(false);
         tidy.setRawOut(false);
         tidy.setCharEncoding( org.w3c.tidy.Configuration.UTF8 );

         //output 'non-breaking space' as an entity
         tidy.setQuoteNbsp(true);

         //output naked ampersand as &amp;
         tidy.setQuoteAmpersand(true);

         //pass whitespace inside attribute values through unchanged
         tidy.setLiteralAttribs(true);

         //parse the stream to a DOM document
         doc =  tidy.parseDOM(in, null);
       } catch (Exception e) {
         //do not swallow failures silently
         e.printStackTrace();
       }

It's possible that I have too many settings, but the code grew as 
I was trying to get the output right.
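
For problem 2, the only non-XSLT fallback I can think of is to pull the children of the body element out of the DOM and serialize just those. The sketch below uses only the standard JDK transformer; the class and method names (BodyFragment, bodyContent) are hypothetical, and it assumes the DOM handed to it is one the JDK identity transformer can serialize, which I have not verified for the DOM that JTidy returns.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a JTidy API: serializes only the children of
// the first <body> element, dropping the html/head/title/body wrapper.
final class BodyFragment {
    private BodyFragment() {}

    static String bodyContent(Document doc) throws TransformerException {
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return ""; // no body element, nothing to extract
        }
        // Identity transform, without an XML declaration on each node
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        for (Node child = bodies.item(0).getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            t.transform(new DOMSource(child), new StreamResult(out));
        }
        return out.toString();
    }
}
```

With the code above this would be called as strOut = BodyFragment.bodyContent(doc), but a Tidy option that skips the wrapper in the first place would be much cleaner.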
Any help is welcome.
Bert
Received on Saturday, 25 May 2002 03:10:00 GMT
