Java HTML Tidy

Updated 12 Feb 2000

Thanks to the people at Docuverse for assistance with DOM support for Java HTML Tidy.

I have made available:

a zip file containing the Tidy source, documentation, and Bean, and
a gzipped tar file of the same.

To use the Tidy Java Bean, just include JTidy\lib\Tidy.jar in your classpath.

To build Tidy from the source, you need a Java compiler and runtime environment supporting Java 1.1 or higher. First, download and expand the archive. On Win 9x/NT, build it using the batch file JTidy\make\build.bat as follows:


    cd JTidy\make
    build c: 13jan2000

Here, c: is the root where you expanded the JTidy archive, and 13jan2000 is the directory under JTidy\src that contains the source. NOTE: build.bat assumes that the environment variable java_home points to your JDK installation and that the JDK tools are in your path.

For Unix environments (either Cygwin or a true Unix), use the makefile in JTidy\make.

The main class is: org.w3c.tidy.Tidy
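
One way to confirm that Tidy.jar really is on the classpath is to try loading the main class by name. The sketch below is only an illustration: the class name ClasspathCheck and the method isOnClasspath are made up for this example and are not part of Java Tidy.

```java
/**
 * Sketch: report whether a class can be loaded from the current classpath.
 * ClasspathCheck and isOnClasspath are illustrative names, not part of Java Tidy.
 */
class ClasspathCheck {

    /** Returns true if the named class can be loaded, false otherwise. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            // No class file with this name was found on the classpath.
            return false;
        } catch (LinkageError e) {
            // A class file was found but could not be linked or initialized.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the Tidy main class; another class name may be passed instead.
        String name = (args.length > 0) ? args[0] : "org.w3c.tidy.Tidy";
        if (isOnClasspath(name)) {
            System.out.println(name + " was found on the classpath");
        } else {
            System.out.println(name + " was not found; check that Tidy.jar is on the classpath");
        }
    }
}
```

Run it with the same classpath setting you intend to use for your own program, so that the result reflects that environment.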

What's New

18 Jun 1999 - Java Tidy is now a Java Bean.
10 Jul 1999 - Updated Java Tidy to match HTML Tidy Release 7 Jul 1999
17 Jul 1999 - Fixed some bugs. Added code examples to document.
27 Jul 1999 - Fixed bug in Node.clone()
30 Jul 1999 - Updated Java Tidy to match HTML Tidy Release 26 Jul 1999. Repackaged Java Tidy.
28 Aug 1999 - Changed property docTypeStr to docType; it is now handled the same way as the configuration file 'doctype' string. Fixed potential IndexOutOfBoundsExceptions in Clean.createProps.
3 Sep 1999 - Fixed a thread-safety issue in the ParserImpl class. Added InputStreamName property to Bean. Attempted a speed optimization in Lexer.wstrcasecmp.
23 Sep 1999 - Applied bug fix from tidy mailing list (15 Aug 1999) to ParserImpl.ParseList.
2 Oct 1999 - Added limited DOM support. Basically, all you can do right now is read elements and attributes of the parse tree. You cannot modify the parse tree in any way. I have made fields of org.w3c.tidy.Node protected, as I would like to phase out external use of this class, and phase in DOM-style access to the parse tree. Below is a code example of DOM-style parse tree traversal and printing. Note that Java Tidy cannot yet be called DOM-compliant, but it's getting there.
23 Oct 1999 - Updated Java Tidy to match HTML Tidy Release 27 Sep 1999. The following new features of C HTML Tidy are NOT supported by Java Tidy: (1) the "keep-time" option for preserving file times, and (2) the new command-line option parsing that supports parsing options prefixed by "--" in the same way as parsing from the configuration file. The reason for (1) is that the core Java API doesn't support altering file modification times. The reason for (2) is that I implemented the configuration file as a properties file, and as such the option parsing code cannot be re-used for command-line option parsing. This means I need an independent method to parse options from the command line. It's on my list.
23 Oct 1999 - Moved TidyMessages.properties to the org.w3c.tidy package. Propagated MissingResourceException in static initializer of Report as an Error since it represents a severe error.
1 Nov 1999 - Updated Java Tidy to match HTML Tidy Release 22 Oct 1999. This seems to fix some severe bugs (such as infinite loops) that were present in the 27 Sep 1999 update. However, my testing has revealed that there are still some severe bugs, so I have included the 4sep1999dom source tree with this release in case you want to stay at a stable release until the severe bugs are fixed.
7 Nov 1999 - Changed Lexer.lexbuf to type byte[]. Tidy stores lexbuf as UTF-8 encoded bytes, so converting byte sequences from lexbuf to Strings must take the UTF-8 encoding into account. Representing it as a char[] was also a waste of memory. Thanks to Mark Diekhans for contributing this change.
16 Nov 1999 - Minor bug fix with UTF8 encoding string. Added makefile.
6 Dec 1999 - Updated Java Tidy to match HTML Tidy Release 30 Nov 1999.
7 Dec 1999 - Fixed bug in Lexer.getToken.
29 Dec 1999 - More DOM support.
22 Jan 2000 - Updated Java Tidy to match HTML Tidy Release 13 Jan 2000. Included CM_HEAD fix from Dave Raggett (posted to HTML Tidy mailing list) for object entry in TagTable. Fixed cause of NullPointerExceptions in Node.insertNodeAfterElement.
12 Feb 2000 - Fixed bug in Lexer.addGenerator(). Removed references to Lexer.lexbuf from class PPrint and other classes, using node.textarray instead. This is in preparation for more DOM implementation.
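
The 7 Nov 1999 entry above notes that bytes taken from a UTF-8 buffer must be decoded as UTF-8 when they are turned into a String. The sketch below illustrates that point with a small stand-in buffer; it is not the Lexer code. The class Utf8Slice and its methods are hypothetical, and it uses java.nio.charset.StandardCharsets from recent JDKs rather than the Java 1.1 API.

```java
import java.nio.charset.StandardCharsets;

/**
 * Sketch: converting a slice of a UTF-8 byte buffer to a String.
 * Utf8Slice, decode and naive are illustrative names, not part of Java Tidy.
 */
class Utf8Slice {

    /** Decodes len bytes of buf, starting at start, as UTF-8. */
    static String decode(byte[] buf, int start, int len) {
        return new String(buf, start, len, StandardCharsets.UTF_8);
    }

    /** Treats every byte as one character; this garbles any multi-byte sequence. */
    static String naive(byte[] buf, int start, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < start + len; i++) {
            sb.append((char) (buf[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, which UTF-8 encodes as two bytes.
        byte[] buf = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes in buffer:   " + buf.length);
        System.out.println("decoded length:    " + decode(buf, 0, buf.length).length());
        System.out.println("byte-per-char len: " + naive(buf, 0, buf.length).length());
    }
}
```

The decoded string has one character fewer than the buffer has bytes, while the byte-per-character conversion produces two wrong characters in place of the accented letter.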

Code example of how to use the Tidy Java Bean


import java.io.IOException;
import java.net.URL;
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.io.FileWriter;
import org.w3c.tidy.Tidy;

/**
 * This program shows how HTML could be tidied directly from
 * a URL stream, and running on separate threads.  Note the use
 * of the 'parse' method to parse from an InputStream, and send
 * the pretty-printed result to an OutputStream.
 * In this example thread th1 outputs XML, and thread th2 outputs
 * HTML.  This shows that properties are per instance of Tidy.
 */

public class Test16 implements Runnable {

    private String url;
    private String outFileName;
    private String errOutFileName;
    private boolean xmlOut;

    public Test16(String url, String outFileName,
                  String errOutFileName, boolean xmlOut)
    {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
        this.xmlOut = xmlOut;
    }

    public void run()
    {
        URL u;
        BufferedInputStream in;
        FileOutputStream out;
        Tidy tidy = new Tidy();

        tidy.setXmlOut(xmlOut);
        try {
            tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
            u = new URL(url);
            in = new BufferedInputStream(u.openStream());
            out = new FileOutputStream(outFileName);
            tidy.parse(in, out);
        }
        catch ( IOException e ) {
            System.out.println( this.toString() + e.toString() );
        }
    }

    public static void main( String[] args ) {
        Test16 t1 = new Test16(args[0], args[1], args[2], true);
        Test16 t2 = new Test16(args[3], args[4], args[5], false);
        Thread th1 = new Thread(t1);
        Thread th2 = new Thread(t2);

        th1.start();
        th2.start();
    }

}

Code example of using Java Tidy as a parser


import java.io.PrintWriter;
import java.io.FileInputStream;
import java.io.IOException;

import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

/**
 * A sample DOM writer. This sample program illustrates how to
 * traverse a DOM tree in order to print a document that is parsed.
 */
public class TestDOM {

   protected PrintWriter out;

   public TestDOM() {
      out = new PrintWriter(System.out);
   }

   /** Prints the specified node, recursively. */
   public void print(Node node) {

      if ( node == null ) {
         return;
      }

      int type = node.getNodeType();
      switch ( type ) {
      case Node.DOCUMENT_NODE:
         out.println("");
         print(((Document)node).getDocumentElement());
         out.flush();
         break;

      case Node.ELEMENT_NODE:
         // Print the start tag with its attributes.
         out.print('<');
         out.print(node.getNodeName());
         NamedNodeMap attrs = node.getAttributes();
         for ( int i = 0; i < attrs.getLength(); i++ ) {
            out.print(' ');
            out.print(attrs.item(i).getNodeName());
            out.print("=\"");
            out.print(attrs.item(i).getNodeValue());
            out.print('"');
         }
         out.print('>');
         out.println(); // HACK
         NodeList children = node.getChildNodes();
         if ( children != null ) {
            int len = children.getLength();
            for ( int i = 0; i < len; i++ ) {
               print(children.item(i));
            }
         }
         break;

      case Node.TEXT_NODE:
         out.print(node.getNodeValue());
         break;

      }

      // Print the end tag once the element's children have been printed.
      if ( type == Node.ELEMENT_NODE ) {
         out.print("</");
         out.print(node.getNodeName());
         out.print('>');
         out.println(); // HACK
      }

      out.flush();

   }

   public static void main(String args[]) {

      if ( args.length == 0 ) {
         System.exit(1);
      }

      System.err.println(args[0]);

      FileInputStream in;
      Tidy tidy = new Tidy();
      TestDOM t = new TestDOM();

      try {
          in = new FileInputStream(args[0]);
          tidy.setMakeClean(true);
          tidy.setXmlTags(true);
          t.print(tidy.parseDOM(in, null));
      }
      catch ( IOException e ) {
          System.err.println( e.toString() );
      }

   }

}
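
The traversal above uses only the org.w3c.dom interfaces, so the same read-only pattern can be tried without Java Tidy on the classpath. The sketch below is an illustration under that assumption: the JDK's own XML parser stands in for tidy.parseDOM, and the class DomWalk with its methods collect and describe is hypothetical. It gathers element names and attributes into a list instead of printing them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Sketch: read-only DOM traversal that collects elements and attributes.
 * DomWalk, collect and describe are illustrative names, not part of Java Tidy.
 */
class DomWalk {

    /** Adds "name attr=value ..." for each element at or below node, in document order. */
    static void collect(Node node, List<String> out) {
        if (node == null) {
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            StringBuilder sb = new StringBuilder(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                sb.append(' ').append(attrs.item(i).getNodeName());
                sb.append('=').append(attrs.item(i).getNodeValue());
            }
            out.add(sb.toString());
        }
        NodeList children = node.getChildNodes();
        if (children != null) {
            for (int i = 0; i < children.getLength(); i++) {
                collect(children.item(i), out);
            }
        }
    }

    /** Parses well-formed XML with the JDK parser (a stand-in for Tidy) and describes it. */
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<String>();
            collect(doc, result);
            return result;
        } catch (Exception e) {
            throw new RuntimeException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><a href=\"x.html\">link</a></body></html>";
        for (String line : describe(xml)) {
            System.out.println(line);
        }
    }
}
```

To apply the same walk to a tidied document, the Document returned by tidy.parseDOM, as in the TestDOM example, could be passed to collect in place of the JDK-parsed one.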

Send questions, comments, or bug reports to Andy Quick.