W3C home > Mailing lists > Public > html-tidy@w3.org > April to June 2000

[Tidy/Jtidy bug report]ArrayIndexOutOfBoundsException

From: Donna Bergmark <bergmark@CS.Cornell.EDU>
Date: Mon, 05 Jun 2000 14:08:33 -0400
Message-Id: <200006051808.OAA07367@elgin.cs.cornell.edu>
To: html-tidy@w3.org
There still seems to be a problem with ArrayIndexOutOfBounds.
I am running Tidy Version 30 April 2000 and JTidy Version 3 June.
A variant of the Tidy Java Bean example was used.  Here is the
URL that won't parse:
http://www.dlib.org/dlib/september98/millman/09millman.html

Here is the typescript of the run:
Script started on Mon Jun  5 13:44:12 2000
------------------------------------------------------------------
Latest version of Tidy, JTidy still bombs on ArrayIndexOutOfBounds
------------------------------------------------------------------
(1) Put 3 June 2000 JTidy into my path

DHCP211-162.CS.CORNELL.EDU% setenv CLASSPATH \
? /home/bergmark/public/src/tools/JTidy/src/30apr2000:/usr/local/jdk1.2.2/lib:.

DHCP211-162.CS.CORNELL.EDU% echo $CLASSPATH
/home/bergmark/public/src/tools/JTidy/src/30apr2000:/usr/local/jdk1.2.2/lib:.

(2) Get original version of the JTidy bean (copied from the Java HTML Tidy
document).  Compile it.

DHCP211-162.CS.CORNELL.EDU% co -r1.1 Test16.java
RCS/Test16.java,v  -->  Test16.java
revision 1.1
writable Test16.java exists; remove it? [ny](n): y
done

DHCP211-162.CS.CORNELL.EDU% javac Test16.java

(3) Run it on the URL that causes Tidy/JTidy to crash

DHCP211-162.CS.CORNELL.EDU% cat millman.notidy
http://www.dlib.org/dlib/september98/millman/09millman.html

DHCP211-162.CS.CORNELL.EDU% java Test16 \
? http://www.dlib.org/dlib/september98/millman/09millman.html \
? out error
java.lang.ArrayIndexOutOfBoundsException: 0 >= 0
	at java.util.Vector.elementAt(Vector.java:405)
	at org.w3c.tidy.Lexer.insertedToken(Lexer.java:2738)
	at org.w3c.tidy.Lexer.getToken(Lexer.java:1185)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:1672)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:49)
	at org.w3c.tidy.ParserImpl.access$0(ParserImpl.java:36)
	at org.w3c.tidy.ParserImpl$ParseDefList.parse(ParserImpl.java:1438)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:49)
	at org.w3c.tidy.ParserImpl.access$0(ParserImpl.java:36)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2002)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:49)
	at org.w3c.tidy.ParserImpl.access$0(ParserImpl.java:36)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2002)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:49)
	at org.w3c.tidy.ParserImpl.access$0(ParserImpl.java:36)
	at org.w3c.tidy.ParserImpl$ParseBody.parse(ParserImpl.java:652)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:49)
	at org.w3c.tidy.ParserImpl.access$0(ParserImpl.java:36)
	at org.w3c.tidy.ParserImpl$ParseHTML.parse(ParserImpl.java:258)
	at org.w3c.tidy.ParserImpl.parseDocument(ParserImpl.java:2917)
	at org.w3c.tidy.Tidy.parse(Tidy.java:1055)
	at Test16.run(Test16.java:50)
	at java.lang.Thread.run(Thread.java:475)

(4) Here is what was written into the output files

DHCP211-162.CS.CORNELL.EDU% cat error

Tidy (vers 30th April 2000) Parsing "InputStream"
line 10 column 61 - Warning: discarding unexpected </a>
line 16 column 23 - Warning: missing </font> before <h3>
line 16 column 27 - Warning: inserting implicit <font>
line 20 column 1 - Warning: inserting implicit <font>
line 23 column 2 - Warning: missing </font> before <h6>
line 23 column 5 - Warning: inserting implicit <font>
line 26 column 1 - Warning: inserting implicit <font>
line 29 column 23 - Warning: missing </font> before <h3>
line 29 column 23 - Warning: missing </font> before <h3>
line 29 column 27 - Warning: inserting implicit <font>
line 29 column 27 - Warning: inserting implicit <font>
line 30 column 1 - Warning: discarding unexpected </font>
line 32 column 1 - Warning: inserting implicit <font>
line 34 column 1 - Warning: missing </font> before <p>
line 35 column 1 - Warning: inserting implicit <font>
line 41 column 53 - Warning: discarding unexpected </i>
line 43 column 2 - Warning: missing </i> before <p>
line 43 column 2 - Warning: missing </font> before <p>
line 43 column 2 - Warning: missing </font> before <p>
line 47 column 3 - Warning: <img> lacks "alt" attribute
line 57 column 2 - Warning: missing </font> before <p>
line 57 column 2 - Warning: missing </h3> before <p>
line 58 column 1 - Warning: inserting implicit <font>
line 60 column 1 - Warning: discarding unexpected </h3>
line 60 column 6 - Warning: discarding unexpected </font>
line 60 column 13 - Warning: discarding unexpected </font>
line 60 column 20 - Warning: discarding unexpected </font>
line 72 column 1 - Warning: trimming empty <p>
line 72 column 55 - Warning: missing </font> before </h3>
line 72 column 60 - Warning: discarding unexpected </font>
line 72 column 67 - Warning: replacing element</p> by <br>
line 72 column 67 - Warning: inserting implicit <br>
line 99 column 1 - Warning: trimming empty <p>
line 99 column 45 - Warning: missing </font> before </h3>
line 99 column 50 - Warning: discarding unexpected </font>
line 99 column 57 - Warning: replacing element</p> by <br>
line 99 column 57 - Warning: inserting implicit <br>
line 105 column 281 - Warning: trimming empty <p>
line 106 column 646 - Warning: trimming empty <p>
line 107 column 129 - Warning: trimming empty <p>
line 108 column 356 - Warning: trimming empty <p>
line 109 column 115 - Warning: trimming empty <p>
line 115 column 5 - Warning: missing </em> before <dl>
line 115 column 5 - Warning: trimming empty <em>
line 116 column 5 - Warning: inserting implicit <em>
line 117 column 1 - Warning: missing <dd>
line 117 column 1 - Warning: discarding unexpected </em>

DHCP211-162.CS.CORNELL.EDU% cat out
DHCP211-162.CS.CORNELL.EDU% 
DHCP211-162.CS.CORNELL.EDU% exit
exit

Script done on Mon Jun  5 13:46:39 2000
..............................................................
Here is the java code that invoked the parse:
// bergmark - may 2000 - Code example of how to use the Tidy Java Bean

// Code copied from Java HTML Tidy (13 May 2000) document

// CLASSPATH:  must include path to JTidy:
// /home/bergmark/public/src/tools/JTidy/src/30apr2000

import java.io.IOException;
import java.net.URL;
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.io.FileWriter;
import org.w3c.tidy.Tidy;

/**
 * This program shows how HTML could be tidied directly from
 * a URL stream, and running on separate threads.  Note the use
 * of the 'parse' method to parse from an InputStream, and send
 * the pretty-printed result to an OutputStream.
 * In this example thread th1 outputs XML, and thread th2 outputs
 * HTML.  This shows that properties are per instance of Tidy.
 */

public class Test16 implements Runnable {

   private String url;
   private String outFileName;
   private String errOutFileName;

   public Test16(String url, String outFileName,
      String errOutFileName) {
      this.url = url;
      this.outFileName = outFileName;
      this.errOutFileName = errOutFileName;
   }

   public void run() {
      URL u;
      BufferedInputStream in;
      FileOutputStream out;
      Tidy tidy = new Tidy();

      tidy.setXHTML(true);
      try {
         tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
         u = new URL(url);
         in = new BufferedInputStream(u.openStream());
         out = new FileOutputStream(outFileName);
         tidy.parse(in, out);
      } catch ( IOException e ) {
         System.out.println ( this.toString() + e.toString() );
      }
   }

   public static void main( String[] args ) {
      Test16 t = new Test16(args[0], args[1], args[2] );
      Thread th1 = new Thread(t);
      th1.start();
   }
}
Received on Monday, 5 June 2000 14:08:35 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:43 GMT