Recovering from an unexpected "<"

It would be nice if Tidy could recover from an unexpected "<" in the same
way as IE and Netscape.

Described below are what I think are the changes needed to do this.  I'm
using the 26-Jul-99 version of Java Tidy.  There should be a close
correspondence to the C version; my apologies for being unable to provide
the exact changes for C.  These changes seem to work for us although I'm a
bit concerned that I may have missed something because Tidy makes a special
effort to escalate an unexpected "<" from a warning to an error.

It looks like there are two cases to consider: "<" in attribute names and
"<" in attribute values.

For example,

-------- Example input document --------
<html <BAD1>
<a href=foo.fee<BAD2>This is a link</a>

26-Jul-99 Tidy issues errors for the "<" before "BAD1" and "BAD2", plus some
spurious warnings/errors, and does not produce an output document.  After
the changes described below, Tidy issues a warning about the unknown
attribute "<BAD1" and produces the following (which seems to correspond to
IE 5's and Netscape 4.5's interpretation):

-------- Example output document -------
<a href="foo.fee&lt;BAD2">This is a link</a>

The changes are as follows.

First, for "<" in attribute names, search Lexer.java (lexer.c in the C
version) for the first occurrence of "UNEXPECTED_GT" and change:

    Report.attrError(this, this.token, null, Report.UNEXPECTED_GT);
    return null;

to:

    // The '<' is unexpected; recover by skipping to the next '>',
    // '/', or whitespace character (mimics IE and Netscape).
    StringBuffer buf = new StringBuffer();
    while (c != '>' && c != '/' && c != StreamIn.EndOfStream)
    {
        map = MAP((char)c);
        if ((map & WHITE) != 0)
            break;
        buf.append((char)c);
        c =;
    }
    Report.attrError(this, this.token, buf.toString(),
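
To show the skipping rule in isolation, here is a minimal standalone sketch.
It is not the Tidy code: the class and method names (SkipSketch,
skipUnexpected) are hypothetical, it works on a String rather than Tidy's
input stream, and it uses Character.isWhitespace where Tidy uses its
MAP/WHITE lookup.

```java
final class SkipSketch {
    /**
     * Hypothetical helper, not part of Tidy. "start" is the index of the
     * unexpected '<'. Collects characters up to, but not including, the
     * next '>', '/', or whitespace character, or the end of the input.
     */
    static String skipUnexpected(String input, int start) {
        StringBuilder buf = new StringBuilder();
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                break;
            }
            buf.append(c);
            i++;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // First line of the example input document above.
        String tag = "<html <BAD1>";
        // The unexpected '<' is at index 6.
        System.out.println(skipUnexpected(tag, 6)); // prints <BAD1
    }
}
```

The collected text is what would be reported as the unknown attribute name,
which matches the warning about "<BAD1" mentioned earlier.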

Second, for "<" in attribute values, search Lexer.java (lexer.c in the C
version) for the remaining occurrence of "UNEXPECTED_GT" and change:

    if (c == '<')
        Report.attrError(this, this.token, null, Report.UNEXPECTED_GT);

to:

    // Note: Accept unescaped '<' in attribute values (seems to mimic IE
    // and Netscape).
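
As a rough illustration of the effect on the second example line, the
following standalone sketch reads an unquoted attribute value that contains
"<" and writes it back out quoted and escaped. It is not the Tidy code: the
names (AttrValueSketch, readUnquotedValue, quoteForOutput) are hypothetical,
and the escaping is simplified to '<' and '"' only.

```java
final class AttrValueSketch {
    /**
     * Hypothetical helper, not part of Tidy. Reads an unquoted attribute
     * value starting at "start". A '<' is accepted as an ordinary
     * character; the value ends at '>' or whitespace.
     */
    static String readUnquotedValue(String input, int start) {
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '>' || Character.isWhitespace(c)) {
                break;
            }
            i++;
        }
        return input.substring(start, i);
    }

    /** Wraps the value in double quotes, escaping '<' and '"' (simplified). */
    static String quoteForOutput(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '<') {
                out.append("&lt;");
            } else if (c == '"') {
                out.append("&quot;");
            } else {
                out.append(c);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Second line of the example input document above.
        String line = "<a href=foo.fee<BAD2>This is a link</a>";
        int start = line.indexOf("href=") + "href=".length();
        String value = readUnquotedValue(line, start);
        System.out.println("href=" + quoteForOutput(value));
        // prints href="foo.fee&lt;BAD2"
    }
}
```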

In addition, once these two changes have been made, it should be possible to
remove everything associated with the reporting of the "UNEXPECTED_GT" error
although I haven't tried to do so.


Received on Monday, 13 September 1999 17:51:26 UTC