- From: Randy Waki <rwaki@sun10.whizbanglabs.com>
- Date: Mon, 13 Sep 1999 15:50:48 -0600
- To: "HTML Tidy Mailing List" <html-tidy@w3.org>
It would be nice if Tidy could recover from an unexpected "<" in the same
way as IE and Netscape.
Below are the changes I think are needed to do this. I'm using the
26-Jul-99 version of Java Tidy. The C version should correspond closely; my
apologies for being unable to provide the exact changes for C. These changes
seem to work for us, although I'm a bit concerned that I may have missed
something, because Tidy makes a special effort to escalate an unexpected "<"
from a warning to an error.
It looks like there are two cases to consider: "<" in attribute names and
"<" in attribute values.
For example,
-------- Example input document --------
<html <BAD1>
<head>
<title>t</title>
</head>
<body>
<a href=foo.fee<BAD2>This is a link</a>
</body>
</html>
----------------------------------------
26-Jul-99 Tidy issues errors for the "<" before "BAD1" and "BAD2", plus some
spurious warnings/errors, and does not produce an output document. After
the changes described below, Tidy issues a warning about the unknown
attribute "<BAD1" and produces the following (which seems to correspond to
IE 5's and Netscape 4.5's interpretation):
-------- Example output document -------
<html>
<head>
<title>t</title>
</head>
<body>
<a href="foo.fee<BAD2">This is a link</a>
</body>
</html>
----------------------------------------
The changes are as follows.
First, for "<" in attribute names, search Lexer.java (lexer.c) for the first
occurrence of "UNEXPECTED_GT" and change:
    this.in.ungetChar(c);
    Report.attrError(this, this.token, null, Report.UNEXPECTED_GT);
    return null;
to:
    // The '<' is unexpected; recover by skipping to the next '>',
    // '/', or whitespace character (mimics IE and Netscape).
    StringBuffer buf = new StringBuffer();
    buf.append('<');
    while (c != '>' && c != '/' && c != StreamIn.EndOfStream)
    {
        map = MAP((char)c);
        if ((map & WHITE) != 0)
            break;
        buf.append((char)c);
        c = this.in.readChar();
    }
    this.in.ungetChar(c);
    Report.attrError(this, this.token, buf.toString(),
                     Report.UNKNOWN_ATTRIBUTE);
    continue;
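To illustrate the recovery loop outside of Tidy, here is a minimal standalone
sketch. It is not the Tidy code: the class and method names
(StrayLtRecovery, skipBogusAttribute, demo) are made up for illustration, it
uses java.io.PushbackReader in place of Tidy's StreamIn, and it assumes
Character.isWhitespace is a close enough stand-in for the MAP/WHITE lookup.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in Tidy.
final class StrayLtRecovery {

    /**
     * Assumes a '<' has just been consumed where an attribute name was
     * expected. Reads up to (but not including) the next '>', '/',
     * whitespace character, or end of input, and returns the skipped
     * text with the leading '<' restored.
     */
    static String skipBogusAttribute(PushbackReader in) throws IOException {
        StringBuilder buf = new StringBuilder("<");
        int c = in.read();
        while (c != -1 && c != '>' && c != '/' && !Character.isWhitespace(c)) {
            buf.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // leave the terminator for the caller to handle
        }
        return buf.toString();
    }

    /** Returns { skipped text, remaining input } for the text after a '<'. */
    static String[] demo(String afterLt) {
        try (PushbackReader in = new PushbackReader(new StringReader(afterLt))) {
            String skipped = skipBogusAttribute(in);
            StringBuilder rest = new StringBuilder();
            for (int c = in.read(); c != -1; c = in.read()) {
                rest.append((char) c);
            }
            return new String[] { skipped, rest.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the "<html <BAD1>" line in the example document, calling
StrayLtRecovery.demo("BAD1>") would treat "<BAD1" as the bogus attribute
name to report and leave ">" in the input, so the start tag still closes
normally.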
Second, for "<" in attribute values, search Lexer.java (lexer.c) for the
remaining occurrence of "UNEXPECTED_GT" and change:
    if (c == '<')
    {
        this.in.ungetChar(c);
        Report.attrError(this, this.token, null, Report.UNEXPECTED_GT);
        break;
    }
to just a comment (that is, drop the check entirely):
    // Note: Accept unescaped '<' in attribute values (seems to mimic IE
    // and Netscape).
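The effect of dropping that check can be sketched for the unquoted case as
follows. This is a simplified illustration, not Tidy's value parser: the
names AttrValueSketch and readUnquotedValue are hypothetical, quoted values
are not handled, and it assumes an unquoted value ends at whitespace or ">".

```java
// Hypothetical illustration only; not part of Tidy.
final class AttrValueSketch {

    /**
     * Returns the unquoted attribute value at the start of the input.
     * A '<' is treated as an ordinary character; only whitespace or '>'
     * ends the value.
     */
    static String readUnquotedValue(String input) {
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '>' || Character.isWhitespace(c)) {
                break;
            }
            i++;
        }
        return input.substring(0, i);
    }
}
```

Applied to the text after "href=" in the example document,
"foo.fee<BAD2>This is a link</a>", this yields the value "foo.fee<BAD2",
which is what appears in the example output.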
In addition, once these two changes have been made, it should be possible to
remove everything associated with reporting the "UNEXPECTED_GT" error,
although I haven't tried to do so.
Randy
Received on Monday, 13 September 1999 17:51:26 UTC