
Tidy cannot repair tags with missing '>'

From: <Andy.Quick@sybase.com>
Date: Tue, 13 Mar 2001 10:34:42 -0500
To: html-tidy@w3.org
Message-ID: <OF74ED3597.22A94C99-ON85256A0E.004E4ADC@sybase.com>
I could not find an existing report of this problem, so I am posting it.
In the example below, Tidy cannot repair the document because the
second <font> tag is malformed: it is missing its closing '>'.

<html>
<head><title>Sample Problem</title></head>
<body>
<p>
<font size="-2">There seems to be an error occurring when you don't</font>
<font face="arial,helvetica, geneva" size="-2"<b>end</b> a tag with a &gt;.  Tidy won't fix it.</font>
</p>
</body>
</html>

I can propose a possible solution.  The attempt was made
in Java tidy, so I will describe it in terms of Java tidy:

1.  In StreamInImpl, added a small stack to store characters
so ungetChar can be used to back up more than 1 character.

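    public int readChar()
    {
        int c;

        // Serve pushed-back characters first, most recently pushed first.
        if (this.pushed)
        {
            this.prevPos--;
            if( this.prevPos == 0 ) this.pushed = false; // stack is now empty
            c = this.previous[ this.prevPos ];

            // Re-apply the line/column bookkeeping for the re-read character.
            if (c == '\n')
            {
                this.curcol = 1;
                this.curline++;
                return c;
            }

            this.curcol++;
            return c;
        }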
        ....

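    public void ungetChar(int c)
    {
        this.pushed = true;
        // Stack full: restart at slot 0. Note that this discards every
        // character that was still waiting to be re-read.
        if( this.prevPos == 5 ) this.prevPos = 0; // Reset counter.
        this.previous[ this.prevPos ] = c;
        this.prevPos++;

        if (c == '\n')
        {
            --this.curline;
        }

        this.curcol = this.lastcol;
    }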

New class members:
    protected int[] previous;
    protected int   prevPos;

Constructor:
    this.previous = new int[ 5 ]; // allow 5 backup chars
    this.prevPos = 0;
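
As a stand-alone illustration of the pushback idea in step 1, here is
a minimal sketch of a fixed-size character stack. The class name
PushbackStack and its methods are hypothetical and are not part of
Java tidy. Unlike the patch above, it throws when the stack is full
instead of wrapping around, which is one possible design choice.

```java
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not part of Java tidy. */
final class PushbackStack {
    private final int[] chars; // pushed-back characters, oldest at index 0
    private int size;          // number of characters waiting to be re-read

    PushbackStack(int capacity) {
        this.chars = new int[capacity];
    }

    boolean isEmpty() {
        return size == 0;
    }

    /** Pushes a character back; fails loudly rather than overwriting. */
    void unget(int c) {
        if (size == chars.length) {
            throw new IllegalStateException("pushback stack is full");
        }
        chars[size++] = c;
    }

    /** Returns the most recently pushed-back character. */
    int read() {
        if (size == 0) {
            throw new NoSuchElementException("nothing was pushed back");
        }
        return chars[--size];
    }
}
```

Characters come back in reverse order of the unget calls, so to have
the reader see '>' followed by '<', push '<' first and '>' second.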

2.  In Lexer, attempted to recover from unexpected '<'s
in parseAttribute and parseValue.

line 2179
                this.in.ungetChar(c);
                /* Report.attrError(this, this.token, null, Report.UNEXPECTED_GT); */
                c = '<';
                this.in.ungetChar(c);
                return null;

line 2445
                    /* this.in.ungetChar(c); */
                    /* Report.attrError(this, this.token, null, Report.UNEXPECTED_GT); */
                    this.in.ungetChar(c);
                    c = '>';
                    this.in.ungetChar(c);
                    c = lastc;
                    continue;
                    /* break; */
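
The recovery in step 2 can be illustrated outside Java tidy with the
standard java.io.PushbackReader, which already supports pushing back
more than one character. The sketch below is a toy scanner, not the
Lexer code: the class MissingGtSketch and its repair method are
hypothetical, and it ignores comments, scripts and other cases a real
lexer must handle. When it meets '<' inside a tag and outside a quoted
attribute value, it pushes back '<' and then '>', so the next two
reads close the open tag and start the new one.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical toy scanner; not the Java tidy Lexer. */
final class MissingGtSketch {
    static String repair(String html) {
        // Room for two pushed-back characters: '<' and the synthetic '>'.
        PushbackReader in = new PushbackReader(new StringReader(html), 2);
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        int quote = 0; // 0 outside a quoted value, else the opening quote char
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (inTag && quote == 0 && c == '<') {
                    // Unexpected '<' inside a tag: re-read it after a '>'.
                    in.unread('<');
                    in.unread('>');
                    continue;
                }
                if (inTag) {
                    if (quote != 0) {
                        if (c == quote) quote = 0;   // closing quote
                    } else if (c == '"' || c == '\'') {
                        quote = c;                   // opening quote
                    } else if (c == '>') {
                        inTag = false;               // tag ends here
                    }
                } else if (c == '<') {
                    inTag = true;                    // tag starts here
                }
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For example, MissingGtSketch.repair("<font size=\"-2\"<b>end</b>")
yields the same text with a '>' inserted before "<b>".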


Regards,

Andy Quick
Received on Tuesday, 13 March 2001 10:34:38 GMT
