W3C home > Mailing lists > Public > html-tidy@w3.org > April to June 2001

Re: Tidy cannot repair tags with missing '>'

From: Daniel Lieuwen <lieuwen@research.bell-labs.com>
Date: Wed, 11 Apr 2001 12:42:38 -0400
Message-ID: <3AD4897D.36D0844D@research.bell-labs.com>
To: html-tidy@w3.org
Returning the commented out break statement in

           if (c == '<')
            {
                /* UngetChar(c, lexer->in); *
                ReportAttrError(lexer, lexer->token, null, UNEXPECTED_GT);
                /* break; */
            }

from

static char *ParseValue(Lexer *lexer, char *name,^M
                        Bool foldCase, Bool *isempty, int *pdelim)

in lexer.c



would fix many of the cases where this problem is now occuring.


>
>
> From: Andy.Quick@sybase.com
> To: html-tidy@w3.org
> Date: Tue, 13 Mar 2001 10:34:42 -0500
> Message-ID: <OF74ED3597.22A94C99-ON85256A0E.004E4ADC@sybase.com>
> Subject: Tidy cannot repair tags with missing '>'
>
> I could not find anything about this, so I am posting it.
> In the example below, Tidy cannot repair the document
> because the <font> tag is badly formed - it is missing the
> '>'.
>
> <html>
> <head><title>Sample Problem</title></head>
> <body>
> <p>
> <font size="-2">There seems to be an error occurring when you don't</font>
> <font face="arial,helvetica, geneva" size="-2"<b>end</b> a tag with a &gt;.  Tidy won't fix it.</font>
> </p>
> </body>
> </html>
>
> I can propose a possible solution.  The attempt was made
> in Java tidy, so I will describe it in terms of Java tidy:
>
> 1.  In StreamInImpl, added a small stack to store characters
> so ungetChar can be used to back up more than 1 character.
>
>     public int readChar()
>     {
>         int c;
>
>         if (this.pushed)
>         {
>             this.prevPos--;
>             if( this.prevPos == 0 ) this.pushed = false;
>             c = this.previous[ this.prevPos ];
>
>             if (c == '\n')
>             {
>                 this.curcol = 1;
>                 this.curline++;
>                 return c;
>             }
>
>             this.curcol++;
>             return c;
>         }
>         ....
>
>     public void ungetChar(int c)
>     {
>         this.pushed = true;
>         if( this.prevPos == 5 ) this.prevPos = 0; // Reset counter.
>         this.previous[ this.prevPos ] = c;
>         this.prevPos++;
>
>         if (c == '\n')
>         {
>             --this.curline;
>         }
>
>         this.curcol = this.lastcol;
>     }
>
> New class members:
>     protected int[] previous;
>     protected int   prevPos;
>
> Constructor:
>     this.previous = new int[ 5 ]; // allow 5 backup chars
>     this.prevPos = 0;
>
> 2.  In Lexer, attempted to recover from unexpected '<'s
> in parseAttribute and parseValue.
>
> line 2179
>                 this.in.ungetChar(c);
>                 /* Report.attrError(this, this.token, null, Report.UNEXPECTED_GT); */
>      c = '<';
>      this.in.ungetChar(c);
>      return null;
>
> line 2445
>                     /* this.in.ungetChar(c); */
>                     /* Report.attrError(this, this.token, null, Report.UNEXPECTED_GT); */
>          this.in.ungetChar(c);
>          c = '>';
>          this.in.ungetChar(c);
>          c = lastc;
>          continue;
>                     /* break; */
>
> Regards,
>
> Andy Quick
>
>   ------------------------------------------------------------------------
>
>    * Next message: J. David Bryan: "RE: Using strict doctype"
>    * Previous message: Pim van Arend: "Re: tidy without indenting?"
>    * Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
>    * Other mail archives: [this mailing list] [other W3C mailing lists]
>    * Mail actions: [ respond to this message ] [ mail a new topic ]
Received on Wednesday, 11 April 2001 12:42:03 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:45 GMT