- From: Yudong Yang <yangyudong@hotmail.com>
- Date: Thu, 18 Nov 1999 18:17:52 +0800
- To: <html-tidy@w3.org>
- Message-ID: <19991118101801.14513.qmail@hotmail.com>
Hello,

I'm using tidy as an HTML parser for my program. In some cases tidy
mishandles comments that contain a long run of '-' characters (although
such comments are not good practice according to HTML 4.0). You can find
an example at http://cn.yahoo.com/

I've made a fix for this problem.

Best Regards
Yudong Yang

In lexer.c, in the function GetToken(Lexer *lexer, uint mode):

    case LEX_COMMENT:  /* seen <!-- so look for --> */

        /* look for 1st - */
        if (c != '-')
        {
            if (comments > 2 && c == '>')
                ReportWarning(lexer, null, null, BAD_COMMENT);
            comments = -1;
            continue;
        }

        /* now look for 2nd - */
        c = ReadChar(lexer->in);

        if (c == EndOfStream)
        {
            ReportWarning(lexer, null, null, BAD_COMMENT);
            UngetChar(c, lexer->in);
            continue;
        }

        AddCharToLexer(lexer, c);

        if (c != '-')
        {
            comments = 0;
            continue;
        }

        for (;;)  // skip all ------ like strings
        {
            c = ReadChar(lexer->in);
            if (c != '-')
            {
                UngetChar(c, lexer->in);
                break;
            }
            AddCharToLexer(lexer, c);  // just add the -
        }

        lexer->state = LEX_ENDCOMMENT;
        continue;

    case LEX_ENDCOMMENT:  /* seen <!-- .... -- */
    {
        int spc = 0;

        for (;;)  // first skip the white space that HTML 4.0 allows here
        {
            map = MAP(c);
            if ((map & white) == 0)
                break;
            comments = 0;
            c = ReadChar(lexer->in);
            AddCharToLexer(lexer, c);
            spc++;
        }

        if (c == '>')
        {
            lexer->lexsize -= 3 + spc;
            lexer->txtend = lexer->lexsize;
            lexer->lexbuf[lexer->lexsize] = '\0';
            lexer->state = LEX_CONTENT;
            lexer->waswhite = no;
            return lexer->token = CommentToken(lexer);
        }
        else  // not end of comment
        {
            /*
              SGML comment syntax is truly daft!!!

              A comment declaration consists of `<!' followed by
              zero or more comments followed by `>'. Each comment
              starts with `--' and includes all text up to and
              including the next occurrence of `--'. In a comment
              declaration, white space is allowed after each
              comment, but not before the first comment. The entire
              comment declaration is ignored.

                <!-- another -- -- comment -->

              <!-- ---> is bad, so is <!-- foo ----- bar -->
            */

            /* set error position just before offending character */
            lexer->lines = lexer->in->curline;
            lexer->columns = lexer->in->curcol - 1;
            ReportWarning(lexer, null, null, BAD_COMMENT);

            /* treat the chars as part of the comment */
            lexer->state = LEX_COMMENT;  /* comment */
            UngetChar(c, lexer->in);
            lexer->lexsize--;
            lexer->lexbuf[lexer->lexsize] = '\0';
            continue;
        }
    }
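
For reference, the SGML rule quoted in the code comment above can be sketched as a small standalone checker. This is only an illustration of the rule as stated there, not part of tidy and not a replacement for the lexer states above; the function name is_sgml_comment_decl is hypothetical, and it assumes the whole declaration is available as one NUL-terminated string.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not part of tidy): returns 1 if s is exactly one
   comment declaration under the rule quoted above, 0 otherwise. */
int is_sgml_comment_decl(const char *s)
{
    /* a comment declaration starts with "<!" */
    if (strncmp(s, "<!", 2) != 0)
        return 0;
    s += 2;

    /* zero or more comments; each starts with "--" and runs up to and
       including the next occurrence of "--" */
    while (s[0] == '-' && s[1] == '-')
    {
        const char *close = strstr(s + 2, "--");

        if (close == NULL)
            return 0;  /* comment is never closed */
        s = close + 2;

        /* white space is allowed after each comment */
        while (isspace((unsigned char)*s))
            s++;
    }

    /* the declaration must end here with '>' and nothing may follow */
    return s[0] == '>' && s[1] == '\0';
}
```

Under these assumptions the checker accepts `<!-- another -- -- comment -->` and rejects `<!-- --->`, matching the two examples in the comment. It is stricter than the patched lexer, which reports BAD_COMMENT and carries on rather than rejecting the input.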
Received on Thursday, 18 November 1999 05:18:37 UTC