- From: Randy Waki <rwaki@flipdog.com>
- Date: Thu, 10 Aug 2000 16:54:04 -0600
- To: <html-tidy@w3.org>, <dsr@w3.org>
Here's another bug + patch for your consideration. 8-Jul-2000 Java Tidy throws a StringIndexOutOfBoundsException when processing the following document with the -utf8 command line option. It looks like the official C version of Tidy has the same bug although it shows no symptoms on Windows 2000. Note that -utf8 is required to trigger the bug with this document, but the bug could happen with any multibyte character encoding. The <body> tag has a non-standard attribute whose one-character name is a latin capital letter C with cedilla (the HTML character entity Ccedil, Unicode hex value c7). In UTF-8 this is represented as two bytes, c3 87. NOTE: The two-byte (one UTF-8 character) attribute name may get mangled in transit, so you may have to fix up the document by hand. The problem occurs near the end of ParseAttribute() in lexer.c. The code increments the len variable in an attempt to count the number of bytes added to lexbuf, but this fails to account for multibyte characters such as Ccedil. I think the fix is to compute the byte count using lexsize as done elsewhere in lexer.c: *************** *** 2267,2273 **** static char *ParseAttribute(Lexer *lexer, Bool *isempty, Node **asp, Node **php) { ! int map, start, len = 0; char *attr; uint c; --- 2267,2273 ---- static char *ParseAttribute(Lexer *lexer, Bool *isempty, Node **asp, Node **php) { ! int map, start; char *attr; uint c; *************** *** 2366,2377 **** if (!XmlTags && (map & uppercase) != 0) c += (uint)('a' - 'A'); - ++len; AddCharToLexer(lexer, c); c = ReadChar(lexer->in); } attr = (len > 0 ? wstrndup(lexer->lexbuf+start, len) : null); lexer->lexsize = start; --- 2366,2377 ---- if (!XmlTags && (map & uppercase) != 0) c += (uint)('a' - 'A'); AddCharToLexer(lexer, c); c = ReadChar(lexer->in); } + int len = lexer->lexsize - start; attr = (len > 0 ? wstrndup(lexer->lexbuf+start, len) : null); lexer->lexsize = start; ------------------------ Example HTML document ------------------------- <html> <head><title></title></head> <body Ç="xx"> <!-- attribute name is the two bytes hex c3 87 --> text </body> </html> ------------------------------------------------------------------------ Thanks, Randy
Received on Thursday, 10 August 2000 18:55:39 UTC