Bug: Bad length for attribute names with multibyte characters in lexer.c

Here's another bug + patch for your consideration.

8-Jul-2000 Java Tidy throws a StringIndexOutOfBoundsException when
processing the following document with the -utf8 command line option.
It looks like the official C version of Tidy has the same bug although
it shows no symptoms on Windows 2000.  Note that -utf8 is required to
trigger the bug with this document, but the bug could happen with any
multibyte character encoding.

The <body> tag has a non-standard attribute whose one-character name is
a latin capital letter C with cedilla (the HTML character entity Ccedil,
Unicode hex value c7).  In UTF-8 this is represented as two bytes, c3 87.
NOTE: The two-byte (one UTF-8 character) attribute name may get mangled
in transit, so you may have to fix up the document by hand.

The problem occurs near the end of ParseAttribute() in lexer.c.  The
code increments the len variable in an attempt to count the number of
bytes added to lexbuf, but this fails to account for multibyte
characters such as Ccedil.

I think the fix is to compute the byte count using lexsize as done
elsewhere in lexer.c:

***************
*** 2267,2273 ****
  static char  *ParseAttribute(Lexer *lexer, Bool *isempty,
                               Node **asp, Node **php)
  {
!     int map, start, len = 0;
      char *attr;
      uint c;

--- 2267,2273 ----
  static char  *ParseAttribute(Lexer *lexer, Bool *isempty,
                               Node **asp, Node **php)
  {
!     int map, start;
      char *attr;
      uint c;

***************
*** 2366,2377 ****
          if (!XmlTags && (map & uppercase) != 0)
              c += (uint)('a' - 'A');

-         ++len;
          AddCharToLexer(lexer, c);

          c = ReadChar(lexer->in);
      }

      attr = (len > 0 ? wstrndup(lexer->lexbuf+start, len) : null);
      lexer->lexsize = start;

--- 2366,2377 ----
          if (!XmlTags && (map & uppercase) != 0)
              c += (uint)('a' - 'A');

          AddCharToLexer(lexer, c);

          c = ReadChar(lexer->in);
      }

+     int len = lexer->lexsize - start;
      attr = (len > 0 ? wstrndup(lexer->lexbuf+start, len) : null);
      lexer->lexsize = start;

------------------------ Example HTML document -------------------------
<html>
  <head><title></title></head>
  <body Ç="xx">  <!-- attribute name is the two bytes hex c3 87 -->
    text
  </body>
</html>
------------------------------------------------------------------------

Thanks,
Randy

Received on Thursday, 10 August 2000 18:55:39 UTC