Re: Buffer overflow bug in Validator?

From: Martin Duerst (duerst@w3.org)
Date: Wed, Jul 18 2001

  • Next message: Terje Bless: "Re: xhtml 1.1 logo"

    Message-Id: <4.2.0.58.J.20010718174636.03733960@sh.w3.mag.keio.ac.jp>
    Date: Wed, 18 Jul 2001 19:04:30 +0900
    To: www-validator@w3.org, "Christian Wolfgang Hujer" <christian@hujer.com>
    From: Martin Duerst <duerst@w3.org>
    Subject: Re: Buffer overflow bug in Validator?
    
    Reporting back to the list. This bug should now be fixed.
    Christian, please check at http://validator.w3.org:8188/.
    
    It took Terje and me quite a while.
    Below are the details (feel free to ignore).
    
    But I'm not sure we should leave it as it is. It would
    probably make sense to put some limit on overall file
    length, to avoid denial of service attacks.
    
    Regards,   Martin.
    
    
    
    Terje traced the overflow bug to the UTF-8 checking regexp.
    We tried a few things this morning, but didn't make progress.
    
    When I run a simple test program on my side, I confirmed
    Terje's suspicion. (I seem to have a perl interpreter with
    better error reporting (activestate on win98), he just
    got a seg fault).
    
    It looks like Perl cannot match more than 32766 UTF-8 characters,
    because it wants to backtrack on that expression.
    
    Here is the test program I used:
    
     >>>>
    #!/usr/local/bin/perl -w
    # We need Perl 5.004.
    require 5.004;
    
    #
    # Load modules
    use strict;
    
    my $t = 'a';
    my $l = 1;
    while (1) {
    	if ($t =~ m/
    		^(
    		[\x00-\x7F]                             # ASCII
    		| [\xC2-\xDF]        [\x80-\xBF]        # non-overlong 2-byte sequences
    		|  \xE0[\xA0-\xBF]   [\x80-\xBF]        # excluding overlongs
    		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
    		|  \xED[\x80-\x9F]   [\x80-\xBF]        # excluding surrogates
    		|  \xF0[\x90-\xBF]   [\x80-\xBF]{2}     # planes 1-3
    		| [\xF1-\xF3]        [\x80-\xBF]{3}     # planes 4-15
    		|  \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    		)*$
    		/x)
    	{ print STDERR "matched"; }
    	else { print STDERR "unmatched"; }
    	print STDERR " length $l\n";
    	$t .= $t;
    	$l += $l;
    }
     >>>>
    
    And here is what I got:
    
     >>>>
    perl test.pl
    matched length 1
    matched length 2
    matched length 4
    matched length 8
    matched length 16
    matched length 32
    matched length 64
    matched length 128
    matched length 256
    matched length 512
    matched length 1024
    matched length 2048
    matched length 4096
    matched length 8192
    matched length 16384
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 32768
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 65536
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 131072
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 262144
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 524288
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 1048576
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 2097152
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 4194304
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 8388608
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 16777216
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 33554432
    Complex regular subexpression recursion limit (32766) exceeded at test.pl 
    line 12.
    unmatched length 67108864
    Out of memory!
     >>>>
    
    Of course, there is absolutely no need to do backtracking/recursion, but
    perl doesn't see that. After an hour or so of concentrated leafing through
    Jeffrey Fiedl's book and the perl book, I guess I finally got
    some reasonable workaround:
    
    #!/usr/local/bin/perl -w
    # We need Perl 5.004.
    require 5.004;
    
    #
    # Load modules
    use strict;
    
    my $t = 'a';
    my $l = 1;
    while (1) {
    	my $p;
    	while ($t =~ m/\G(
    		  [\x00-\x7F]                           # ASCII
    		| [\xC2-\xDF]        [\x80-\xBF]        # non-overlong 2-byte sequences
    		|  \xE0[\xA0-\xBF]   [\x80-\xBF]        # excluding overlongs
    		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
    		|  \xED[\x80-\x9F]   [\x80-\xBF]        # excluding surrogates
    		|  \xF0[\x90-\xBF]   [\x80-\xBF]{2}     # planes 1-3
    		| [\xF1-\xF3]        [\x80-\xBF]{3}     # planes 4-15
    		|  \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    		)
    		/xg) { $p = pos $t; }
    	if ($p == length $t) {
    		print STDERR "matched";
    	}
    	else {
    		print STDERR "unmatched";
    	}
    	print STDERR " length $l\n";
    	$t .= $t;
    	$l += $l;
    }
    
    The trick is that /g does successive small matches instead of trying
    to match it all at one time, \G makes sure we start exactly where
    we left off (because we have to match all of the string), and
    $p == length $t checks whether we reached the end.
    $p = pos $t; is actually necessary in the loop; taken out of the
    loop perl complains that $p doesn't have a value.
    
    This produces (after a lot of thrashing) for the longer ones:
    
    matched length 1
    matched length 2
    matched length 4
    matched length 8
    matched length 16
    matched length 32
    matched length 64
    matched length 128
    matched length 256
    matched length 512
    matched length 1024
    matched length 2048
    matched length 4096
    matched length 8192
    matched length 16384
    matched length 32768
    matched length 65536
    matched length 131072
    matched length 262144
    matched length 524288
    matched length 1048576
    matched length 2097152
    matched length 4194304
    matched length 8388608
    matched length 16777216
    matched length 33554432
    matched length 67108864
    matched length 134217728
    Out of memory!
    
    Milage may vary on your system. And at some point, I guess
    the validator will just have to give up.
    
    
    Regards,   Martin.
    
    
    At 04:00 01/07/17 +0200, Terje Bless wrote:
    On 29.06.01 at 20:19, Martin Duerst <duerst@w3.org> wrote:
    
     >At 13:05 01/06/29 +0200, Christian Wolfgang Hujer wrote:
     >>I think I have found a bug in the validator or perhaps the underlaying
     >>software. I got a reproducable 500 internal server error from the
     >>validator when validating the result one of my perl scripts produced. It
     >>contained a long long single line (physical ASCII/UTF-8 line) with more
     >>than 240000 characters resp. bytes (without those unneccessary SP, CR or
     >>LF).
     >
     >Interesting. I'll have a look at it if it turns out that it's an
     >i18n-related problem. Otherwise, I suggest that Terje or Gerald have a
     >look at it.
    
    I finally got around to checking it out and it turns out it's in your code;
    specifically in the UTF-8 checking regex. Presumably we run out of memory
    when a huge regex is presented with an overlong string. I'm looking into it
    in the hopes that I can find a workaround, but we may have to refactor that
    code somehow. I'll let you know what I come up with.
    
    BTW, is there any way to split an UTF-8 string such that we are guaranteed
    that we split it on a character boundary? Perhaps some simple pattern we
    could look for that is guaranteed to indicate a character boundary? Or
    maybe a specific number of bytes that will always be a whole number of
    UTF-8 characters?
    
    IOW, I'm looking for some safe way we can split that sequence of bytes up,
    before feeding it to the regex engine, that will not generate invalid UTF-8
    sequences.