Re: Buffer overflow bug in Validator?

Reporting back to the list. This bug should now be fixed.
Christian, please check at http://validator.w3.org:8188/.

It took Terje and me quite a while.
Below are the details (feel free to ignore).

But I'm not sure we should leave it at that. It would
probably make sense to put some limit on overall file
length, to avoid denial-of-service attacks.

Regards,   Martin.



Terje traced the overflow bug to the UTF-8 checking regexp.
We tried a few things this morning, but didn't make progress.

When I ran a simple test program on my side, it confirmed
Terje's suspicion. (I seem to have a Perl interpreter with
better error reporting (ActiveState on Win98); Terje just
got a segfault.)

It looks like Perl cannot match more than 32766 UTF-8 characters
with that expression: each repetition of the group keeps a
backtracking state, and the regex engine's recursion limit is 32766.

Here is the test program I used:

 >>>>
#!/usr/local/bin/perl -w
# We need Perl 5.004.
require 5.004;

#
# Load modules
use strict;

my $t = 'a';
my $l = 1;
while (1) {
	if ($t =~ m/
		^(
		[\x00-\x7F]                             # ASCII
		| [\xC2-\xDF]        [\x80-\xBF]        # non-overlong 2-byte sequences
		|  \xE0[\xA0-\xBF]   [\x80-\xBF]        # excluding overlongs
		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
		|  \xED[\x80-\x9F]   [\x80-\xBF]        # excluding surrogates
		|  \xF0[\x90-\xBF]   [\x80-\xBF]{2}     # planes 1-3
		| [\xF1-\xF3]        [\x80-\xBF]{3}     # planes 4-15
		|  \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
		)*$
		/x)
	{ print STDERR "matched"; }
	else { print STDERR "unmatched"; }
	print STDERR " length $l\n";
	$t .= $t;
	$l += $l;
}
 >>>>

And here is what I got:

 >>>>
perl test.pl
matched length 1
matched length 2
matched length 4
matched length 8
matched length 16
matched length 32
matched length 64
matched length 128
matched length 256
matched length 512
matched length 1024
matched length 2048
matched length 4096
matched length 8192
matched length 16384
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 32768
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 65536
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 131072
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 262144
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 524288
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 1048576
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 2097152
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 4194304
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 8388608
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 16777216
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 33554432
Complex regular subexpression recursion limit (32766) exceeded at test.pl 
line 12.
unmatched length 67108864
Out of memory!
 >>>>

Of course, there is absolutely no need for backtracking/recursion
here, but Perl doesn't see that. After an hour or so of concentrated
leafing through Jeffrey Friedl's book and the Perl book, I finally
arrived at a reasonable workaround:

#!/usr/local/bin/perl -w
# We need Perl 5.004.
require 5.004;

#
# Load modules
use strict;

my $t = 'a';
my $l = 1;
while (1) {
	my $p = 0;	# initialized, so the comparison below is safe even if nothing matches
	while ($t =~ m/\G(
		  [\x00-\x7F]                           # ASCII
		| [\xC2-\xDF]        [\x80-\xBF]        # non-overlong 2-byte sequences
		|  \xE0[\xA0-\xBF]   [\x80-\xBF]        # excluding overlongs
		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
		|  \xED[\x80-\x9F]   [\x80-\xBF]        # excluding surrogates
		|  \xF0[\x90-\xBF]   [\x80-\xBF]{2}     # planes 1-3
		| [\xF1-\xF3]        [\x80-\xBF]{3}     # planes 4-15
		|  \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
		)
		/xg) { $p = pos $t; }
	if ($p == length $t) {
		print STDERR "matched";
	}
	else {
		print STDERR "unmatched";
	}
	print STDERR " length $l\n";
	$t .= $t;
	$l += $l;
}

The trick is that /g does successive small matches instead of trying
to match everything at once, \G makes sure each match starts exactly
where the previous one left off (because we have to match all of the
string), and $p == length $t checks whether we reached the end.
The assignment $p = pos $t; really does need to be inside the loop:
once the final match fails, pos is reset, so after the loop Perl
complains that $p doesn't have a value.

This produces (after a lot of thrashing for the longer inputs):

matched length 1
matched length 2
matched length 4
matched length 8
matched length 16
matched length 32
matched length 64
matched length 128
matched length 256
matched length 512
matched length 1024
matched length 2048
matched length 4096
matched length 8192
matched length 16384
matched length 32768
matched length 65536
matched length 131072
matched length 262144
matched length 524288
matched length 1048576
matched length 2097152
matched length 4194304
matched length 8388608
matched length 16777216
matched length 33554432
matched length 67108864
matched length 134217728
Out of memory!
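A further variant I noticed while reading the perlop docs (a sketch
only; I haven't tried it in the validator): with the /c modifier
added, a failed match no longer resets pos, so the position can be
read once after the loop instead of being saved on every iteration:

```perl
#!/usr/bin/perl -w
use strict;

# With /gc, a failed match leaves pos() at the end of the last
# successful match instead of resetting it to undef.
sub is_valid_utf8 {
    my ($t) = @_;
    1 while $t =~ m/\G(
          [\x00-\x7F]                           # ASCII
        | [\xC2-\xDF]        [\x80-\xBF]        # non-overlong 2-byte sequences
        |  \xE0[\xA0-\xBF]   [\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
        |  \xED[\x80-\x9F]   [\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF]   [\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3]        [\x80-\xBF]{3}     # planes 4-15
        |  \xF4[\x80-\x8F]   [\x80-\xBF]{2}     # plane 16
        )/xgc;
    return (pos($t) || 0) == length $t;
}
```

The (pos($t) || 0) guards against the case where nothing matched at
all, in which case pos is still undefined.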

Mileage may vary on your system. And at some point, I guess
the validator will just have to give up.
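If we'd rather chop the input into bounded pieces than give up
outright, a safe split is possible: UTF-8 continuation bytes are
always in the range \x80-\xBF, so cutting just before any byte
outside that range always falls on a character boundary. A sketch
(the chunk size would be an arbitrary figure; split_point is a
hypothetical helper, not anything in the validator today):

```perl
#!/usr/bin/perl -w
use strict;

# Find a split point at or below $want that does not cut a UTF-8
# character in two: back up past continuation bytes (\x80-\xBF).
# Returns 0 if the prefix is nothing but continuation bytes, which
# can only happen for invalid input; the caller should treat that
# as a validation failure.
sub split_point {
    my ($t, $want) = @_;
    my $p = $want < length($t) ? $want : length($t);
    $p-- while $p > 0 && substr($t, $p, 1) =~ /^[\x80-\xBF]/;
    return $p;
}
```

Each chunk delimited this way could then be fed to the regex
separately, keeping memory use proportional to the chunk size
instead of the whole file.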


Regards,   Martin.


At 04:00 01/07/17 +0200, Terje Bless wrote:
On 29.06.01 at 20:19, Martin Duerst <duerst@w3.org> wrote:

 >At 13:05 01/06/29 +0200, Christian Wolfgang Hujer wrote:
 >>I think I have found a bug in the validator or perhaps the underlying
 >>software. I got a reproducible 500 internal server error from the
 >>validator when validating the result one of my Perl scripts produced. It
 >>contained a very long single line (physical ASCII/UTF-8 line) with more
 >>than 240000 characters resp. bytes (without those unnecessary SP, CR or
 >>LF).
 >
 >Interesting. I'll have a look at it if it turns out that it's an
 >i18n-related problem. Otherwise, I suggest that Terje or Gerald have a
 >look at it.

I finally got around to checking it out and it turns out it's in your code;
specifically in the UTF-8 checking regex. Presumably we run out of memory
when a huge regex is presented with an extremely long string. I'm looking into it
in the hopes that I can find a workaround, but we may have to refactor that
code somehow. I'll let you know what I come up with.

BTW, is there any way to split a UTF-8 string such that we are guaranteed
that we split it on a character boundary? Perhaps some simple pattern we
could look for that is guaranteed to indicate a character boundary? Or
maybe a specific number of bytes that will always be a whole number of
UTF-8 characters?

IOW, I'm looking for some safe way we can split that sequence of bytes up,
before feeding it to the regex engine, that will not generate invalid UTF-8
sequences.

Received on Wednesday, 18 July 2001 06:09:32 UTC