- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 18 Jul 2001 19:04:30 +0900
- To: www-validator@w3.org, "Christian Wolfgang Hujer" <christian@hujer.com>
Reporting back to the list. This bug should now be fixed. Christian, please check at http://validator.w3.org:8188/.

It took Terje and me quite a while. Below are the details (feel free to ignore). But I'm not sure we should leave it as it is. It would probably make sense to put some limit on overall file length, to avoid denial-of-service attacks (see the sketch after the test output below).

Regards,   Martin.

Terje traced the overflow bug to the UTF-8 checking regexp. We tried a few things this morning, but didn't make progress. When I ran a simple test program on my side, I confirmed Terje's suspicion. (I seem to have a Perl interpreter with better error reporting, ActiveState on Win98; he just got a segfault.) It looks like Perl cannot match more than 32766 UTF-8 characters, because it wants to backtrack on that expression.

Here is the test program I used:

>>>>
#!/usr/local/bin/perl -w
# We need Perl 5.004.
require 5.004;

#
# Load modules
use strict;

my $t = 'a';
my $l = 1;
while (1) {
    if ($t =~ m/
      ^(  [\x00-\x7F]                         # ASCII
        | [\xC2-\xDF] [\x80-\xBF]             # non-overlong 2-byte sequences
        | \xE0[\xA0-\xBF] [\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte sequences
        | \xED[\x80-\x9F] [\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF] [\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3] [\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
      )*$
    /x) {
        print STDERR "matched";
    } else {
        print STDERR "unmatched";
    }
    print STDERR " length $l\n";
    $t .= $t;
    $l += $l;
}
>>>>

And here is what I got:

>>>>
perl test.pl
matched length 1
matched length 2
matched length 4
matched length 8
matched length 16
matched length 32
matched length 64
matched length 128
matched length 256
matched length 512
matched length 1024
matched length 2048
matched length 4096
matched length 8192
matched length 16384
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 32768
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 65536
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 131072
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 262144
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 524288
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 1048576
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 2097152
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 4194304
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 8388608
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 16777216
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 33554432
Complex regular subexpression recursion limit (32766) exceeded at test.pl line 12.
unmatched length 67108864
Out of memory!
>>>>
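As for the overall length limit mentioned above, a minimal sketch of such a guard might look like the following; the 1 MB cut-off, the sub name, and the error handling are placeholders only, not anything the validator currently does:

>>>>
#!/usr/local/bin/perl -w
use strict;

# Hypothetical guard against denial of service: refuse to run the
# UTF-8 check on very large documents. The limit, the sub name, and
# the die() message are placeholders, not actual validator code.
my $MAX_BYTES = 1_000_000;

sub check_length_or_bail {
    my ($content) = @_;
    die "Document larger than $MAX_BYTES bytes; refusing to check it.\n"
        if length($content) > $MAX_BYTES;
}

check_length_or_bail('x' x 100);          # fine
check_length_or_bail('x' x 2_000_000);    # dies
>>>>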
Of course, there is absolutely no need to do backtracking/recursion, but Perl doesn't see that. After an hour or so of concentrated leafing through Jeffrey Friedl's book and the Perl book, I think I finally found a reasonable workaround:

>>>>
#!/usr/local/bin/perl -w
# We need Perl 5.004.
require 5.004;

#
# Load modules
use strict;

my $t = 'a';
my $l = 1;
while (1) {
    my $p;
    while ($t =~ m/\G(
          [\x00-\x7F]                         # ASCII
        | [\xC2-\xDF] [\x80-\xBF]             # non-overlong 2-byte sequences
        | \xE0[\xA0-\xBF] [\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte sequences
        | \xED[\x80-\x9F] [\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF] [\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3] [\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
      )/xg) {
        $p = pos $t;
    }
    if ($p == length $t) {
        print STDERR "matched";
    } else {
        print STDERR "unmatched";
    }
    print STDERR " length $l\n";
    $t .= $t;
    $l += $l;
}
>>>>

The trick is that /g does successive small matches instead of trying to match everything at once, \G makes sure each match starts exactly where the previous one left off (because we have to match all of the string), and $p == length $t checks whether we reached the end. The assignment $p = pos $t; really has to be inside the loop; taken out of the loop, Perl complains that $p doesn't have a value.

This produces (after a lot of thrashing for the longer ones):

>>>>
matched length 1
matched length 2
matched length 4
matched length 8
matched length 16
matched length 32
matched length 64
matched length 128
matched length 256
matched length 512
matched length 1024
matched length 2048
matched length 4096
matched length 8192
matched length 16384
matched length 32768
matched length 65536
matched length 131072
matched length 262144
matched length 524288
matched length 1048576
matched length 2097152
matched length 4194304
matched length 8388608
matched length 16777216
matched length 33554432
matched length 67108864
matched length 134217728
Out of memory!
>>>>

Mileage may vary on your system. And at some point, I guess the validator will just have to give up.

Regards,   Martin.

At 04:00 01/07/17 +0200, Terje Bless wrote:

On 29.06.01 at 20:19, Martin Duerst <duerst@w3.org> wrote:

>At 13:05 01/06/29 +0200, Christian Wolfgang Hujer wrote:
>>I think I have found a bug in the validator or perhaps the underlying
>>software. I got a reproducible 500 Internal Server Error from the
>>validator when validating the result one of my Perl scripts produced. It
>>contained a long, long single line (physical ASCII/UTF-8 line) with more
>>than 240000 characters resp. bytes (without those unnecessary SP, CR or
>>LF).
>
>Interesting. I'll have a look at it if it turns out that it's an
>i18n-related problem. Otherwise, I suggest that Terje or Gerald have a
>look at it.

I finally got around to checking it out, and it turns out it's in your
code; specifically in the UTF-8 checking regex. Presumably we run out of
memory when a huge regex is presented with an overlong string.

I'm looking into it in the hope that I can find a workaround, but we may
have to refactor that code somehow. I'll let you know what I come up with.

BTW, is there any way to split a UTF-8 string such that we are guaranteed
to split it on a character boundary? Perhaps some simple pattern we could
look for that is guaranteed to indicate a character boundary? Or maybe a
specific number of bytes that will always be a whole number of UTF-8
characters? IOW, I'm looking for some safe way we can split that sequence
of bytes up, before feeding it to the regex engine, that will not generate
invalid UTF-8 sequences.
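On the splitting question: in valid UTF-8, continuation bytes are always in the range \x80-\xBF, and every byte that starts a character lies outside that range, so cutting just before any byte that is not in \x80-\xBF can never split a character in two (each successful match of the \G loop above ends on such a boundary as well). Below is a minimal sketch of that idea, assuming the input is a plain byte string; the sub name, chunk size, and the fallback for invalid input are placeholders, not existing validator code:

>>>>
#!/usr/local/bin/perl -w
use strict;

# Sketch only: split a byte string into chunks of at most $size bytes
# without cutting a UTF-8 character in two. In valid UTF-8, continuation
# bytes are always \x80-\xBF, so any byte outside that range is a safe
# place to cut. The sub name and chunk size are placeholders.
sub split_utf8_safely {
    my ($bytes, $size) = @_;
    my @chunks;
    while (length($bytes) > $size) {
        my $cut = $size;
        # Back up while the byte at the cut position is a continuation byte.
        $cut-- while $cut > 0 && substr($bytes, $cut, 1) =~ m/[\x80-\xBF]/;
        # If we backed up all the way to 0, the data is not valid UTF-8
        # anyway; cut at $size so we still make progress.
        $cut = $size if $cut == 0;
        push @chunks, substr($bytes, 0, $cut);
        $bytes = substr($bytes, $cut);
    }
    push @chunks, $bytes if length $bytes;
    return @chunks;
}

# Example: each chunk is itself valid UTF-8 if the whole string was.
my @parts = split_utf8_safely("caf\xC3\xA9" x 10000, 16384);
print "Split into ", scalar(@parts), " chunk(s)\n";
>>>>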
Received on Wednesday, 18 July 2001 06:09:32 UTC