Re: Error log diving (was: Re: [Fwd: Software error:])

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ville Skyttä <ville.skytta@iki.fi> wrote:

>It's a bit hard to find the interesting entries since validator is quite
>an errorlog-trasher (still, even though I managed to get some of the
>noisiest bugs fixed for 0.6.6).

And we should probably make an effort to reduce this noise even further,
fairly soon.


><http://www.w3.org/TR/query-semantics/>         (~1.6MB) [170MB]
><http://www.w3.org/TR/2003/WD-xsl11-20031217/>  (~1.8MB) [107MB]
><http://www.go-mono.com/[…].Windows.Forms.html> (~0.9MB) [141MB]
>
>Normal, smallish validation cases seem to take 10MB or so per
>"check" process on my box, so 100+ MB is pretty much... ideas?

Process size will balloon with input document size (and hence complexity),
since each element has a gazillion attributes that will show up in the ESIS
whether they're in the physical markup or not. A normal document has a very
large content:markup ratio; the cited documents have an inordinate amount of
markup compared to the amount of actual data in them.

Well, or at least that's my theory. :-)
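
One quick way to sanity-check that theory would be to pipe the parser's ESIS
output through a small tally script and compare markup bytes (element and
attribute lines) against character data. A minimal sketch in Python, assuming
onsgmls-style ESIS on stdin (the line prefixes are standard ESIS; the script
name and invocation are just illustrative):

    #!/usr/bin/env python
    # Rough markup-vs-content tally over an ESIS stream, e.g.:
    #   onsgmls doc.html 2>/dev/null | python esis_ratio.py
    import sys

    markup = data = 0
    for line in sys.stdin:
        if line[:1] in '()':          # element start/end events
            markup += len(line)
        elif line.startswith('A'):    # attribute, declared *or* defaulted
            markup += len(line)
        elif line.startswith('-'):    # character data
            data += len(line) - 1
    print("markup: %d bytes, data: %d bytes, ratio %.1f:1"
          % (markup, data, markup / max(data, 1)))

If the theory holds, the three URLs above should show a far higher
markup:data ratio than a typical page does.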

BTW, Björn has (on IRC) just suggested some optimizations that could be used
to avoid some of this overhead in a number of cases. I'll have a look at
whether that can reasonably be done for 0.6.7; IIRC the bug on this has been
targeted at 0.7.


>Running "top" on v.w.o suggests that it seems to kill the "check"
>process once its footprint reaches 100MB when validating any of the
>above URLs.  I did not see any related configuration or limits in
>httpd.conf, and the box does not run out of memory or anything.

Which means these are probably either Apache compile-time limits or Debian
kernel ulimits.
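
One way to tell those apart would be to have the CGI dump the resource limits
it actually inherits from Apache: if they come back unlimited, the cap is more
likely Apache's own (e.g. an RLimitMEM directive or a compile-time setting)
than a kernel ulimit. A minimal diagnostic sketch in Python (the choice of
which limits to print is just a guess at where a ~100MB cap would live):

    #!/usr/bin/env python
    # Print the memory-related rlimits this process inherited.
    # A soft/hard value of -1 means RLIM_INFINITY, i.e. no kernel cap.
    import resource

    for name in ('RLIMIT_AS', 'RLIMIT_DATA', 'RLIMIT_RSS'):
        soft, hard = resource.getrlimit(getattr(resource, name))
        print("%-11s soft=%d hard=%d" % (name, soft, hard))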


>There is also one 500 from what is apparently caused by someone
>repeatedly (7ish times) clicking the referer badge in the lower right
>hand corner of the results page after having validated a pretty large
>document with show source and show parse tree options on, causing
>obviously pretty heavy recursion and a URL with a length of about 2k...
>any ideas how we could prevent this?

Look for the User-Agent or similar distinguishing characteristic of the
incoming request; on our own outgoing requests, append an extra token
("recursive") to our User-Agent string. If a request comes in carrying
"recursive", we throw a fatal error. Perhaps add a configurable permitted
recursion level too...
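
A minimal sketch of that logic in Python (the token, the depth limit, and the
UA string in the example are all illustrative; the real check script would do
this in Perl):

    #!/usr/bin/env python
    # Hypothetical self-recursion guard keyed on a User-Agent token.
    MAX_DEPTH = 2          # configurable permitted recursion level
    TOKEN = 'recursive'

    def guard_and_extend(incoming_ua):
        """Die if we're already too deep; else return the UA to send out."""
        if incoming_ua.split().count(TOKEN) >= MAX_DEPTH:
            raise RuntimeError('validator recursion limit reached')
        return incoming_ua + ' ' + TOKEN

    # e.g. guard_and_extend('W3C_Validator/1.305 recursive')
    #      -> 'W3C_Validator/1.305 recursive recursive'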

- -- 
"Temper Temper! Mr. Dre? Mr. NWA? Mr. AK, comin´
 straight outta Compton and y'all better make way?"            -- eminem

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3

iQA/AwUBQK/gHqPyPrIkdfXsEQLL5QCg1HJZgRVZhZtOEDaQ1B1Qwkrf4F0An3U4
SWGhS3bzWDuWdgTEBRlHLNo7
=6FgA
-----END PGP SIGNATURE-----

Received on Saturday, 22 May 2004 19:20:05 UTC