Java DMOZ cleaner

I've put together a little utility for making the (unzipped) DMOZ dumps
readable by David Megginson's RDFFilter. Unfortunately, this turned out to
be more problematic than I expected. With buffered streams I thought it
would be straightforward, but it usually runs out of memory and crashes not
long after 1M lines. One run did go to completion (about 5M lines; the dump
I'm playing with is perhaps 9 months old), but of course I ran it again and
overwrote the result. I've played with various parameters and tried it on
both Win2k and Linux, with pretty much the same behaviour. It looks like
there's some fundamental aspect of Java I wasn't aware of...
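
For comparison, here is a minimal sketch of a copy loop whose memory use
should not grow with the size of the input, because it only ever holds one
fixed-size buffer. It is not the code from DMOZCleaner; the class name, the
buffer size and the cleaning rule (dropping control characters that XML 1.0
does not allow) are all assumptions made for illustration. If a loop shaped
like this still ran out of memory, the cause would more likely be something
that accumulates data elsewhere.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class StreamingCleanerSketch {

    // Hypothetical cleaning rule (an assumption, not the real DMOZCleaner rule):
    // keep tab, LF, CR and everything from U+0020 upwards except U+FFFE/U+FFFF.
    // Surrogate characters are passed through unchanged.
    static boolean isAllowed(char c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return true;
        }
        return c >= 0x20 && c != 0xFFFE && c != 0xFFFF;
    }

    // Copies 'in' to 'out' through one fixed-size buffer, dropping disallowed
    // characters. Returns the number of characters dropped.
    static long clean(Reader in, Writer out) throws IOException {
        char[] buf = new char[8192];
        long dropped = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            int kept = 0;
            for (int i = 0; i < n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact the kept chars in place
                } else {
                    dropped++;
                }
            }
            out.write(buf, 0, kept);
        }
        out.flush();
        return dropped;
    }

    // Usage: java StreamingCleanerSketch <input file> <output file>
    public static void main(String[] args) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8));
             Writer out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.UTF_8))) {
            System.out.println("dropped " + clean(in, out) + " characters");
        }
    }
}
```

Working on character chunks rather than on lines means an unusually long
line cannot force a large allocation, which a readLine()-based loop could do.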

I was thinking about adding a gzip stage at either end of the stream, but
that wouldn't get rid of the basic problem. Another idea that occurred to
me is to stream the dumps directly off the web -> gzip -> filter ->
parser, though I imagine the memory issue would still cause problems.
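
The gzip -> filter -> parser chain could be sketched roughly as below, as a
stack of stream wrappers that each hold only a small buffer. This is an
illustration under assumptions, not the DMOZCleaner source: the class and
method names are made up, the cleaning rule is the same hypothetical
control-character filter, and a plain SAX handler that counts elements
stands in for RDFFilter. The InputStream is taken as a parameter; it could
come from a file or from something like URL.openStream().

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class GzipCleaningPipelineSketch {

    // Hypothetical filter stage: drops control characters XML 1.0 forbids.
    static class CleaningReader extends FilterReader {
        CleaningReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            while (true) {
                int n = super.read(buf, off, len);
                if (n == -1) {
                    return -1;
                }
                int kept = 0;
                for (int i = off; i < off + n; i++) {
                    char c = buf[i];
                    boolean ok = c == '\t' || c == '\n' || c == '\r'
                            || (c >= 0x20 && c != 0xFFFE && c != 0xFFFF);
                    if (ok) {
                        buf[off + kept++] = c;
                    }
                }
                if (kept > 0) {
                    return kept; // if the whole chunk was dropped, read again
                }
            }
        }

        @Override
        public int read() throws IOException {
            char[] one = new char[1];
            return read(one, 0, 1) == -1 ? -1 : one[0];
        }
    }

    // gzip -> decode as UTF-8 -> filter, exposed as an ordinary Reader.
    static Reader open(InputStream gzipped) throws IOException {
        return new CleaningReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8));
    }

    // Parser stage: a stand-in SAX handler that only counts start tags.
    static int countElements(InputStream gzipped) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(open(gzipped)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        count[0]++;
                    }
                });
        return count[0];
    }
}
```

Because the parser pulls characters through the wrappers on demand, nothing
in this chain needs the whole dump in memory at once, provided the handler at
the end does not itself accumulate what it sees.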

The source is at http://www.isacat.net/2001/code/dmoz/DMOZCleaner.htm - I'd
be grateful if someone could tell me where I'm going wrong - this is a
really basic problem...

Cheers,
Danny.

---
Danny Ayers
http://www.isacat.net

Received on Tuesday, 27 March 2001 12:27:43 UTC