Re: Java DMOZ cleaner

Hello Danny

> I've put together a little utility for making the (unzipped) DMOZ dumps
> readable by David Megginson's RDFFilter. Unfortunately, this turned out to
> be a bit more problematic than I thought. Using buffered streams, I would
> have thought this would be straightforward, but no, usually it crashes out
> due to lack of memory not long after 1M lines. I did have one run that went
> to completion (about 5M lines - the dump I'm playing with is perhaps 9
> months old). Of course I tried again and wiped my result. I've played with
> various parameters - tried it on Win2k & Linux, pretty much the same
> behaviour. Looks like there's some fundamental aspect of Java I wasn't aware
> of...

I am not sure Java is the right tool for this. What about using UNIX
sed-like commands? :-)
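
For comparison, the same sed-like, line-at-a-time approach can be sketched in
Java. This is only an illustration under stated assumptions, not the utility
discussed above: the class name DmozLineFilter and its methods are made up, and
the cleaning rule (dropping characters that XML 1.0 does not allow) is a guess
at what "cleaning" might involve. It holds one line in memory at a time, so
memory use should not grow with the number of lines, provided no single line
is very large.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a streaming filter that reads stdin and writes stdout.
class DmozLineFilter {

    // Assumed cleaning rule: keep only code points allowed by XML 1.0.
    static String cleanLine(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // The Char production of XML 1.0; unpaired surrogates are rejected too.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies input to output one line at a time; returns the line count.
    // Line endings are normalised to '\n'.
    static long filter(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long lines = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(cleanLine(line));
            writer.write('\n');
            lines++;
        }
        writer.flush();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumes the dump is UTF-8; malformed bytes are decoded as U+FFFD.
        filter(new InputStreamReader(System.in, StandardCharsets.UTF_8),
               new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
    }
}
```

It would be run like a UNIX filter, for example
"java DmozLineFilter < dump.rdf > dump.clean.rdf", where both file names are
placeholders.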

http://www-diglib.stanford.edu/diglib/ginf/download/dmoz/


regards

Alberto

Received on Tuesday, 27 March 2001 12:49:37 UTC