- From: Danny Ayers <danny@panlanka.net>
- Date: Tue, 27 Mar 2001 23:24:46 +0600
- To: "RDF-Interest" <www-rdf-interest@w3.org>
I've put together a little utility for making the (unzipped) DMOZ dumps readable by David Megginson's RDFFilter. Unfortunately, this turned out to be a bit more problematic than I thought.

Using buffered streams, I would have thought this would be straightforward, but no - it usually crashes out due to lack of memory not long after 1M lines. I did have one run that went to completion (about 5M lines - the dump I'm playing with is perhaps 9 months old), but of course I tried again and wiped my result. I've played with various parameters and tried it on Win2k & Linux - pretty much the same behaviour. Looks like there's some fundamental aspect of Java I wasn't aware of...

I was thinking about adding a gzip section at either end of the stream, but that wouldn't get rid of the basic problem. One idea that occurred to me would be to stream the dumps directly off the web -> gzip -> filter -> parser, though I imagine this memory issue would still cause problems.

The source is at http://www.isacat.net/2001/code/dmoz/DMOZCleaner.htm - I'd be grateful if someone could tell me where I'm going wrong; this is a really basic problem...

Cheers,
Danny.

---
Danny Ayers
http://www.isacat.net
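For reference, a minimal sketch of the kind of constant-memory pipeline the post describes (web -> gzip -> filter -> output), assuming the dump is gzipped text and the cleaning step can be applied one line at a time. StreamingCleaner and cleanLine() are hypothetical names, not the posted DMOZCleaner source, and the per-line cleanup is left as a stub.

```java
import java.io.*;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class StreamingCleaner {

    // Hypothetical stand-in for the real cleaning step (escaping stray
    // entities etc.); the point is that it only ever sees a single line.
    static String cleanLine(String line) {
        return line;
    }

    public static void main(String[] args) throws IOException {
        URL source = new URL(args[0]);   // e.g. URL of the gzipped dump
        File target = new File(args[1]); // cleaned output file

        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(
                     new GZIPInputStream(source.openStream()), "UTF-8"));
             BufferedWriter out = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(target), "UTF-8"))) {

            String line;
            // Only the current line is held in memory, so heap use stays
            // flat no matter how many million lines the dump contains.
            while ((line = in.readLine()) != null) {
                out.write(cleanLine(line));
                out.newLine();
            }
        }
    }
}
```

The same bounded-memory property would hold if the filtered output were handed straight to the parser instead of written to disk; with buffered streams used this way, running out of heap usually points to the whole document being accumulated somewhere (a growing StringBuffer, a DOM, or similar) rather than to the streams themselves.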
Received on Tuesday, 27 March 2001 12:27:43 UTC