Hello Danny

> I've put together a little utility for making the (unzipped) DMOZ dumps
> readable by David Megginson's RDFFilter. Unfortunately, this turned out to
> be a bit more problematic than I thought. Using buffered streams, I would
> have thought this would be straightforward, but no, usually it crashes out
> due to lack of memory not long after 1M lines. I did have one run that went
> to completion (about 5M lines - the dump I'm playing with is perhaps 9
> months old). Of course I tried again and wiped my result. I've played with
> various parameters - tried it on Win2k & Linux, pretty much the same
> behaviour. Looks like there's some fundamental aspect of Java I wasn't aware
> of...

I am not sure Java is the right tool for this; what about using UNIX sed-like commands? :-)

http://www-diglib.stanford.edu/diglib/ginf/download/dmoz/

regards

Alberto

Received on Tuesday, 27 March 2001 12:49:37 GMT
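
For comparison with the sed-style suggestion, here is a minimal Java sketch of the same idea: read one line, transform it, write it out, and keep nothing else in memory. The thread does not say what the utility actually does to each line, so the class name `StreamingLineFilter` and the `cleanLine` step (dropping characters that XML 1.0 does not allow) are hypothetical stand-ins, and UTF-8 input is an assumption. If a loop shaped like this still runs out of memory, the cause is more likely something that accumulates across lines than the buffered streams themselves.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class StreamingLineFilter {

    // Hypothetical per-line cleanup (not taken from the thread):
    // keep only code points permitted by the XML 1.0 Char production.
    static String cleanLine(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            i += Character.charCount(cp);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Copies in to out one line at a time and returns the number of lines.
    // Only the current line is held, so memory use does not grow with input size.
    static long filter(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(cleanLine(line));
            writer.write('\n');
            count++;
        }
        writer.flush();
        return count;
    }

    // Usage: java StreamingLineFilter <input file> <output file>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: StreamingLineFilter <in> <out>");
            return;
        }
        // UTF-8 is an assumption about the dump's encoding.
        try (Reader in = new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.UTF_8)) {
            long lines = filter(in, out);
            System.err.println("lines processed: " + lines);
        }
    }
}
```

The `filter` method takes a plain `Reader` and `Writer` so it can be tried on a small string before being pointed at the full dump.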