Re: Java DMOZ cleaner

>>>Danny Ayers said:
> <- However, the resulting files generally aren't usually legal Unicode
> <- or thus legal XML, so probably your XML/RDF parser will crash and
> <- burn afterwards on the output anyway if it doesn't get blown away by
> <- memory leaks/growth.
> 
> This really seems like a productive area ;-)

Well, some parsers, such as the C ones (Jason Diamond's Repat and
my Rapier), can handle all the data up to the illegal character
sequences and are only limited by I/O speed, not memory.  The Java
ones all tend to leak until they collapse, unless your machine has
oodles of memory.

> 
> <- Small enough to enclose below (also deletes Adult area for less
> <- embarassing demos!)
> 
> Damn fine idea. I don't speak Perl,  what's going on with the 3 values?

As it says:

> <- #    0 - before first Adult topic
> <- #    1 - during Adult topics
> <- #    2 - afterwards

So it has three states in processing.  Not really too
important... but best not to create an RDF pr0n database.
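
For anyone who doesn't read Perl, here is roughly how a three-state
filter like that can work.  This is a Python sketch, not the script
from my earlier message.  It assumes each topic starts with a line
containing <Topic r:id="...">, and that the Adult topics sit in one
contiguous run in the dump.  The names TOPIC_RE and strip_adult are
made up for illustration.

```python
import re

# Assumed marker format: a topic begins with a line like <Topic r:id="Top/...">
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)"')

BEFORE, DURING, AFTER = 0, 1, 2  # the three states from the comments above


def strip_adult(lines):
    """Yield the input lines, skipping one contiguous run of Top/Adult topics."""
    state = BEFORE
    for line in lines:
        # Once past the Adult run there is nothing left to check.
        if state != AFTER:
            m = TOPIC_RE.search(line)
            if m:
                topic = m.group(1)
                is_adult = topic == "Top/Adult" or topic.startswith("Top/Adult/")
                if state == BEFORE and is_adult:
                    state = DURING
                elif state == DURING and not is_adult:
                    state = AFTER
        if state != DURING:
            yield line


if __name__ == "__main__":
    import sys
    # Filter stdin to stdout one line at a time, so memory use stays flat.
    sys.stdout.writelines(strip_adult(sys.stdin))
```

The third state is there so the rest of the file is copied straight
through without testing every line once the Adult run is over.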

Incidentally, Alberto pointed out the original sed code I converted:

http://www-diglib.stanford.edu/diglib/ginf/download/dmoz/

but that runs 10-20x slower than the Perl.

Dave

Received on Tuesday, 27 March 2001 13:43:35 UTC