RE: Java DMOZ cleaner

<- However, the resulting files generally aren't legal Unicode,
<- and thus aren't legal XML, so your XML/RDF parser will probably crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.

This really seems like a productive area ;-)

<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)

Damn fine idea. I don't speak Perl, though: what's going on with the 3 values?


Danny Ayers

<- -----Original Message-----
<- From: Dave Beckett []
<- Sent: 27 March 2001 23:43
<- To:
<- Cc: Danny Ayers
<- Subject: Re: Java DMOZ cleaner
<- >>>Danny Ayers said:
<- > I've put together a little utility for making the (unzipped) DMOZ dumps
<- > readable ...
<- This inspires me to publish the long-sitting-on-the-shelf perl script
<- based on an awk or sed script from Sergey Melnik.  I used it last
<- year to clean DMOZ dumps (content.rdf.u8).  The program works on
<- all the data without sucking up all your memory.
<- However, the resulting files generally aren't legal Unicode,
<- and thus aren't legal XML, so your XML/RDF parser will probably crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.
<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)
<- Dave
<- ----------------------------------------------------------------------
<- #!/usr/bin/perl
<- #
<- # Convert DMOZ content.rdf.gz data dump into legal RDF
<- # (and optionally delete Adult content)
<- #
<- # Copyright 2000 Dave Beckett, ILRT, University of Bristol
<- #
<- #
<- # USAGE:
<- #  gunzip -d <content.txt.gz | ./content.perl >content.rdf
<- #
<- my $delete_adult_content=1;
<- my $in_body=0;
<- # Three values:
<- #    0 - before first Adult topic
<- #    1 - during Adult topics
<- #    2 - afterwards
<- my $in_adult_content=0;
<- while(<>) {
<-   if (/xml version=/) {
<-     $_ .= qq{<!DOCTYPE rdf:RDF [<!ENTITY dmoz "">]>\n};
<-     $in_body=1;
<-   };
<-   next unless $in_body;
<-   if ($delete_adult_content &&
<-       m%<Topic .*="([^"]+)">%) {
<-     my $topic=$1;
<-     if($in_adult_content == 0) {
<-       $in_adult_content = 1 if $topic =~ /Adult/;
<-     } elsif( $in_adult_content == 1) {
<-       if ($topic !~ /Adult/) {
<-         $in_adult_content = 2;
<-         $delete_adult_content = 0; # optimisation to prevent extra match
<-       }
<-     }
<-   }
<-   next if ($delete_adult_content && $in_adult_content == 1);
<-   s% about=% r:about=%;
<-   s%r:id=%r:ID=%;
<-   s%rdf"%rdf/"%;
<-   s%TR/RDF/%1999/02/22-rdf-syntax-ns#%;
<-   s%<RDF %<r:RDF %;
<-   s%</RDF%</r:RDF%;
<-   s%r:ID="Top"%r:ID="\&dmoz;"%;
<-   s%(r:ID=")(Top/)(.*)%$1\&dmoz;$3%;
<-   s%(r:ID=")(.*:Top)(.*)%$1\&dmoz;$3%;
<-   # Quote spaces in URLs (WRONG) correctly
<-   # s/resource="([^"]+)"/my $url=$1; $url=~s, ,%20,g;
<-   #   qq{resource="$url"}/e;
<-   # 1) Quote high-space char (0xA0, 0240 octal) in URLs (WRONG) correctly
<-   # 2) Remove all text after multiple spaces (WRONG) in URLs - like this:
<-   #  <link r:resource="
<-   #          Written by: Björk, Sjón and LvT
<-   #          Performed by: Björk"/>
<-   s/(about|resource)="([^"]+)"/my($attr,$url)=($1,$2);
<-     $url=~s,\240,\%A0,g; $url=~s,  +.*$,,g; qq{$attr="$url"}/e;
<-   # Remove ^N in content
<-   s/\016//g;
<-   print;
<- }
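
On the three values: the following is a minimal sketch in Java of what the
$in_adult_content flag in the script above appears to do. It is not Dave's
code. The class name AdultFilterSketch, the State enum and the filter method
are made up for illustration, and it covers only the Adult-skipping logic,
not the namespace or URL rewrites. The flag seems to act as a small state
machine: BEFORE (0) until the first Topic whose name contains "Adult",
DURING (1) while Adult topics continue, and AFTER (2) once a non-Adult Topic
is seen. Lines are dropped only in the DURING state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdultFilterSketch {
    // Same pattern as the script's m%<Topic .*="([^"]+)">%
    private static final Pattern TOPIC =
        Pattern.compile("<Topic .*=\"([^\"]+)\">");

    // Hypothetical names for the script's values 0, 1 and 2
    enum State { BEFORE, DURING, AFTER }

    public static List<String> filter(List<String> lines) {
        State state = State.BEFORE;
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Once AFTER is reached the script stops matching Topic lines
            // (its "optimisation"), so later Adult topics are not removed.
            if (state != State.AFTER) {
                Matcher m = TOPIC.matcher(line);
                if (m.find()) {
                    boolean adult = m.group(1).contains("Adult");
                    if (state == State.BEFORE && adult) {
                        state = State.DURING;
                    } else if (state == State.DURING && !adult) {
                        state = State.AFTER;
                    }
                }
            }
            // Equivalent of: next if $in_adult_content == 1
            if (state != State.DURING) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "<Topic r:id=\"Top/Arts\">",
            "arts line",
            "<Topic r:id=\"Top/Adult/X\">",
            "adult line",
            "<Topic r:id=\"Top/Business\">",
            "business line");
        for (String line : filter(sample)) {
            System.out.println(line);
        }
    }
}
```

Because the state never leaves AFTER, this approach only works if all the
Adult topics sit in one contiguous run in the dump, which the script seems
to assume.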

Received on Tuesday, 27 March 2001 13:12:09 UTC