RE: Java DMOZ cleaner from Danny Ayers on 2001-03-27 (www-rdf-interest@w3.org from March 2001)

From: Danny Ayers <danny@panlanka.net>
Date: Wed, 28 Mar 2001 00:09:03 +0600
To: "Dave Beckett" <dave.beckett@bristol.ac.uk>, <www-rdf-interest@w3.org>
Message-ID: <EBEPLGMHCDOJJJPCFHEFEEFCCPAA.danny@panlanka.net>
<- However, the resulting files generally aren't legal Unicode
<- or thus legal XML, so probably your XML/RDF parser will crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.

This really seems like a productive area ;-)

<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)

Damn fine idea. I don't speak Perl; what's going on with the 3 values?

Cheers,
Danny.

---
Danny Ayers
http://www.isacat.net

<- -----Original Message-----
<- From: Dave Beckett [mailto:dave.beckett@bristol.ac.uk]
<- Sent: 27 March 2001 23:43
<- To: www-rdf-interest@w3.org
<- Cc: Danny Ayers
<- Subject: Re: Java DMOZ cleaner
<-
<-
<- >>>Danny Ayers said:
<- > I've put together a little utility for making the (unzipped) DMOZ dumps
<- > readable ...
<-
<- This inspires me to publish the long-sitting-on-the-shelf perl script
<- based on an awk or sed script from Sergey Melnik.  I used it last
<- year to clean DMOZ dumps (content.rdf.u8).  The program does work on
<- all the data without sucking up all your memory.
<-
<- However, the resulting files generally aren't legal Unicode
<- or thus legal XML, so probably your XML/RDF parser will crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.
<-
<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)
<-
<- Dave
<-
<- ----------------------------------------------------------------------
<- #!/usr/bin/perl
<- #
<- # Convert DMOZ content.rdf.gz data dump into legal RDF
<- # (and optionally delete Adult content)
<- #
<- # Copyright 2000 Dave Beckett, ILRT, University of Bristol
<- # http://purl.org/net/dajobe/
<- #
<- # USAGE:
<- #  gunzip -d <content.txt.gz | ./content.perl >content.rdf
<- #
<-
<- my $delete_adult_content=1;
<-
<-
<- my $in_body=0;
<-
<- # Three values:
<- #    0 - before first Adult topic
<- #    1 - during Adult topics
<- #    2 - afterwards
<- my $in_adult_content=0;
<-
<- while(<>) {
<-
<-   if (/xml version=/) {
<-     $_ .= qq{<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http://dmoz.org/">]>\n};
<-     $in_body=1;
<-
<-   };
<-
<-   next unless $in_body;
<-
<-   if ($delete_adult_content &&
<-       m%<Topic .*="([^"]+)">%) {
<-     my $topic=$1;
<-     if($in_adult_content == 0) {
<-       $in_adult_content = 1 if $topic =~ /Adult/;
<-     } elsif( $in_adult_content == 1) {
<-       if ($topic !~ /Adult/) {
<-         $in_adult_content = 2;
<-         $delete_adult_content = 0; # optimisation to prevent extra match
<-       }
<-     }
<-   }
<-   next if ($delete_adult_content && $in_adult_content == 1);
<-
<-
<-   s% about=% r:about=%;
<-   s%r:id=%r:ID=%;
<-   s%rdf"%rdf/"%;
<-   s%TR/RDF/%1999/02/22-rdf-syntax-ns#%;
<-   s%<RDF %<r:RDF %;
<-   s%</RDF%</r:RDF%;
<-
<-   s%r:ID="Top"%r:ID="\&dmoz;"%;
<-   s%(r:ID=")(Top/)(.*)%$1\&dmoz;$3%;
<-   s%(r:ID=")(.*:Top)(.*)%$1\&dmoz;$3%;
<-
<-   # Quote spaces in URLs (WRONG) correctly
<-   # s/resource="([^"]+)"/my $url=$1; $url=~s, ,%20,g; qq{resource="$url"}/e;
<-
<-   # 1) Quote high-space char (0xA0, 0240 octal) in URLs (WRONG) correctly
<-   # 2) Remove all text after multiple spaces (WRONG) in URLs - like this:
<-   #  <link r:resource="http://www.unravel.org/tinyspark/ditd/
<-   #                                                        Written by: Björk, Sjón and LvT
<-   #                  Performed by: Björk"/>
<-
<-   # Use $attr rather than $1 in the result: a successful inner
<-   # substitution resets $1.
<-   s/(about|resource)="([^"]+)"/my($attr,$url)=($1,$2); $url=~s,\240,\%A0,g; $url=~s,  +.*$,,g; qq{$attr="$url"}/e;
<-
<-   # Remove ^N in content
<-   s/\016//g;
<-
<-   print;
<- }
<-
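
On the question of the 3 values: the $in_adult_content flag in the script above can be read as a small state machine. What follows is a minimal sketch of that reading, not Dave's code. The helper next_state and the topic names are made up for illustration, and it assumes the Adult topics sit together in one contiguous run of the dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): given the current
# state and a topic name, return the next state.
#   0 - before the first Adult topic
#   1 - inside the run of Adult topics
#   2 - after the Adult topics (terminal state)
sub next_state {
    my ($state, $topic) = @_;
    return 1 if $state == 0 && $topic =~ /Adult/;   # first Adult topic seen
    return 2 if $state == 1 && $topic !~ /Adult/;   # first topic past the run
    return $state;                                  # otherwise no change
}

# Made-up topic sequence standing in for the <Topic ...> lines of the dump.
my @topics = ('Top', 'Top/Adult', 'Top/Adult/Arts', 'Top/Arts', 'Top/Business');

my $state = 0;
my @kept;
for my $topic (@topics) {
    $state = next_state($state, $topic);
    push @kept, $topic unless $state == 1;   # input is dropped only in state 1
}

print join("\n", @kept), "\n";
```

Under these assumptions only the topics seen in state 1 are dropped. State 2 appears to exist so the script can stop testing every Topic line once the Adult run has ended, which is what setting $delete_adult_content to 0 does in the original. A side effect is that any later topic whose name happens to contain "Adult" would be kept.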
Received on Tuesday, 27 March 2001 13:12:09 UTC