Re: Java DMOZ cleaner from Dave Beckett on 2001-03-27 (www-rdf-interest@w3.org from March 2001)

From: Dave Beckett <dave.beckett@bristol.ac.uk>
Date: Tue, 27 Mar 2001 18:42:35 +0100
To: www-rdf-interest@w3.org
CC: Danny Ayers <danny@panlanka.net>
Message-ID: <16232.985714955@tatooine.ilrt.bris.ac.uk>

>>>Danny Ayers said:
> I've put together a little utility for making the (unzipped) DMOZ dumps
> readable ...

This inspires me to publish the long-sitting-on-the-shelf perl script
based on an awk or sed script from Sergey Melnik.  I used it last
year to clean DMOZ dumps (content.rdf.u8).  The program works on
all the data without sucking up all your memory.

However, the resulting files usually aren't legal Unicode, and thus
aren't legal XML, so your XML/RDF parser will probably crash and
burn on the output anyway, if it doesn't get blown away by
memory leaks/growth first.
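
One way to find the offending lines before handing the output to a
parser is to try decoding each line as strict UTF-8.  The following is
only a minimal sketch, assuming the core Encode module is available;
is_valid_utf8 and bad_utf8_lines are hypothetical helper names and are
not part of the script enclosed in this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Return 1 if the byte string decodes as strict UTF-8, 0 otherwise.
# (Hypothetical helper; works on a copy, so the caller's data is untouched.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Return the line numbers of all lines on the filehandle that are
# not valid UTF-8.  (Hypothetical helper.)
sub bad_utf8_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        push @bad, $. unless is_valid_utf8($line);
    }
    return @bad;
}

# Check every file named on the command line, reading raw bytes.
for my $file (@ARGV) {
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    print "$file:$_: not valid UTF-8\n" for bad_utf8_lines($fh);
    close($fh);
}
```

This reads one line at a time, so it should cope with the full dump
without holding it in memory.  It only reports encoding problems; it
does not check XML well-formedness.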

Small enough to enclose below (also deletes Adult area for less
embarrassing demos!)

Dave

----------------------------------------------------------------------
#!/usr/bin/perl
#
# Convert DMOZ content.rdf.gz data dump into legal RDF
# (and optionally delete Adult content)
#
# Copyright 2000 Dave Beckett, ILRT, University of Bristol
# http://purl.org/net/dajobe/
#
# USAGE:
#  gunzip -d <content.txt.gz | ./content.perl >content.rdf
#

my $delete_adult_content=1;


my $in_body=0;

# Three values:
#    0 - before first Adult topic
#    1 - during Adult topics
#    2 - afterwards
my $in_adult_content=0;

while(<>) {
  
  if (/xml version=/) {
    $_ .= qq{<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http://dmoz.org/">]>\n};
    $in_body=1;
    
  };

  next unless $in_body;

  if ($delete_adult_content &&
      m%<Topic .*="([^"]+)">%) {
    my $topic=$1;
    if($in_adult_content == 0) {
      $in_adult_content = 1 if $topic =~ /Adult/;      
    } elsif( $in_adult_content == 1) {
      if ($topic !~ /Adult/) {
        $in_adult_content = 2;
        $delete_adult_content = 0; # optimisation to prevent extra match
      }
    }
  }
  next if ($delete_adult_content && $in_adult_content == 1);


  s% about=% r:about=%;
  s%r:id=%r:ID=%;
  s%rdf"%rdf/"%;
  s%TR/RDF/%1999/02/22-rdf-syntax-ns#%;
  s%<RDF %<r:RDF %;
  s%</RDF%</r:RDF%;
  
  s%r:ID="Top"%r:ID="\&dmoz;"%;
  s%(r:ID=")(Top/)(.*)%$1\&dmoz;$3%;
  s%(r:ID=")(.*:Top)(.*)%$1\&dmoz;$3%;

  # Quote spaces in URLs (WRONG) correctly
  # s/resource="([^"]+)"/my $url=$1; $url=~s, ,%20,g; qq{resource="$url"}/e;

  # 1) Quote high-space char (0xA0, 0240 octal) in URLs (WRONG) correctly
  # 2) Remove all text after multiple spaces (WRONG) in URLs - like this:
  #  <link r:resource="http://www.unravel.org/tinyspark/ditd/                                                            Written by: Björk, Sjón and LvT                                                            Performed by: Björk"/>
  # Use the saved $attr, not $1: a successful inner substitution resets $1
  s/(about|resource)="([^"]+)"/my($attr,$url)=($1,$2);$url=~s,\240,\%A0,g; $url=~s,  +.*$,,g; qq{$attr="$url"}/e;

  # Remove ^N in content
  s/\016//g;

  print;
}
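
The URL cleanup step can be tried in isolation on a single line.  This
is just a sketch of that one substitution wrapped in a hypothetical
helper, clean_url_attrs; the example.org URLs in the usage note are
placeholders, not real DMOZ data.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the about/resource URL cleanup to one line.
# Inside the replacement code, the 0xA0 byte (octal 240) is escaped as %A0,
# then everything from the first run of two or more spaces onwards is removed.
# The attribute name is kept in $attr because a successful inner
# substitution resets $1.
sub clean_url_attrs {
    my ($line) = @_;
    $line =~ s{(about|resource)="([^"]+)"}{
        my ($attr, $url) = ($1, $2);
        $url =~ s/\240/%A0/g;
        $url =~ s/  +.*$//;
        qq{$attr="$url"};
    }e;
    return $line;
}
```

For instance, clean_url_attrs applied to
<link r:resource="http://example.org/a/    Written by: X"/>
should give <link r:resource="http://example.org/a/"/>.  Like the
script, it only touches the first about or resource attribute on a line.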

Received on Tuesday, 27 March 2001 12:42:37 UTC