- From: Danny Ayers <danny@panlanka.net>
- Date: Wed, 28 Mar 2001 00:09:03 +0600
- To: "Dave Beckett" <dave.beckett@bristol.ac.uk>, <www-rdf-interest@w3.org>
<- However, the resulting files generally aren't legal Unicode,
<- and thus not legal XML, so probably your XML/RDF parser will crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.

This really seems like a productive area ;-)

<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)

Damn fine idea. I don't speak Perl, what's going on with the 3 values?

Cheers,
Danny.

---
Danny Ayers
http://www.isacat.net

<- -----Original Message-----
<- From: Dave Beckett [mailto:dave.beckett@bristol.ac.uk]
<- Sent: 27 March 2001 23:43
<- To: www-rdf-interest@w3.org
<- Cc: Danny Ayers
<- Subject: Re: Java DMOZ cleaner
<-
<-
<- >>>Danny Ayers said:
<- > I've put together a little utility for making the (unzipped) DMOZ dumps
<- > readable ...
<-
<- This inspires me to publish the long-sitting-on-the-shelf perl script
<- based on an awk or sed script from Sergey Melnik. I used it last
<- year to clean DMOZ dumps (content.rdf.u8). The program works on
<- all the data without sucking up all your memory.
<-
<- However, the resulting files generally aren't legal Unicode,
<- and thus not legal XML, so probably your XML/RDF parser will crash and
<- burn afterwards on the output anyway if it doesn't get blown away by
<- memory leaks/growth.
<-
<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)
<-
<- Dave
<-
<- ----------------------------------------------------------------------
<- #!/usr/bin/perl
<- #
<- # Convert DMOZ content.rdf.gz data dump into legal RDF
<- # (and optionally delete Adult content)
<- #
<- # Copyright 2000 Dave Beckett, ILRT, University of Bristol
<- # http://purl.org/net/dajobe/
<- #
<- # USAGE:
<- #   gunzip -d <content.txt.gz | ./content.perl >content.rdf
<- #
<-
<- my $delete_adult_content=1;
<-
<-
<- my $in_body=0;
<-
<- # Three values:
<- #   0 - before first Adult topic
<- #   1 - during Adult topics
<- #   2 - afterwards
<- my $in_adult_content=0;
<-
<- while(<>) {
<-
<-   if (/xml version=/) {
<-     $_ .= qq{<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http://dmoz.org/">]>\n};
<-     $in_body=1;
<-
<-   };
<-
<-   next unless $in_body;
<-
<-   if ($delete_adult_content &&
<-       m%<Topic .*="([^"]+)">%) {
<-     my $topic=$1;
<-     if($in_adult_content == 0) {
<-       $in_adult_content = 1 if $topic =~ /Adult/;
<-     } elsif( $in_adult_content == 1) {
<-       if ($topic !~ /Adult/) {
<-         $in_adult_content = 2;
<-         $delete_adult_content = 0; # optimisation to prevent extra match
<-       }
<-     }
<-   }
<-   next if ($delete_adult_content && $in_adult_content == 1);
<-
<-
<-   s% about=% r:about=%;
<-   s%r:id=%r:ID=%;
<-   s%rdf"%rdf/"%;
<-   s%TR/RDF/%1999/02/22-rdf-syntax-ns#%;
<-   s%<RDF %<r:RDF %;
<-   s%</RDF%</r:RDF%;
<-
<-   s%r:ID="Top"%r:ID="\&dmoz;"%;
<-   s%(r:ID=")(Top/)(.*)%$1\&dmoz;$3%;
<-   s%(r:ID=")(.*:Top)(.*)%$1\&dmoz;$3%;
<-
<-   # Quote spaces in URLs (WRONG) correctly
<-   # s/resource="([^"]+)"/my $url=$1; $url=~s, ,%20,g;
<-   #   qq{resource="$url"}/e;
<-
<-   # 1) Quote high-space char (0xA0, 0240 octal) in URLs (WRONG) correctly
<-   # 2) Remove all text after multiple spaces (WRONG) in URLs - like this:
<-   #    <link r:resource="http://www.unravel.org/tinyspark/ditd/
<-   #    Written
<-   #    by: Björk, Sjón and LvT
<-   #    Performed by: Björk"/>
<-
<-   # Emit $attr rather than $1: a successful inner s,,, resets $1
<-   s/(about|resource)="([^"]+)"/my($attr,$url)=($1,$2);
<-     $url=~s,\240,\%A0,g; $url=~s, +.*$,,g;
<-     qq{$attr="$url"}/e;
<-
<-   # Remove ^N in content
<-   s/\016//g;
<-
<-   print;
<- }
<-
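
On the question of the three values: they read as a small state machine that relies on the Adult topics forming one contiguous run in the dump. What follows is a minimal sketch of just that logic, assuming the script behaves as its comments describe. The filter_topics helper and the sample topic names are made up for illustration and are not part of Dave's script, which works line by line and also drops the non-Topic lines it sees while in state 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the topics
# that survive the filter, using the same three states as
# $in_adult_content.
#   0 - before the first Adult topic (keep everything)
#   1 - inside the Adult run         (drop everything)
#   2 - after the Adult run          (keep everything, stop checking)
sub filter_topics {
    my @topics = @_;
    my $state  = 0;
    my @kept;
    for my $topic (@topics) {
        if ($state == 0) {
            # First topic mentioning Adult opens the run
            $state = 1 if $topic =~ /Adult/;
        } elsif ($state == 1) {
            # First topic not mentioning Adult closes the run for good
            $state = 2 if $topic !~ /Adult/;
        }
        push @kept, $topic unless $state == 1;
    }
    return @kept;
}

# Made-up topic names, for illustration only
my @sample = (
    'Top/Arts',
    'Top/Adult',
    'Top/Adult/Arts',
    'Top/Business',
    'Top/Kids/Adult_Ed',
);
print join("\n", filter_topics(@sample)), "\n";
```

On the sample list this keeps Top/Arts, Top/Business and Top/Kids/Adult_Ed. Once the state reaches 2 nothing further is dropped, even if a later topic happens to contain "Adult", which mirrors the script switching $delete_adult_content off after the run ends.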
Received on Tuesday, 27 March 2001 13:12:09 UTC