- From: Danny Ayers <danny@panlanka.net>
- Date: Wed, 28 Mar 2001 00:09:03 +0600
- To: "Dave Beckett" <dave.beckett@bristol.ac.uk>, <www-rdf-interest@w3.org>
<- However, the resulting files usually aren't legal Unicode, and thus
<- not legal XML, so your XML/RDF parser will probably crash and burn
<- on the output anyway, if it doesn't get blown away by memory
<- leaks/growth first.
This really seems like a productive area ;-)
<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)
Damn fine idea. I don't speak Perl, though: what's going on with the 3 values?
Cheers,
Danny.
---
Danny Ayers
http://www.isacat.net
<- -----Original Message-----
<- From: Dave Beckett [mailto:dave.beckett@bristol.ac.uk]
<- Sent: 27 March 2001 23:43
<- To: www-rdf-interest@w3.org
<- Cc: Danny Ayers
<- Subject: Re: Java DMOZ cleaner
<-
<-
<- >>>Danny Ayers said:
<- > I've put together a little utility for making the (unzipped) DMOZ dumps
<- > readable ...
<-
<- This inspires me to publish the long-sitting-on-the-shelf Perl script
<- based on an awk or sed script from Sergey Melnik. I used it last
<- year to clean DMOZ dumps (content.rdf.u8). The program works on
<- all the data without sucking up all your memory.
<-
<- However, the resulting files usually aren't legal Unicode, and thus
<- not legal XML, so your XML/RDF parser will probably crash and burn
<- on the output anyway, if it doesn't get blown away by memory
<- leaks/growth first.
<-
<- Small enough to enclose below (also deletes Adult area for less
<- embarrassing demos!)
<-
<- Dave
<-
<- ----------------------------------------------------------------------
<- #!/usr/bin/perl
<- #
<- # Convert DMOZ content.rdf.gz data dump into legal RDF
<- # (and optionally delete Adult content)
<- #
<- # Copyright 2000 Dave Beckett, ILRT, University of Bristol
<- # http://purl.org/net/dajobe/
<- #
<- # USAGE:
<- # gunzip -d <content.txt.gz | ./content.perl >content.rdf
<- #
<-
<- my $delete_adult_content=1;
<-
<-
<- my $in_body=0;
<-
<- # Three values:
<- # 0 - before first Adult topic
<- # 1 - during Adult topics
<- # 2 - afterwards
<- my $in_adult_content=0;
<-
<- while(<>) {
<-
<- if (/xml version=/) {
<-   $_ .= qq{<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http://dmoz.org/">]>\n};
<-   $in_body=1;
<- }
<-
<- next unless $in_body;
<-
<- if ($delete_adult_content &&
<-     m%<Topic .*="([^"]+)">%) {
<-   my $topic=$1;
<-   if ($in_adult_content == 0) {
<-     $in_adult_content = 1 if $topic =~ /Adult/;
<-   } elsif ($in_adult_content == 1) {
<-     if ($topic !~ /Adult/) {
<-       $in_adult_content = 2;
<-       $delete_adult_content = 0; # optimisation to prevent extra match
<-     }
<-   }
<- }
<- next if ($delete_adult_content && $in_adult_content == 1);
<-
<-
<- s% about=% r:about=%;
<- s%r:id=%r:ID=%;
<- s%rdf"%rdf/"%;
<- s%TR/RDF/%1999/02/22-rdf-syntax-ns#%;
<- s%<RDF %<r:RDF %;
<- s%</RDF%</r:RDF%;
<-
<- s%r:ID="Top"%r:ID="\&dmoz;"%;
<- s%(r:ID=")(Top/)(.*)%$1\&dmoz;$3%;
<- s%(r:ID=")(.*:Top)(.*)%$1\&dmoz;$3%;
<-
<- # Quote spaces in URLs (WRONG) correctly
<- # s/resource="([^"]+)"/my $url=$1; $url=~s, ,%20,g; qq{resource="$url"}/e;
<-
<- # 1) Quote high-space char (0xA0, 0240 octal) in URLs (WRONG) correctly
<- # 2) Remove all text after multiple spaces (WRONG) in URLs - like this:
<- #   <link r:resource="http://www.unravel.org/tinyspark/ditd/
<- #   Written by: Björk, Sjón and LvT
<- #   Performed by: Björk"/>
<- # Use $attr, not $1, in the result: the inner substitutions reset $1.
<-
<- s/(about|resource)="([^"]+)"/my($attr,$url)=($1,$2);
<-   $url=~s,\240,\%A0,g; $url=~s, +.*$,,g; qq{$attr="$url"}/e;
<-
<- # Remove every ^N (0x0E) in content
<- s/\016//g;
<-
<- print;
<- }
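
On the question about the three values: the flag appears to work as a small state machine, so that only one contiguous run of Adult topics is dropped and later topics that merely mention "Adult" are kept. The sketch below isolates that logic on a list of topic ids rather than on lines of the dump. The helper name filter_topics and the sample ids are made up for illustration, and it assumes, as the script seems to, that the Adult topics sit together in one run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Returns the topic ids that survive, using the same three states:
#   0 - before the first Adult topic
#   1 - inside the Adult run (topics are dropped)
#   2 - after the Adult run (nothing more is dropped)
sub filter_topics {
    my @topics = @_;
    my $state  = 0;
    my @kept;
    for my $topic (@topics) {
        if ($state == 0) {
            $state = 1 if $topic =~ /Adult/;
        } elsif ($state == 1) {
            $state = 2 if $topic !~ /Adult/;
        }
        push @kept, $topic unless $state == 1;
    }
    return @kept;
}

# Made-up sample ids, for illustration only.
my @sample = ('Top', 'Top/Adult', 'Top/Adult/Arts',
              'Top/Arts', 'Top/Arts/Adult_Education');
print join("\n", filter_topics(@sample)), "\n";
# prints:
#   Top
#   Top/Arts
#   Top/Arts/Adult_Education
```

The last sample id shows why a plain yes/no flag would not be enough: once the state reaches 2, a later topic containing "Adult" no longer switches deletion back on.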
<-
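
The URL clean-up near the end of the script can be tried on its own. This is a minimal sketch of the same two repairs (escape the 0xA0 byte, cut the value at the first run of spaces), wrapped in a hypothetical helper named clean_url_attrs. The example.org URLs are placeholders, not data from the dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Repairs the first about="..." or resource="..." value on a line:
# each 0xA0 byte becomes %A0, and everything from the first space
# to the end of the value is removed.
sub clean_url_attrs {
    my ($line) = @_;
    $line =~ s{(about|resource)="([^"]+)"}{
        my ($attr, $url) = ($1, $2);
        $url =~ s/\xA0/%A0/g;
        $url =~ s/ +.*$//;
        qq{$attr="$url"};
    }e;
    return $line;
}

print clean_url_attrs(qq{<link r:resource="http://example.org/a b c"/>\n});
# prints: <link r:resource="http://example.org/a"/>
```

Note that the captured attribute name is copied into $attr before the inner substitutions run, because a successful inner match resets $1.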
Received on Tuesday, 27 March 2001 13:12:09 UTC