- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sun, 03 Jan 2010 05:36:38 +0100
- To: www-archive@w3.org
Hi, Thanks for your work on MediaWiki::DumpFile::SQL. Unfortunately the module is only usable for parsing the table creation statement, when I try to parse "dewiki-20091124-category.sql" (5MB) it sits there for some minutes on the first $sql->next. I am wondering whether it would make sense to adopt a different approach to parsing the INSERT statement, namely create a regular expression from the schema and use that to match. I've been toying around with this and came up with the following script. It parses the page table, uses named captures for convenient key access and a tied hash to do the unescaping. It's by no means perfect, but a good illustration. It prints the titles of each page in under a minute (250 MB SQL dump), which is good enough for my purposes. I'll probably expand this a bit when I find the time, let me know if you might be interested in this approach, I'd then let you know if I do. package MyTie; use base qw/Tie::Hash::NamedCapture/; sub TIEHASH { bless { _ => $_[1] }, $_[0]; } sub FETCH { my $super = $_[0]->SUPER::FETCH($_[1]); return $super unless $_[0]->{_}{$_[1]}; $super = "$super"; # Some escaping, not sure if that's sufficient $super =~ s/^'|'$//g; $super =~ s/\\(.)/\1/g; return $super; } package main; use strict; use warnings; use MediaWiki::DumpFile::SQL; use IO::File; my $file = 'dewiki-20091208-page.sql'; # Use MediaWiki::DumpFile::SQL to get the schema my $sql = MediaWiki::DumpFile::SQL->new($file); my @schema = $sql->schema; my $name = $sql->table_name; my %type2re = ( int => qr/NULL|(\d+)/, bigint => qr/NULL|(\d+)/, tinyint => qr/NULL|(\d+)/, timestamp => qr/NULL|(\d+)/, double => qr/NULL|(\d+\.\d+)/, varbinary => qr/NULL|'(.*?)'/, tinyblob => qr/NULL|'(.*?)'/, blob => qr/NULL|'(.*?)'/, char => qr/NULL|'(.*?)'/, varchar => qr/NULL|'(.*?)'/, enum => qr/NULL|'(.*?)'/, ); my %unescape = map { $_ => 1 } qw/varbinary tinyblob blob char varchar enum/; my $f = IO::File->new('<' . $file); my $b = ""; my $seen_insert = 0; # Create regex for parsing my $re = join ",", map { "(?<$_->[0]>" . $type2re{$_->[1]} . ')' } @schema; # Keys where the value needs unescaping my %unkeys = map { $unescape{$_->[1]} ? ($_->[0] => 1) : () } @schema; # Read and parse until we have all parsable entries while (!$f->eof) { my $tmp; # Read in 1MB blocks $f->read($tmp, 1024*1024); die unless defined $tmp; $b .= $tmp; # Skip to the INSERT statemement if (!$seen_insert and $b =~ s/.*?INSERT INTO .*? VALUES //s) { $seen_insert = 1; } next unless $seen_insert; # Use /c so memorize the position after match failure while ($b =~ /\($re\)[,;]/gc) { tie %+, "MyTie", \%unkeys; printf "%s\n", $+{page_title}; } # Keep only the remainder (TODO: is the length right?) substr $b, 0, pos($b), ''; } Feel free to adopt or disregard this as you see fit. (I've copied the www-archive mailing list which publically archives this message; should you respond and copy the list, you'll be asked to con- firm that you are okay with having your reply archived; feel free not to copy it; archives at http://lists.w3.org/Archives/Public/www-archive/). regards, -- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 3 January 2010 04:37:10 UTC