Alternate parsing approach for MediaWiki::DumpFile::SQL from Bjoern Hoehrmann on 2010-01-03 (www-archive@w3.org from January 2010)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 03 Jan 2010 05:36:38 +0100
To: www-archive@w3.org
Message-ID: <uk70k5hga1otmd9ac5sjltpcid1u87rnj0@hive.bjoern.hoehrmann.de>
Hi,

  Thanks for your work on MediaWiki::DumpFile::SQL. Unfortunately the
module is only usable for parsing the table creation statement, when I
try to parse "dewiki-20091124-category.sql" (5MB) it sits there for some
minutes on the first $sql->next.

I am wondering whether it would make sense to adopt a different approach
to parsing the INSERT statement, namely create a regular expression from
the schema and use that to match. I've been toying around with this and
came up with the following script. It parses the page table, uses named
captures for convenient key access and a tied hash to do the unescaping.

It's by no means perfect, but a good illustration. It prints the titles
of each page in under a minute (250 MB SQL dump), which is good enough
for my purposes. I'll probably expand this a bit when I find the time,
let me know if you might be interested in this approach, I'd then let
you know if I do.

  package MyTie;
  use base qw/Tie::Hash::NamedCapture/;
  
  sub TIEHASH {
    bless { _ => $_[1] }, $_[0];
  }
  
  sub FETCH {
    my $super = $_[0]->SUPER::FETCH($_[1]);
    return $super unless $_[0]->{_}{$_[1]};
    $super = "$super";

    # Some escaping, not sure if that's sufficient
    $super =~ s/^'|'$//g;
    $super =~ s/\\(.)/\1/g;
    return $super;
  }
  
  package main;
  use strict;
  use warnings;
  use MediaWiki::DumpFile::SQL;
  use IO::File;
  
  my $file = 'dewiki-20091208-page.sql';

  # Use MediaWiki::DumpFile::SQL to get the schema
  my $sql = MediaWiki::DumpFile::SQL->new($file);
    
  my @schema = $sql->schema;
  my $name = $sql->table_name;
  
  my %type2re = (
    int       => qr/NULL|(\d+)/,
    bigint    => qr/NULL|(\d+)/,
    tinyint   => qr/NULL|(\d+)/,
    timestamp => qr/NULL|(\d+)/,
    double    => qr/NULL|(\d+\.\d+)/,
    varbinary => qr/NULL|'(.*?)'/,
    tinyblob  => qr/NULL|'(.*?)'/,
    blob      => qr/NULL|'(.*?)'/,
    char      => qr/NULL|'(.*?)'/,
    varchar   => qr/NULL|'(.*?)'/,
    enum      => qr/NULL|'(.*?)'/,
  );
  
  my %unescape = map { $_ => 1 }
    qw/varbinary tinyblob blob char varchar enum/;
  
  my $f = IO::File->new('<' . $file);
  my $b = "";
  my $seen_insert = 0;

  # Create regex for parsing
  my $re = join ",", map
    { "(?<$_->[0]>" . $type2re{$_->[1]} . ')' } @schema;

  # Keys where the value needs unescaping
  my %unkeys = map { $unescape{$_->[1]} ? ($_->[0] => 1) : () } @schema;
  
  # Read and parse until we have all parsable entries
  while (!$f->eof) {
    my $tmp;
  
    # Read in 1MB blocks
    $f->read($tmp, 1024*1024);
  
    die unless defined $tmp;
  
    $b .= $tmp;

    # Skip to the INSERT statemement  
    if (!$seen_insert and $b =~ s/.*?INSERT INTO .*? VALUES //s) {
      $seen_insert = 1;
    }
  
    next unless $seen_insert;
  
    # Use /c so memorize the position after match failure
    while ($b =~ /\($re\)[,;]/gc) {
      tie %+, "MyTie", \%unkeys;
      printf "%s\n", $+{page_title};
    }

    # Keep only the remainder (TODO: is the length right?)  
    substr $b, 0, pos($b), '';
  }

Feel free to adopt or disregard this as you see fit.

(I've copied the www-archive mailing list which publically archives this
message; should you respond and copy the list, you'll be asked to con-
firm that you are okay with having your reply archived; feel free not to
copy it; archives at http://lists.w3.org/Archives/Public/www-archive/).

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 3 January 2010 04:37:10 UTC