Kammquoting, Generators, Iterators, and Lazy Lists for Perl 5 from Bjoern Hoehrmann on 2013-04-04 (www-archive@w3.org from April 2013)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 04 Apr 2013 03:36:43 +0200
To: www-archive@w3.org
Message-ID: <vpbpl81n38eum72umctm9jldquk5tpk1o8@hive.bjoern.hoehrmann.de>
So,

  Specifying rules for defects in emails is a bit difficult without a
proper query language. A simple problem is Kammquoting, what you get
when a mail client wraps lines after quoting them without taking into
account that the lines have been quoted, so you get something like

  > The quick brown fox jumps
  over the lazy dog.

Identifying kammquoted tokens would, one way or another, identify the
wrapping point right after "jumps", which is the last word on a line:

  ->order_by(sub { $_->{line} })
  ->then_by(sub { $_->{column} })
  ->stack_by(sub { $_->{line} })
  ->select(sub { $_->last() })

Here, stack_by groups adjacent elements if they share a certain key,
complementary to group_by which groups elements that share a key, but
ignoring whether are adjacent to one another.

The simple algorithm I use for a first order approximation of what is
quoted text ignores ">" tokens and white space, so to it the above is
pretty much the same as the properly quoted and wrapped alternative.
With the example above, the whole text should appear to be quoted, in-
cluding "jumps, so to find "jumps" we can add

  ->where(sub { $_->{quoted} })

Further, "jumps" appears on a line with a proper quote prefix, so

  ->where(sub { $_->{starts_with_gt} })

Taken together the code would find tokens that appear last on a line
provided they appear to be quoted and the line starts with ">". Then
kammquoted tokens would appear immediately after them, so for each
token like "jumps" in the example, we would look at the

  ->following

elements and look for other elements that have certain properties with-
out interruptions, specifically, elements where the line does not start
with a proper quote prefix, i.e.

  ->take_while(sub { !$_->{starts_with_gt} })

Also, the elements would appear to be quoted, so we can add

  ->take_while(sub { $_->{quoted} })

We also know that in the parent mail, the whole text "The quick brown
fox jumps over the lazy dog" would appear on a single line, so for any
word like "jumps", $r, and any following word $_, we can add

  ->take_while(sub { $_->parent->{line} == $r->parent->{line} })

And if "over" is a kammquoted token, it should come right after "jumps"
in both the parent email and the child email, which in my data model is

  ->take_while(sub {
    $_->parent->{index} - $_->{index} ==
    $r->parent->{index} - $r->{index}
  })

Where index is the index of the token in the list of tokens that make
up the email body (and we could have sorted on that instead of doing a
multi-key sort based on line and column number). Actually, it's a little
bit more complicated than this because I maintain two "views" for each
email, one with all tokens, and one without ">" and whitespace, but that
can be ignored here... I name properties of the filtered view "_f", so

  ->order_by(sub { $_->{line} })
  ->then_by(sub { $_->{column} })
  ->stack_by(sub { $_->{line} })
  ->select(sub { $_->last() })
  ->where(sub { $_->{quoted} })
  ->where(sub { $_->{starts_with_gt} })
  ->select_many(sub {
    my $r = $_;
    $r->following_f
      ->take_while(sub { !$_->{starts_with_gt} })
      ->take_while(sub { $_->{quoted} })
      ->take_while(sub { $_->parent->{line} == $r->parent->{line} })
      ->take_while(sub {
        $_->parent->{index_f} - $_->{index_f} ==
        $r->parent->{index_f} - $r->{index_f}
      })
    });

Is closer to what I actually use at the moment. The `take_while` and
`where` conditions could obviously be combined into a single invokation,
and having to repeat `sub` and other things all the time is not pretty,
and actually identifying the "last token" is not necessary, because the
test for `starts_with_gt` takes care of that, so this is a performance
optimisation, but in any case, this works fine. Actually, the thing that
bugs me most is the `my $r = $_;` line, and that the select_many code is
nested...

Now, theoretically this could be implemented with eager lists, `where`
is a `grep` and select is a `map`, and so on, but it would be very slow,
hard to optimise, hard to automatically execute in parallel, and so on;
so this needs lazy lists, and a decent set of primitives to work on them
like the multi-key `->order_by(...)->then_by(...)` sort. It's actually a
bit funny really, a work-around for the lack of order_by in Perl even
has a name, <http://en.wikipedia.org/wiki/Schwartzian_transform>. That's
nice, but I would think, when you write such code for the third time,
you should actually notice that something is amiss and factor out such
code... Okay, <http://search.cpan.org/dist/List-OrderBy/> for the eager
case (and I note there are a couple of similar modules that offer this).

Now, Perl 5 does not support lazy lists out of the box, and while there
are a couple of modules that implement that as a proof-of-concept, none
of them seem to be well maintained and widely used, and they certainly
do not offer many primitives to work with them. Actually, it's a bit sad
that even for much more trivial tasks you already have to battle with a
couple of modules, like List::Util has `sum` and List::MoreUtils has the
`zip` function, and for arithmetic means you have to use yet another one
and for something like `stack_by` there probably is no module...

So, I rolled my own, <https://gist.github.com/5285994>. It's modeled on
the IEnumerable interface in .NET, as far as method names go. Now, any
basic generator / iterator protocol that's bolted upon a language needs
to address some fundamental issues, like how an iterator indicates that
it cannot provide more elements, and, relatedly, what iterators are able
to enumerate.

I have experimented with a couple of options, and with Perl I have found
that throwing an exception (`StopIteration`) to indicate the end of what
amounts to a list is too confusing and hard to handle, and returning a
special value, like `undef`, is too limiting, making returning "boxed"
values the only option, which luckily is quite easy in Perl as any value
can be turned into a reference to that value.

Another important consideration was that I wanted to support being able
to retrieve the "key" for operations like `group_by` and `stack_by`, and
control when `then_by` is available, which pretty much requires to store
"instance data", and in Perl you cannot, as far as I know, store data on
a code reference; you can store data "in" the code reference, but that
would necessitate calling the code reference to get data out of it...

Experimenting with various options I also ended up writing code that did
not work as expected; I basically did something like

  my $i1 = $enum->take(10);
  my $i2 = $enum->take(10);

Is `$i1` the same sequence as `$i2`? I expected that it is while my code
made those two different (`$i2` would get "the next 10 elements" rather
than the same elements as `$i1`). So being destructive by default struck
me as a bad option. Now there are cases where I actually want to be able
to use the state-modified iterator, like

  my $x = $enum->take_while(...);
  my $y = ... elements after $x ...

and I have yet to consider how to make a suitable API for that, but the
code referenced above implements basically this protocol:

  * generators are ordinary blessed hash references
  * calling ->new on a generator gives an iterator
  * iterators return references to values
  * an exhausted iterator returns a non-reference

The last rule has an implied "once"; making sure that an exhausted iter-
ator *always* returns a non-reference is a diminishing-returns problem.
Especially when you have "nested" ones, it is easy to return a non-re-
ference when the underlying iterator returns a non-reference, but doing
so at any point after the underlying iterator returned a non-reference
typically requires storing some state information, and that is easy to
get wrong... So, as an example, skip_while currently looks like this:

  sub skip_while {
    my ($self, $predicate) = @_;
    return _create sub {
      return sub {
        state $base = $self->new();
        state $skip = 1;
        while ($skip) {
          my $item = $base->();
          return unless ref $item;
          local $_ = $$item;
          $skip &= !! $predicate->($_);
          return $item unless $skip;
        }
        return $base->();
      };
    }
  }

Here the _create function constructs a new blessed hash reference that
can create a new iterator, the inner sub, when requested, and the inner
sub simply returns the "next element" when called. Accordingly, the
inner sub creates a new iterator for the underlying generator when it's
called for the first time and stores it (`state` runs the initializing
expression only "once"). Accordingly, a generator natural numbers is:

  sub integers {
    my ($class, $from) = @_;
    return _create sub {
      return sub {
        state $counter = $from // 0;
        return \($counter++);
      };
    };
  }

If you pass a defined value as first argument, say `0`, this will ge-
nerate the numbers from `0` to ... infinity if you ignore what Perl does
when $counter gets really big...

In closing, .oO{ In `skip_while`, as given above, should it be `return
$$item` rather than `return $item`? }
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Thursday, 4 April 2013 01:37:11 UTC