- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 04 Apr 2013 03:36:43 +0200
- To: www-archive@w3.org
So, Specifying rules for defects in emails is a bit difficult without a proper query language. A simple problem is Kammquoting, what you get when a mail client wraps lines after quoting them without taking into account that the lines have been quoted, so you get something like > The quick brown fox jumps over the lazy dog. Identifying kammquoted tokens would, one way or another, identify the wrapping point right after "jumps", which is the last word on a line: ->order_by(sub { $_->{line} }) ->then_by(sub { $_->{column} }) ->stack_by(sub { $_->{line} }) ->select(sub { $_->last() }) Here, stack_by groups adjacent elements if they share a certain key, complementary to group_by which groups elements that share a key, but ignoring whether are adjacent to one another. The simple algorithm I use for a first order approximation of what is quoted text ignores ">" tokens and white space, so to it the above is pretty much the same as the properly quoted and wrapped alternative. With the example above, the whole text should appear to be quoted, in- cluding "jumps, so to find "jumps" we can add ->where(sub { $_->{quoted} }) Further, "jumps" appears on a line with a proper quote prefix, so ->where(sub { $_->{starts_with_gt} }) Taken together the code would find tokens that appear last on a line provided they appear to be quoted and the line starts with ">". Then kammquoted tokens would appear immediately after them, so for each token like "jumps" in the example, we would look at the ->following elements and look for other elements that have certain properties with- out interruptions, specifically, elements where the line does not start with a proper quote prefix, i.e. ->take_while(sub { !$_->{starts_with_gt} }) Also, the elements would appear to be quoted, so we can add ->take_while(sub { $_->{quoted} }) We also know that in the parent mail, the whole text "The quick brown fox jumps over the lazy dog" would appear on a single line, so for any word like "jumps", $r, and any following word $_, we can add ->take_while(sub { $_->parent->{line} == $r->parent->{line} }) And if "over" is a kammquoted token, it should come right after "jumps" in both the parent email and the child email, which in my data model is ->take_while(sub { $_->parent->{index} - $_->{index} == $r->parent->{index} - $r->{index} }) Where index is the index of the token in the list of tokens that make up the email body (and we could have sorted on that instead of doing a multi-key sort based on line and column number). Actually, it's a little bit more complicated than this because I maintain two "views" for each email, one with all tokens, and one without ">" and whitespace, but that can be ignored here... I name properties of the filtered view "_f", so ->order_by(sub { $_->{line} }) ->then_by(sub { $_->{column} }) ->stack_by(sub { $_->{line} }) ->select(sub { $_->last() }) ->where(sub { $_->{quoted} }) ->where(sub { $_->{starts_with_gt} }) ->select_many(sub { my $r = $_; $r->following_f ->take_while(sub { !$_->{starts_with_gt} }) ->take_while(sub { $_->{quoted} }) ->take_while(sub { $_->parent->{line} == $r->parent->{line} }) ->take_while(sub { $_->parent->{index_f} - $_->{index_f} == $r->parent->{index_f} - $r->{index_f} }) }); Is closer to what I actually use at the moment. The `take_while` and `where` conditions could obviously be combined into a single invokation, and having to repeat `sub` and other things all the time is not pretty, and actually identifying the "last token" is not necessary, because the test for `starts_with_gt` takes care of that, so this is a performance optimisation, but in any case, this works fine. Actually, the thing that bugs me most is the `my $r = $_;` line, and that the select_many code is nested... Now, theoretically this could be implemented with eager lists, `where` is a `grep` and select is a `map`, and so on, but it would be very slow, hard to optimise, hard to automatically execute in parallel, and so on; so this needs lazy lists, and a decent set of primitives to work on them like the multi-key `->order_by(...)->then_by(...)` sort. It's actually a bit funny really, a work-around for the lack of order_by in Perl even has a name, <http://en.wikipedia.org/wiki/Schwartzian_transform>. That's nice, but I would think, when you write such code for the third time, you should actually notice that something is amiss and factor out such code... Okay, <http://search.cpan.org/dist/List-OrderBy/> for the eager case (and I note there are a couple of similar modules that offer this). Now, Perl 5 does not support lazy lists out of the box, and while there are a couple of modules that implement that as a proof-of-concept, none of them seem to be well maintained and widely used, and they certainly do not offer many primitives to work with them. Actually, it's a bit sad that even for much more trivial tasks you already have to battle with a couple of modules, like List::Util has `sum` and List::MoreUtils has the `zip` function, and for arithmetic means you have to use yet another one and for something like `stack_by` there probably is no module... So, I rolled my own, <https://gist.github.com/5285994>. It's modeled on the IEnumerable interface in .NET, as far as method names go. Now, any basic generator / iterator protocol that's bolted upon a language needs to address some fundamental issues, like how an iterator indicates that it cannot provide more elements, and, relatedly, what iterators are able to enumerate. I have experimented with a couple of options, and with Perl I have found that throwing an exception (`StopIteration`) to indicate the end of what amounts to a list is too confusing and hard to handle, and returning a special value, like `undef`, is too limiting, making returning "boxed" values the only option, which luckily is quite easy in Perl as any value can be turned into a reference to that value. Another important consideration was that I wanted to support being able to retrieve the "key" for operations like `group_by` and `stack_by`, and control when `then_by` is available, which pretty much requires to store "instance data", and in Perl you cannot, as far as I know, store data on a code reference; you can store data "in" the code reference, but that would necessitate calling the code reference to get data out of it... Experimenting with various options I also ended up writing code that did not work as expected; I basically did something like my $i1 = $enum->take(10); my $i2 = $enum->take(10); Is `$i1` the same sequence as `$i2`? I expected that it is while my code made those two different (`$i2` would get "the next 10 elements" rather than the same elements as `$i1`). So being destructive by default struck me as a bad option. Now there are cases where I actually want to be able to use the state-modified iterator, like my $x = $enum->take_while(...); my $y = ... elements after $x ... and I have yet to consider how to make a suitable API for that, but the code referenced above implements basically this protocol: * generators are ordinary blessed hash references * calling ->new on a generator gives an iterator * iterators return references to values * an exhausted iterator returns a non-reference The last rule has an implied "once"; making sure that an exhausted iter- ator *always* returns a non-reference is a diminishing-returns problem. Especially when you have "nested" ones, it is easy to return a non-re- ference when the underlying iterator returns a non-reference, but doing so at any point after the underlying iterator returned a non-reference typically requires storing some state information, and that is easy to get wrong... So, as an example, skip_while currently looks like this: sub skip_while { my ($self, $predicate) = @_; return _create sub { return sub { state $base = $self->new(); state $skip = 1; while ($skip) { my $item = $base->(); return unless ref $item; local $_ = $$item; $skip &= !! $predicate->($_); return $item unless $skip; } return $base->(); }; } } Here the _create function constructs a new blessed hash reference that can create a new iterator, the inner sub, when requested, and the inner sub simply returns the "next element" when called. Accordingly, the inner sub creates a new iterator for the underlying generator when it's called for the first time and stores it (`state` runs the initializing expression only "once"). Accordingly, a generator natural numbers is: sub integers { my ($class, $from) = @_; return _create sub { return sub { state $counter = $from // 0; return \($counter++); }; }; } If you pass a defined value as first argument, say `0`, this will ge- nerate the numbers from `0` to ... infinity if you ignore what Perl does when $counter gets really big... In closing, .oO{ In `skip_while`, as given above, should it be `return $$item` rather than `return $item`? } -- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Thursday, 4 April 2013 01:37:11 UTC