Identifying quoted text in e-mails turns out to be easy from Bjoern Hoehrmann on 2013-02-28 (www-archive@w3.org from February 2013)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 28 Feb 2013 03:24:38 +0100
To: www-archive@w3.org
Message-ID: <a8cti85np769cljvmevq6s8l3r4g0djh3m@hive.bjoern.hoehrmann.de>
So,

  I kinda took it for granted that identifying quoted text in e-mails is
a relatively hard problem, given that e-mail clients and list archives
tend to do so very poorly if at all, and considering all the flamewars
on the matter I've seen, and people writing and installing software to
ease the pain a little bit. But at least for the case of a single parent
and with access to that parent's plain text body, it's rather trivial to
come up with something that works almost perfectly.

  https://gist.github.com/5053420

That's a simple Perl script that parses an mbox file, extracts the plain
text from the mails therein, determines the "parent" mail to any mail by
means of the In-Reply-To and References headers, transcodes the body to
UTF-8 as much as possible and splits the text into a list of tokens. The
tokens I am considering are Unicode-aware /\w+/ and any character that is
not matched by that (so, words and punctuation in a broad sense).
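
As a rough illustration of the parent lookup and the tokenization just
described, here is a minimal Python sketch. It is not the Perl script from
the gist: the names `parent_id` and `tokenize` are made up for this
example, and preferring In-Reply-To over the last References entry is an
assumption about what a reasonable lookup order would be.

```python
import re
from email import message_from_string

# Message-IDs look like <local@domain>; this pattern is a simplification.
MSG_ID = re.compile(r"<[^<>\s]+>")

def parent_id(raw_message):
    """Return the Message-ID of the presumed parent, or None."""
    msg = message_from_string(raw_message)
    in_reply_to = MSG_ID.findall(msg.get("In-Reply-To", ""))
    if in_reply_to:
        return in_reply_to[0]
    # Assumption: the last References entry is the direct parent.
    references = MSG_ID.findall(msg.get("References", ""))
    return references[-1] if references else None

def tokenize(text):
    """Split into runs of word characters and single non-word characters.

    Python 3 str patterns are Unicode-aware, so \\w covers letters such
    as "ö". White space comes out as tokens too and is filtered later.
    """
    return re.findall(r"\w+|\W", text)
```

For example, `tokenize("Hi. X")` yields the four tokens "Hi", ".", " "
and "X".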

In order to determine whether a given token is "quoted" from the parent,
the tokens from child and parent minus white space and ">" are passed to
an `sdiff` function (that determines longest common subsequences), and
as a first approximation it treats any token that the `sdiff` algorithm
considers "unchanged" as quoted text (">" and white space are taken as
original text). Natural language tends to have some words that appear
very frequently, like "and" and "the" in English, and they are a bit of
a problem for the `sdiff` algorithm I am using, so some tokens that are
original text are identified as quoted. To account for that, there is a
second pass that considers whole lines of tokens, and if on a given line
more characters are identified as "original" than "quoted", it marks all
tokens on that line as "original" (unless the line is too short).
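
The two passes can be sketched roughly as follows in Python. This is only
an approximation under stated assumptions: `difflib.SequenceMatcher`
stands in for the `sdiff` function (it is not a plain longest common
subsequence algorithm), the names `mark_quoted` and `second_pass` are
hypothetical, and the `min_chars` cutoff for "too short" is a guess, since
the actual threshold is not stated above.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    return re.findall(r"\w+|\W", text)

def is_markup(token):
    # White space and ">" are left out of the comparison.
    return token.isspace() or token == ">"

def mark_quoted(child_text, parent_text):
    """First pass: flag child tokens that match the parent as quoted."""
    child = tokenize(child_text)
    parent = [t for t in tokenize(parent_text) if not is_markup(t)]
    kept = [i for i, t in enumerate(child) if not is_markup(t)]
    matcher = SequenceMatcher(None, [child[i] for i in kept], parent,
                              autojunk=False)
    quoted = [False] * len(child)  # markup tokens stay "original"
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            quoted[kept[block.a + k]] = True
    return child, quoted

def second_pass(tokens, quoted, min_chars=10):
    """Second pass: reset lines that are mostly original text."""
    result = list(quoted)
    start = 0
    for end in range(len(tokens) + 1):
        if end == len(tokens) or tokens[end] == "\n":
            line = range(start, end)
            q = sum(len(tokens[i]) for i in line if quoted[i])
            o = sum(len(tokens[i]) for i in line if not quoted[i])
            # min_chars is an assumed cutoff for "too short".
            if q + o >= min_chars and o > q:
                for i in line:
                    result[i] = False
            start = end + 1
    return result
```

With a line like "I agree with the plan" where only "the" was flagged in
the first pass, the second pass counts 3 quoted against 18 original
characters and resets the whole line to original.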

I've attached the output from the script when run with the mbox file for
the ietf@ietf.org mailing list for July 2012; for the mails that have a
parent that is within the mbox file this prints the extracted plain text
of the mail as HTML <pre> where the quoted tokens are identified via <i>
elements (a better choice would be <q> but that generates quote marks I
do not want, quote marks that apparently are not copied to the clipboard
if you copy the text with the usual browsers, as it turns out).
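
To make the output format concrete, here is a small Python sketch of how
per-token flags could be turned into such a <pre> block. The function
name `render_pre` is hypothetical and this is not taken from the script;
it only shows the idea of wrapping runs of quoted tokens in <i> elements.

```python
from html import escape
from itertools import groupby

def render_pre(tokens, quoted):
    """Render tokens as an HTML <pre>, with quoted runs inside <i>."""
    parts = []
    # Group consecutive tokens that share the same quoted/original flag.
    for is_quoted, group in groupby(zip(tokens, quoted), key=lambda p: p[1]):
        text = escape("".join(tok for tok, _ in group))
        parts.append(f"<i>{text}</i>" if is_quoted else text)
    return "<pre>" + "".join(parts) + "</pre>"
```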

There is one flaw in the `sdiff` algorithm I am using in that it prefers
early matches over clustered matches. Consider a TOFU mail like this:

  Hi. X ...

  * Max Mustermann wrote:
  > Hi. Y ...

The `sdiff` algorithm will match the "Hi." at the beginning of the mail
with the "Hi." in the parent mail, so this appears as, roughly,

  <quoted>Hi.</quoted> <original>X ...

  * Max Mustermann wrote:</original>
  > <original>Hi.</original> <quoted>Y ...</quoted>

whereas the first "Hi." is original and the second quoted. This might be
easy to fix (among alternatives prefer to make longer quoted sequences),
but I have not thought that through yet. A similar problem is with quote
attribution lines: there are many cases where "On <date> ... On <date>"
has parts incorrectly identified, especially without the second pass as
described above. It would be nice to treat "<date>" as a single token so
that is less likely to happen; it may be that the Regexp::Common modules
can help there. Signatures are also a problem, for similar reasons.
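
One way to experiment with the "prefer longer quoted sequences" idea for
the TOFU case above is a matcher that looks for the longest contiguous
run first. Python's `difflib.SequenceMatcher` works that way (it finds
the longest matching block, then recurses to the left and right of it),
so the sketch below uses it purely as an illustration. It is not the
`sdiff` function the script uses, `quoted_pairs` is a made-up name, and
whether this strategy holds up on real mail is untested here.

```python
import re
from difflib import SequenceMatcher

def words(text):
    # Tokens minus white space and ">", as in the comparison step.
    tokens = re.findall(r"\w+|\W", text)
    return [t for t in tokens if not t.isspace() and t != ">"]

def quoted_pairs(child_text, parent_text):
    """Return (token, is_quoted) pairs for the child's compared tokens."""
    child, parent = words(child_text), words(parent_text)
    matcher = SequenceMatcher(None, child, parent, autojunk=False)
    quoted = [False] * len(child)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            quoted[block.a + k] = True
    return list(zip(child, quoted))

child = ("Hi. Thanks, that works.\n\n"
         "* Max Mustermann wrote:\n"
         "> Hi. Does the patch work?\n")
parent = "Hi. Does the patch work?\n"
pairs = quoted_pairs(child, parent)
```

Here the longest contiguous match is the whole quoted line, so the "Hi."
at the top of the reply stays original and the one after ">" is marked
as quoted.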

There are some other minor problems, but a major benefit in this is that
I can now, because it works well enough, experiment with rules like "if
almost all lines that consist almost entirely of quoted tokens share a
common prefix (like '>') then perhaps all lines with that prefix are in
fact quoted text". Similarly, the quoted text can be dropped in order
to search for signatures without getting too confused by quoted sigs...
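
A rule like the common-prefix one could look roughly like this in Python.
Everything here is a hypothetical sketch: the function names, the
simplification to a single-character prefix, and both thresholds (0.8
for "almost entirely quoted", 0.9 for "almost all lines") are
assumptions chosen for illustration, not values from the script.

```python
from collections import Counter

def infer_quote_prefix(lines, quoted_share,
                       line_threshold=0.8, agreement=0.9):
    """Guess a quote prefix from lines that are mostly quoted tokens.

    quoted_share[i] is the fraction of line i identified as quoted.
    """
    firsts = Counter()
    for line, share in zip(lines, quoted_share):
        stripped = line.lstrip()
        if stripped and share >= line_threshold:
            firsts[stripped[0]] += 1
    if not firsts:
        return None
    prefix, count = firsts.most_common(1)[0]
    # A letter or digit is unlikely to be a quote marker.
    if prefix.isalnum() or count / sum(firsts.values()) < agreement:
        return None
    return prefix

def lines_with_prefix(lines, prefix):
    """Flag every line that starts with the inferred prefix."""
    return [line.lstrip().startswith(prefix) for line in lines]
```

Given lines "> the quick brown fox", "> jumps over the lazy dog", "> ok"
and "My reply." with quoted shares 1.0, 0.9, 0.0 and 0.0, the inferred
prefix is ">" and the short "> ok" line is picked up as quoted as well.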

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments

text/html attachment: ietf_ietf.org.2012-07.html
Received on Thursday, 28 February 2013 02:26:21 UTC