- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 28 Feb 2013 03:24:38 +0100
- To: www-archive@w3.org
- Message-ID: <a8cti85np769cljvmevq6s8l3r4g0djh3m@hive.bjoern.hoehrmann.de>
So, I had more or less taken it for granted that identifying quoted text in e-mails is a relatively hard problem: e-mail clients and list archives tend to do it very poorly, if at all, I have seen plenty of flamewars on the matter, and people write and install software to ease the pain a little. But at least for the case of a single parent, and with access to that parent's plain text body, it is rather trivial to come up with something that works almost perfectly.

  https://gist.github.com/5053420

That is a simple Perl script that parses an mbox file, extracts the plain text from the mails in it, determines the "parent" of each mail by means of the In-Reply-To and References headers, transcodes the body to UTF-8 as far as possible, and splits the text into a list of tokens. The tokens I am considering are matches of a Unicode-aware /\w+/, plus any single character that is not matched by that (so, words and punctuation in a broad sense).

To determine whether a given token is "quoted" from the parent, the tokens from child and parent, minus white space and ">", are passed to an `sdiff` function (which determines longest common subsequences). As a first approximation, any token that the `sdiff` algorithm considers "unchanged" is treated as quoted text, while ">" and white space are treated as original text.

Natural language tends to have some words that appear very frequently, like "and" and "the" in English. They are a bit of a problem for the `sdiff` algorithm I am using, because some tokens that are original text end up identified as quoted. To account for that, there is a second pass that considers whole lines of tokens: if on a given line more characters are identified as "original" than as "quoted", it marks all tokens on that line as "original" (unless the line is too short).
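To make the two passes concrete, here is a minimal Python sketch of the idea. It is not the Perl script from the gist: the names `tokenize`, `lcs_matches`, `mark_quoted` and `second_pass` are hypothetical, a plain dynamic-programming longest common subsequence stands in for `sdiff`, and both the `min_len` threshold and the choice to leave white space and ">" out of the per-line character counts are assumptions, since the description above does not pin them down.

```python
import re

# Words (Unicode-aware \w+), or any single character that is not a word character.
TOKEN_RE = re.compile(r"\w+|[^\w]")

def tokenize(text):
    return TOKEN_RE.findall(text)

def is_significant(token):
    # White space and ">" never take part in the comparison.
    return not token.isspace() and token != ">"

def lcs_matches(a, b):
    """Indices into `a` that belong to one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    matched, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            matched.append(i)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched

def mark_quoted(child_text, parent_text):
    """First pass: a child token is 'quoted' if the LCS pairs it with a parent token."""
    child, parent = tokenize(child_text), tokenize(parent_text)
    child_idx = [i for i, t in enumerate(child) if is_significant(t)]
    parent_sig = [t for t in parent if is_significant(t)]
    hits = lcs_matches([child[i] for i in child_idx], parent_sig)
    quoted = [False] * len(child)
    for k in hits:
        quoted[child_idx[k]] = True
    return child, quoted

def second_pass(tokens, quoted, min_len=10):
    """Second pass: reset a line to 'original' if original characters outnumber quoted ones.

    min_len (assumed value) is the minimum number of counted characters a line
    needs before the rule applies; shorter lines are left untouched.
    """
    result = list(quoted)
    start = 0
    for end in range(len(tokens) + 1):
        if end == len(tokens) or tokens[end] == "\n":
            line = range(start, end)
            q = sum(len(tokens[i]) for i in line if quoted[i])
            o = sum(len(tokens[i]) for i in line
                    if not quoted[i] and is_significant(tokens[i]))
            if q + o >= min_len and o > q:
                for i in line:
                    result[i] = False
            start = end + 1
    return result
```

A possible use would be `tokens, quoted = mark_quoted(child_body, parent_body)` followed by `quoted = second_pass(tokens, quoted)`, after which runs of quoted tokens can be wrapped in whatever markup is wanted. The LCS table is quadratic in the number of tokens, which is acceptable for a sketch but would likely need a smarter diff for very long mails.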
I've attached the output from the script when run with the mbox file for the ietf@ietf.org mailing list for July 2012. For the mails that have a parent within the mbox file, it prints the extracted plain text of the mail as HTML <pre>, where the quoted tokens are identified via <i> elements. (A better choice would be <q>, but that generates quote marks I do not want, and, as it turns out, those quote marks are apparently not copied to the clipboard if you copy the text in the usual browsers.)

There is one flaw in the `sdiff` algorithm I am using, in that it prefers early matches over clustered matches. Consider a TOFU mail like this:

  Hi. X ...

  * Max Mustermann wrote:
  > Hi. Y ...

The `sdiff` algorithm will match the "Hi." at the beginning of the mail with the "Hi." in the parent mail, so this appears as, roughly,

  <quoted>Hi.</quoted> <original>X ... * Max Mustermann wrote:</original>
  > <original>Hi.</original> <quoted>Y ...</quoted>

whereas in fact the first "Hi." is original and the second is quoted. This might be easy to fix (among alternatives, prefer the one that makes longer quoted sequences), but I have not thought that through yet.

A similar problem arises with quote attribution lines: there are many cases where "On <date> ... On <date>" has parts incorrectly identified, especially without the second pass as described above. It would be nice to treat "<date>" as a single token so that this is less likely to happen; it may be that the Regexp::Common modules can help there. Signatures are also a problem, for similar reasons.

There are some other minor problems, but a major benefit of this is that, because it works well enough, I can now experiment with rules like "if almost all lines that consist almost entirely of quoted tokens share a common prefix (like '>'), then perhaps all lines with that prefix are in fact quoted text". Similarly, the quoted text can be dropped in order to search for signatures without getting too confused by quoted sigs...
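The common-prefix rule mentioned in the last paragraph could be tried out along the following lines. This is only a sketch of one possible reading of that rule, not something the script does: the function names, the set of prefix characters in `PREFIX_RE`, and the thresholds `mostly` and `almost_all` are all assumptions, and the per-line quoted fractions are assumed to come from an earlier token-level pass.

```python
import re
from collections import Counter

# Assumed set of quote-prefix characters; real mail may use others.
PREFIX_RE = re.compile(r"^[ \t]*(?:[>|:][ \t]*)+")

def line_prefix(line):
    """Leading quote marker of a line, e.g. '>' or '> >', or '' if there is none."""
    m = PREFIX_RE.match(line)
    return m.group(0).strip() if m else ""

def infer_quote_prefix(lines, quoted_fraction, mostly=0.9, almost_all=0.8):
    """Return the prefix shared by almost all mostly-quoted lines, or None.

    quoted_fraction[i] is the share of characters on lines[i] that an earlier
    pass marked as quoted (a value between 0.0 and 1.0).
    """
    candidates = [line_prefix(l) for l, f in zip(lines, quoted_fraction) if f >= mostly]
    if not candidates:
        return None
    prefix, count = Counter(candidates).most_common(1)[0]
    if prefix and count >= almost_all * len(candidates):
        return prefix
    return None

def quoted_lines(lines, quoted_fraction, mostly=0.9, almost_all=0.8):
    """Per-line verdict: with an inferred prefix, every line carrying it counts as quoted."""
    prefix = infer_quote_prefix(lines, quoted_fraction, mostly, almost_all)
    if prefix is None:
        # No dominant prefix: fall back to the token-level evidence alone.
        return [f >= mostly for f in quoted_fraction]
    return [l.lstrip().startswith(prefix) for l in lines]
```

With this, a line such as "> The and of it" that the token pass judged mostly original would still be treated as quoted once ">" has been established as the prefix used by the other quoted lines.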
regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/html attachment: ietf_ietf.org.2012-07.html
Received on Thursday, 28 February 2013 02:26:21 UTC