Identifying attribution lines and blurps from Bjoern Hoehrmann on 2013-03-05 (www-archive@w3.org from March 2013)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 06 Mar 2013 00:25:36 +0100
To: www-archive@w3.org
Message-ID: <k0vbj8tleu639akmgqadu5frmkmopl3gih@hive.bjoern.hoehrmann.de>
So,

  With http://www.ietf.org/mail-archive/web/rtcweb/current/msg06576.html
it seems I am not the only one feeling that e-mail is going way down; at
http://www.ietf.org/mail-archive/web/rtcweb/current/msg06572.html is an
example. There is an increasingly large body of evidence that people do
indeed type [initials] markers manually...

There are a couple of things that I still need to identify automatically
and this time around I've looked at quote attribution lines (like "You
wrote:"). I do not particularly wish to hard code words like "wrote" in
identification code if it can be avoided (or perhaps be left to a higher
level machine learning component), so I used other characteristics.

The general idea is to identify a region that stands out as attribution
line because it has many of the characteristics of one. For instance, an
attribution line typically ends in a colon ':', so I mark colons at the
end of a line. They often include the name of "the author" of the parent
mail. That's a bit more tricky (which header do you use, do you use the
phrase from the From header only, or also comments, do you retain the
ordering of names or do you turn "First Last" into "Last, First", ...).
For now I've used an approximation. The e-mail address of the author of
the parent mail is also a good indicator.
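
A minimal sketch of collecting these hints for a single line, written in
Python for illustration rather than in the Perl used for the actual code.
The function name `attribution_features` and the crude name matching
(any word of the display name, in any order) are assumptions, not the
approximation referred to above.

```python
import re

def attribution_features(line, author_name, author_email):
    """Return the set of attribution-line hints found on one line."""
    hints = set()
    # Attribution lines typically end in a colon.
    if line.rstrip().endswith(":"):
        hints.add("colon")
    lowered = line.lower()
    # Crude name check: any word of the display name, in any order,
    # so "First Last" and "Last, First" are both covered.
    words = [w for w in re.findall(r"\w+", author_name.lower()) if len(w) > 1]
    if any(re.search(r"\b" + re.escape(w) + r"\b", lowered) for w in words):
        hints.add("name")
    # The parent author's e-mail address appearing literally.
    if author_email.lower() in lowered:
        hints.add("email")
    return hints
```

A line that collects several hints at once stands out as a candidate; a
single hint (a trailing colon, say) is weak evidence on its own.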

Interestingly, time and date information is still quite common there, so
as a start I take the Date header from the parent mail and split it into
numeric components and search for those literally, and also after I've
adjusted them for the time zone of the child e-mail's Date header. This
seems to cover many, but not all such components, and I have yet to find
out why that is. It's conceivable for instance that the date in
attribution lines is given in GMT while neither parent nor child e-mail
uses GMT, but there may also be a bug in my time zone code. Sadly I
could not find a suitable module that adjusts the time zone and gives me
the components without my code looking fragile (one should keep in mind
that time and date modules tend to do crazy things like reading
$ENV{TZ}, which I have not set, but then my code might break on the next
system; Date::Parse, for a start, apparently accepts an optional time
zone, but it seems to be ignored when the date string includes a zone,
and the documentation isn't very elaborate on that point)...
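
As an illustration of the idea only, here is a sketch in Python (not the
Perl code discussed here) that splits a parent Date header into numeric
components, once as written and once shifted into the UTC offset of the
child's Date header. The function name `date_numbers` is an assumption,
and the child date in the usage note is made up.

```python
from email.utils import parsedate_to_datetime

def date_numbers(parent_date, child_date):
    """Collect numeric parts of the parent Date header, both as written
    and shifted into the child header's UTC offset."""
    parent = parsedate_to_datetime(parent_date)
    child = parsedate_to_datetime(child_date)
    candidates = [parent]
    # Only shift when both headers carry an explicit offset; with a naive
    # datetime, astimezone() would fall back to the system time zone.
    if parent.tzinfo is not None and child.tzinfo is not None:
        candidates.append(parent.astimezone(child.tzinfo))
    numbers = set()
    for dt in candidates:
        numbers.update((dt.year, dt.month, dt.day,
                        dt.hour, dt.minute, dt.second))
    return numbers
```

For a parent dated "Wed, 06 Mar 2013 00:25:36 +0100" and a child dated
"Tue, 05 Mar 2013 18:40:00 -0500", the result contains 6 and 0 from the
original header as well as 5 and 18 from the shifted time. A date given
in GMT in the attribution line would need a third candidate.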

So I basically look for these numbers in the raw text and then map the
match back to my list of tokens (using a suitable index that maps any
offset in the text to the corresponding token, and the match positions
that Perl offers through the `@+` and `@-` lists). In the attached HTML
document the matches are in <b> with a green background colour. It seems
to work fairly well. There would need to be additional logic (like to
ignore these date components in quoted text, and some false positives
could be removed, take telephone numbers for instance), but this seems
to be enough to specify some threshold for green-ness on a line and e.g.
print the suspected attribution lines out, and filter out mails that do
not have an identified attribution line, to come up with more rules.
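
The offset-to-token mapping and the per-line threshold might be sketched
as follows, in Python instead of Perl (match offsets come from
`m.start()` rather than `@-`). The tokenizer, the names `mark_numbers`
and `suspected_lines`, and the default threshold of 3 are assumptions
for illustration; `numbers` is a set of integers such as date components.

```python
import bisect
import re

def tokenize(text):
    """Stand-in tokenizer: words and single punctuation marks."""
    return [(m.start(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def mark_numbers(text, numbers):
    """Return the tokens and the indexes of tokens hit by a number match."""
    tokens = tokenize(text)
    if not numbers:
        return tokens, set()
    starts = [start for start, _ in tokens]
    alternation = "|".join(str(n) for n in sorted(numbers, reverse=True))
    # Allow one leading zero ("06" for 6), but no adjacent digits.
    pattern = re.compile(r"(?<!\d)0?(?:%s)(?!\d)" % alternation)
    marked = set()
    for m in pattern.finditer(text):
        # The last token starting at or before the match offset.
        marked.add(bisect.bisect_right(starts, m.start()) - 1)
    return tokens, marked

def suspected_lines(text, numbers, threshold=3):
    """Return lines with at least `threshold` marked tokens."""
    tokens, marked = mark_numbers(text, numbers)
    lines = text.splitlines(keepends=True)
    line_starts, pos = [], 0
    for line in lines:
        line_starts.append(pos)
        pos += len(line)
    counts = {}
    for index in marked:
        line_no = bisect.bisect_right(line_starts, tokens[index][0]) - 1
        counts[line_no] = counts.get(line_no, 0) + 1
    return [lines[n].rstrip("\r\n") for n in sorted(counts)
            if counts[n] >= threshold]
```

A line such as "On 2013-03-06 00:25, Jane wrote:" collects five marked
tokens, while a stray "2013" in a telephone number collects one and
stays below the threshold.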

(The simpler solution I considered earlier was to perhaps find a module
that can recognise formatted date strings in text, which would seem to
be a fairly standard problem, but apart from many people rolling their
own, limited to a few locales, the better ones offering hooks to add
additional names for days and months, nothing useful turned up. I tried
DateTime::Format::Natural for instance but it split combined dates and
times too much, failed to recognise common ones, and seemed to hang on
larger input texts, besides being seemingly limited to English.)

I have also been thinking about what to do with Outlook blurps like

  -----Original Message-----
  From: ...
  Sent: ...
  ...

It seems to me how these are generated varies with the Outlook version.
For instance, older versions of Outlook Express seem to generate this as
unquoted text and they do not include the Subject, while newer versions
are quoted and do include the subject. It's also important to note that
these blurps are localised. They also tend to include things like those
in attribution lines, but it would require some work to get all the
logic there, like how it ends up with the -bounces address on typical
mailman mailing lists as "From: ...  On Behalf Of ..." or rules for
using phrases versus addresses in To and Cc lines. I figured it would be
easiest to simply look for the Subject, but I would then have to study
how it deals with white space normalization in that header. Also, given
the subject header is not always included, I figure it might be best to
regex that problem away, so I apply roughly this to the raw body (that
is prefixed by a "\n" to avoid dealing with "start of string" versus
"start of line" issues).

  (\n[> ]*)-----[^\n-]+-----\s*
  (\1[^\n]+?:[ ](.+?)){3,}
  (\1\s*\n)

With modifiers sgx this seems to do the trick, short of somebody
including such text on purpose. It would also be nice if there was
better support for intersections and negations in regular expression
implementations, for instance, this would be more robust if the middle
part was prevented from matching "empty" lines (at the same quote
nesting level). Parts matched by this are in <b> with red background
colour. (And as an aside, quoted text is replaced by '.' if that is not
obvious).
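
For illustration, a Python rendering of roughly that pattern (the
original is applied in Perl). In this standalone form the quote prefix
is the first capturing group, so the later parts refer back to it as
`\1`; `re.DOTALL` and `re.VERBOSE` stand in for the `s` and `x`
modifiers, and `finditer` for `g`. The name `find_blurbs` is an
assumption.

```python
import re

BLURB = re.compile(r"""
    (\n[> ]*)-----[^\n-]+-----\s*   # dashed title line at some quote level
    (?:\1[^\n]+?:[ ].+?){3,}        # three or more "Header: value" lines
    \1\s*\n                         # empty line at the same quote level
""", re.DOTALL | re.VERBOSE)

def find_blurbs(raw_body):
    """Return the text of each suspected Outlook blurb in a mail body."""
    # Prefix a newline so a blurb on the very first line matches too.
    return [m.group(0) for m in BLURB.finditer("\n" + raw_body)]
```

On a body containing a dashed "Original Message" line followed by From,
Sent, To and Subject lines and then an empty line, this returns the
blurb up to and including the empty line, and nothing of the text after
it. Like the pattern above, it does not stop the middle part from
running across empty lines.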

(Some people seem to trim the blurp, I am not sure yet how to handle
that, if at all. It can also incorrectly match on PGP signatures, an
example is <news:4B0B4778-37BF-4B9C-99A8-865B7BAD59A0@cdt.org> which
quotes a PGP-signed message. That would be addressed by the "no empty
lines" rule I mentioned above but that's not easily available... I'd
think I can cheat by requiring upper- and lowercase letters in whatever
"Original Message" is translated to, which would work for German and
English and genuine PGP signatures, but beyond that I lack empirical
data...)
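
The mixed-case cheat could look like this in Python; the function name
`looks_like_blurb_title` is hypothetical, and the check is applied to
the text between the two runs of dashes.

```python
def looks_like_blurb_title(title):
    """True if the dashed-line text has both upper- and lowercase letters."""
    has_lower = any(c.islower() for c in title)
    has_upper = any(c.isupper() for c in title)
    return has_lower and has_upper
```

"Original Message" passes, while an all-uppercase PGP armour line such
as "BEGIN PGP SIGNATURE" does not.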

(Also, the Outlook blurp sometimes appears without the initial dashed
line, possibly generated by webmail products by the same vendor. That
seems to be a bit harder to deal with without generating more false
positives, so long as the contents of such blurps are largely ignored.)

Some other things I've noticed include that apparently a popular webmail
product likes to add entirely redundant duplicates of addresses, think

  <example@example.com>

becoming

  <example@example.com<mailto:example@example.com>>

Also for 'http:' addresses, and sometimes `<javascript:;>` is the added
text. Primarily in supposedly quoted text. And that's just the latest in
a long list of mail crimes. Something also seems to add `*` characters
in quoted text, partly in similarly mangled addresses, and sometimes it
seems for emphasis. I have no idea how people sleep at night when their
code generates rubbish like these links, perhaps people just snap after
a while of reading MIME specs. A sanctuary for webmail programmers may
be in order.

I will probably regex the address additions, marking the tokens they add
as quoted if the surrounding text is quoted, and probably add a general
rule that non-word tokens surrounded by quoted text are also considered
quoted. That is needed anyway to handle inline ">" ...
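
A possible regex for the mailto case, sketched in Python. It only covers
the `<address<mailto:address>>` form shown above, not the 'http:' or
`<javascript:;>` variants, and the names `added_spans` and
`strip_additions` are assumptions.

```python
import re

# "<addr<mailto:addr>>" where both copies of the address are identical.
DUPLICATE = re.compile(r"<([^<>\s]+)(<mailto:\1>)>")

def added_spans(text):
    """Return (start, end) offsets of the redundant '<mailto:...>' parts."""
    return [m.span(2) for m in DUPLICATE.finditer(text)]

def strip_additions(text):
    """Return the text with the redundant additions removed."""
    return DUPLICATE.sub(r"<\1>", text)
```

The spans are what one would map back to tokens in order to mark the
added text as quoted; stripping is shown only as the simpler
alternative.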

(Sample document attached.)

regards.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments

text/html attachment: 2012-07.mail.html
Received on Tuesday, 5 March 2013 23:27:19 UTC