First steps in e-mail content tagging from Bjoern Hoehrmann on 2013-01-27 (www-archive@w3.org from January 2013)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 27 Jan 2013 02:48:52 +0100
To: www-archive@w3.org
Message-ID: <9lo8g8lnqfcj5oi6g62rcf43fqs5j8brad@hive.bjoern.hoehrmann.de>
So,

  Trying to whip up some proof-of-concept code and data to tag parts of
e-mails in order to understand their structure, identifying quoted text,
signatures, and so on, I've started with choosing a text corpus for use
in testing. I've settled on the mailing list archives of the main IETF
mailing list for the year 2012. There are several reasons for choosing
it. First, from ftp://ftp.ietf.org/ietf-mail-archive/ietf anyone can ob-
tain a copy, they are generally, as far as I understand it anyway, IETF
Contributions, which means you can at least look up usage restrictions;
being the main mailing list means that there is a sufficiently large set
of individuals, mail clients, posting habits, and so on, but generally,
posters there are more of the tech-savvy kind; there are plenty of fre-
quent posters, but also people who only post once. It's a mailman list
during 2012, which is fairly standard these days, and so you also have
typical defective mails, like replies to Digest mails; with over 5,000
postings the corpus is reasonably large. It's fairly normal there to
quote from texts not previously sent to the list, which is something I
am particularly interested in. There are long-running threads, but also
plain announcements without replies. I assume there are some cross-
postings in the corpus but have not checked that yet (threads that've
started on some other list are an interesting case).

And so on, one particular benefit is that I read the list myself, and I
am familiar with the topics discussed there and many of the people that
discuss them there, so sanity-checking results from any code or data I
may come up with is relatively easy. A smaller benefit is that the mails
come in the form of "mbox" files. "mbox", that dreaded format, and it's
most misleading to use the singular there. So, having decided on some
mail corpus, the first step would be to take the mbox data apart into
individual mails. With "mbox" this is not really easy. I've been once
more through the list of CPAN modules for mbox and general mail handling
only to find that they are still not very usable, with the kind but ul-
timately unhelpful assistance of the #perl channel on Freenode. For in-
stance, the IETF .mail files start with an empty line. One of the CPAN
modules I've tried could not handle that (at least not without my spe-
cifying some file offset to skip to before trying to read from the file,
which is hard to specify robustly if you do not want to guess that the
empty line is over at one, or maybe two bytes, depending on newline
normalization, and you have to either guess or open the file, check, and
then let the module open the file again, because it only accepts file
names, not file handles or anything else, which is bad, because you may
not actually have a file that can be opened multiple times, like when
you try to read from STDIN).
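
To illustrate the kind of interface I was looking for, here is a minimal
sketch in Python rather than Perl. It reads from any binary stream, so
STDIN works, and skips whatever precedes the first separator line, such
as a leading empty line. The function name `split_mbox` is made up for
this example, and the sketch assumes that every line starting with
"From " is a separator, which the various mbox conventions do not agree
on.

```python
import io
from typing import BinaryIO, Iterator


def split_mbox(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each message, without its "From " separator line, as bytes."""
    current = None  # lines of the message being collected; None before the first separator
    for line in stream:
        if line.startswith(b"From "):
            # A separator closes the previous message, if any, and opens a new one.
            if current is not None:
                yield b"".join(current)
            current = []
        elif current is not None:
            current.append(line)
        # Lines before the first separator (e.g. a leading empty line) are dropped.
    if current is not None:
        yield b"".join(current)


# Small in-memory demonstration; the addresses are placeholders.
sample = io.BytesIO(
    b"\n"
    b"From a@example.org Sun Jan 27 02:48:52 2013\n"
    b"Subject: one\n\nbody one\n\n"
    b"From b@example.org Sun Jan 27 03:00:00 2013\n"
    b"Subject: two\n\nbody two\n"
)
messages = list(split_mbox(sample))
print(len(messages))
```

Passing `sys.stdin.buffer` instead of the `io.BytesIO` object would read
the same data from a pipe, without any need to open a file twice.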

Looking at my code I seem to have settled on `Mail::Mbox::MessageParser`
as a vaguely compatible parser. I haven't checked too carefully if it can
handle the IETF .mail files correctly in all cases, or whether the IETF
files actually can be handled correctly in all cases at all, which is
one of the problems with the myriad of broken .mbox file conventions,
but ... people trying to write quick and ad-hoc proof-of-concept code
can't be choosers. They can still be beggars for better software though.

The software plight continues in the next step, extracting plain text
and perhaps some headers from e-mails. I could not find decent software
that would give me the plain, ideally character-encoding-decoded, text,
not if I want to handle, say, various multipart/* types as they are not
so uncommonly used, even getting the plain text bytes from those doesn't
seem to be implemented in any CPAN module, and I also could not find a
module that would handle MIME word encoded headers (the =?...?= encoding
used for non-ASCII in Subject: and From: and so on) properly, much less
a simple overview whether and how I should apply decoding for that to,
say, Message-Id or Summary or whatever... So I simply ended up using the
relatively ancient MIME-tools package and wrote my own ad-hoc code for
some of the more common types like `multipart/alternative`, and I hope
that using `Mail::Field` will at least extract `charset` parameters
properly. Using the encoding name thus obtained, I use the Encode module to
decode the bytes, and then re-encode the text to UTF-8, so I can handle
non-UTF-8 sequences somewhere higher up. This of course already triggers
some errors, exceptions actually, because some messages declare one en-
coding, but then use a different one... I catch those exceptions and ig-
nore the message.
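
For comparison, the same steps can be sketched with Python's standard
`email` package. This is not the Perl code I actually used. The helper
names `plain_text` and `decoded_header` are invented for the example,
and picking the first `text/plain` part is a simplifying assumption, not
a complete treatment of multipart/* types.

```python
import email
from email.header import decode_header, make_header
from typing import Optional


def decoded_header(msg, name: str) -> Optional[str]:
    """Decode MIME encoded words (=?...?=) in one header, if it is present."""
    raw = msg.get(name)
    if raw is None:
        return None
    return str(make_header(decode_header(raw)))


def plain_text(raw: bytes) -> Optional[str]:
    """Return the first text/plain part as text, or None to skip the message."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue  # containers and other types, e.g. multipart/alternative
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            return payload.decode(charset)  # strict decoding, raises on bad bytes
        except (LookupError, UnicodeDecodeError):
            # Unknown encoding name, or the bytes do not match the declared
            # encoding: ignore the message, as described above.
            return None
    return None
```

Python strings are already Unicode, so the separate "re-encode to UTF-8"
step would only be needed when writing the text out again.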

As an aside: In times long past, it had already been clear to me that
this whole mail and MIME and whatnot stuff is a complete mess, so I had
been interested in tools that would help to get content into a well-
known-good (or "bad") state, something more tenable than "let us go
shopping". There have been efforts to define alternative formats for
e-mails and possibly mailbox files, like XML-based formats, some even
with Internet-Drafts, which would have fewer built-in limitations than
the on-the-wire format for e-mails, like they would not have to be 7-
bit formats, so you could use non-ASCII characters directly without any
further encoding -- and the hope was that with such a format defined,
someone might make a good tool that takes the "legacy" content and con-
verts it to a better, higher-level format, say take mbox files and de-
code the transfer- and other encodings, and split the content into some-
thing that can be processed more easily than "write your own parser", as
might have been possible with some XML format, or JSON for all I care,
but neither the formats, nor, for all I know, the tools have so far ma-
terialized. So there still is the barrier-of-entry problem that I've
ranted about above. Anyways...

The most promising line of investigation seems to be the identification
of non-quoted text. A suitable first step for that is identification of
quoted text, which is basically a "find repeated substrings" problem. I
naturally looked for existing solutions for that problem, and obviously
there are some (for instance, a DEFLATE compressor would identify sub-
strings that are repeated within a string because DEFLATE works by en-
coding the repetitions as copy instructions, which kind-of brings me
back to my work on http://bjoern.hoehrmann.de/pngwolf/ because there the
major unsolved problem was and still is that I could not find a DEFLATE
encoder that can persist the "find repeating substrings" information, so
that minor changes in a string, or concatenating two strings, would not
require re-analysing the whole string from scratch, which is what makes pngwolf
a tad bit slow) but then you get into the usual problem that there is an
implementation of suffix trees, but the easily usable interface to it
assumes your string elements are "bytes" or "characters" while I am more
interested in "words" and "separators", or they do too much work at once
or consume too much memory and so on and so forth...

So, rolling my own, I've made one script that extracts "words" from mbox
files, currently simply as a list of words over all the mails, and one
that finds repeated substrings. As a first approximation, the second
script simply built an index of all words, and then used a loop to build
an index of all two-word sequences, three-word sequences, and so on, simply
by taking an indexed sequence and looking at what comes as the next word
across the whole corpus. With long repeated sequences that's not very
time efficient, but limiting the sequence length to e.g. two words, the
whole corpus is indexed in a minute or two, using an entirely unoptimized
Perl script, which is good enough for my purposes.
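
The indexing loop described above might look roughly like this in
Python. It is a sketch, not the Perl script itself. The definition of a
"word" as a run of word characters, and the names `tokenize` and
`index_sequences`, are assumptions made for the example.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")  # assumption: a "word" is a run of word characters


def tokenize(texts):
    """Return one flat list of words over all mail texts."""
    words = []
    for text in texts:
        words.extend(WORD.findall(text))
    return words


def index_sequences(words, max_len=2):
    """Map each word sequence of length 1..max_len to its start positions."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[(word,)].append(pos)
    current = dict(index)
    for length in range(2, max_len + 1):
        longer = defaultdict(list)
        # Extend every indexed sequence by the word that follows it.
        for seq, positions in current.items():
            for pos in positions:
                nxt = pos + length - 1
                if nxt < len(words):
                    longer[seq + (words[nxt],)].append(pos)
        index.update(longer)
        current = longer
    return index
```

Since the word list is flat, sequences can span the boundary between two
mails here; a real run would need to track where each mail starts.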

A current goal is to see how well one might re-create a thread structure
using only "quoted text" dependency data. That requires a bit more than
simply "repeated text", you want to look for "much repeated text" and
"long repeated substrings", so my current thinking is that I will index
all "words" and all "two-word sequences" and then use that data to build
a data structure that kinda looks like ... Well consider this:

  1191616,1191661,1191975: consider ending
  1191617,1191662,1191976: ending these
  1191618,1191663,1191977: these interminable
  1191619,1191664,1191978: interminable venue
  1191620,1191665,1191979: venue discussions

That's the index for the named two-word sequences for the text "consider
ending these interminable venue discussions" occurring three times in the
corpus. The numbers are indices into the list of all words across all
mails, and as such the data gives you "many of the words in mail A are
also in mail B and C" but it does not give you "long", which is needed,
as simple word repetitions are too common. To get "long" I think a quick
and easy way would be to annotate each index in the list of all words in
all mails with run length and starting point information. So...

  Word#1191616 "consider" (this word also appears at index 1191661 as
                           part of an identical sequence that starts
                           zero words before this point and is five
                           words long)
                          (this word also appears at index 1191975 ...)

Where the relevant information can easily be derived from the index of
words and two-word sequences simply by checking the data for the rele-
vant list index and the related indices +1 (or -1). That data would give
some idea about the length of repeated sequences, with various problems,
like "words" or "separators" that get in the way (say the "$initials>"-
style of attributing quotations, where the initials may occur as "word"
that interrupts the sequence). The hope is that this would yield enough
data to make a first stab at quoted-text-based re-threading using some
fixed thresholds (say, if 3 sequences of length at least 5 words appear
in two mails, there probably is some "same-author" or "ancestor/descen-
dant in the thread" relationship between the mails).
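
A rough Python sketch of that idea follows, under several assumptions:
`mail_of` is a hypothetical list giving the mail number for each word
position, runs are measured directly on the word list instead of being
derived from the stored index, and nothing is done about "words" that
interrupt a sequence. The thresholds are the ones from the example
above, not tested values.

```python
from collections import defaultdict
from itertools import combinations


def repeated_runs(words, mail_of, min_len=5):
    """Yield (pos_a, pos_b, length) for repeated word runs shared by two mails."""
    bigrams = defaultdict(list)
    for pos in range(len(words) - 1):
        if mail_of[pos] == mail_of[pos + 1]:  # a sequence must stay inside one mail
            bigrams[(words[pos], words[pos + 1])].append(pos)
    for positions in bigrams.values():
        # Quadratic in the number of occurrences; fine for a sketch only.
        for a, b in combinations(positions, 2):
            if mail_of[a] == mail_of[b]:
                continue
            # Report a run only at its starting point: skip this pair if the
            # preceding words match as well and belong to the same two mails.
            if (a > 0 and mail_of[a - 1] == mail_of[a]
                    and mail_of[b - 1] == mail_of[b]
                    and words[a - 1] == words[b - 1]):
                continue
            length = 2
            while (b + length < len(words)
                   and mail_of[a + length] == mail_of[a]
                   and mail_of[b + length] == mail_of[b]
                   and words[a + length] == words[b + length]):
                length += 1
            if length >= min_len:
                yield a, b, length


def related_mails(words, mail_of, min_len=5, min_runs=3):
    """Return pairs of mail numbers sharing at least min_runs long runs."""
    counts = defaultdict(int)
    for a, b, _ in repeated_runs(words, mail_of, min_len):
        counts[(mail_of[a], mail_of[b])] += 1
    return {pair for pair, n in counts.items() if n >= min_runs}
```

Whether a pair found this way is a "same-author" or an "ancestor/
descendant" relationship would still have to be decided by other means,
such as the Date header.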

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 27 January 2013 01:49:22 UTC