E-Mail segments

Hi,

  E-Mail is an interesting communication medium. A rather curious aspect
is that humans seem to adapt themselves to the ever more notable short-
comings of the medium, instead of improving the medium. Consider people
noting at the beginning of a reply that their "responses are inline", or
people manually prefixing their original text with indicators that it is
their original text, rather than text they are quoting, which ordinarily
one would expect the client software to do automatically, just like it'd
be obvious when original text is "inline" in a suitable client.

Back in my day people would be rather strongly encouraged to get their
mail clients in order, with supplemental software like "OE QuoteFix" if
necessary, rather than letting everybody else deal with it, but that is
not an option anymore. So I've been thinking about how to write a tool
that can repair e-mails with broken quoting or lack of threading
information, and issues like that.

As a first step, I will list typical segments of e-mails as I encounter
them in practice. A proper tool would have to identify them in some way.
I will list them as properties of individual characters. Membership in
some "segment" is not exclusive; consider that quotes can be nested, and
signatures may be quoted, and so on. That's the mental image anyway...
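
A minimal sketch of that mental image in Python, assuming a mail body
held as a single string: every character gets its own set of segment
labels, so overlapping segments are unproblematic. The names
`new_labels` and `mark` and the label strings are made up for
illustration.

```python
def new_labels(text):
    """One (initially empty) set of segment labels per character."""
    return [set() for _ in text]


def mark(labels, start, end, name):
    """Add segment `name` to the characters in text[start:end]."""
    for i in range(start, end):
        labels[i].add(name)


# A quoted signature delimiter: both segments cover the same characters.
body = "> -- \n> Joe\n"
labels = new_labels(body)
mark(labels, 0, len(body), "quoted")
mark(labels, 2, 5, "signature-delimiter")
```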

Salutations and greetings: an opening line like "Dear Joe," or my "Hi,"
at the beginning of this e-mail. Such segments should not be quoted in
replies and they tend to be annoying in list discussions, because they
confuse the discussion mode or give mails an inappropriate tone, but
they are a very minor problem. Characteristics are position, length, a
limited set of words, a mention of the name or part of the name of the
author of the mail that is being replied to, the last character on the
line, and a blank line after it.
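
Some of these characteristics can be checked mechanically. A rough
sketch in Python, assuming English greeting words only and a body that
is already split into lines; the word list, the length limit, and the
name `is_salutation` are assumptions, not a tested heuristic.

```python
import re

# Assumption: a tiny English word list; real mails need many more.
GREETING = re.compile(r"^(?:hi|hello|hey|dear)\b.{0,30}$", re.IGNORECASE)


def is_salutation(lines, index):
    """True if lines[index] is near the top, short, starts with a
    greeting word, and is followed by a blank line."""
    if index > 2 or index + 1 >= len(lines):
        return False
    line = lines[index].strip()
    return bool(GREETING.match(line)) and lines[index + 1].strip() == ""
```

This ignores the name of the parent mail's author, which would need
access to the parent's From header.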

Quote attribution: On Usenet I use "* John Doe wrote in comp.example:",
while in mails it is just "* John Doe wrote:" due to lack of group data;
I mainly cite the group name to clarify where I read something that has
been cross-posted. Characteristics are position, structured format, the
character at the end, might cite name of author of parent mail, usually
followed by quoted text, in my case the first character is of interest,
sometimes there are keywords like "You" and "wrote", can include other
properties of the parent message like Message-Id, often includes a date.
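
For the one format quoted above, the structure is regular enough for a
pattern. A sketch in Python that covers only the "* Name wrote:" and
"* Name wrote in group:" shapes; other clients' attribution lines would
need their own patterns, and `parse_attribution` is a hypothetical name.

```python
import re

# Assumption: only the "* Name wrote[ in group]:" shape described above.
ATTRIBUTION = re.compile(
    r"^\* (?P<author>.+?) wrote(?: in (?P<group>\S+))?:$"
)


def parse_attribution(line):
    """Return {'author': ..., 'group': ...}, or None if nothing matches."""
    m = ATTRIBUTION.match(line.rstrip())
    return m.groupdict() if m else None
```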

Attribution block: "-----Original Message-----" followed by headers from
the parent message mangled in various ways. Sometimes the field names or
the "Original Message" block is localised. Can appear in replies, but
also in forwards. Typically the referenced mail is quoted afterwards but
usually without any quote indicators. There can be multiple such blocks
in a mail. Original text tends to appear above. Characteristics are the
format and contents of the block and its position relative to quoted
text. I imagine mail client identifiers are correlated. May also appear
with quote indicators as if it came from the parent but actually didn't.
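
A sketch of locating such blocks in Python, assuming the English
"Original Message" wording and unquoted "Name: value" header lines
right after it; localised or mangled variants are not handled, and the
names are illustrative.

```python
import re

# Assumption: English wording only, at least two dashes on each side.
BLOCK_START = re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.IGNORECASE)
HEADER = re.compile(r"^[A-Za-z][A-Za-z-]*:\s")


def find_attribution_blocks(lines):
    """Return (start, end) line index pairs, end exclusive."""
    blocks = []
    i = 0
    while i < len(lines):
        if BLOCK_START.match(lines[i].strip()):
            j = i + 1
            # Swallow the header-like lines that follow the marker.
            while j < len(lines) and HEADER.match(lines[j]):
                j += 1
            blocks.append((i, j))
            i = j
        else:
            i += 1
    return blocks
```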

False From-quoting: A witness to the sad history of the "mbox" file
formats, which combine many e-mail messages in a single file, separated
by "From-lines", meaning lines that start with "From " in some form.
This requires escaping such lines when they appear in the body of mails,
which some of the format designers and implementers kinda forgot to do,
and they never quite agreed on what the exact format of the From-line is
and hence what needs escaping, and so on and so forth. The moral of the
story is that sometimes we get to see mails with, say, ">From " in them,
even though the author meant to send "From " without the ">". The
characteristic is the literal text, plus that what follows it is not an
actual quote. In the uppercase form it would usually come at the
beginning of a "sentence", and there might be original text above and
below it, rather than more quoted text.
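
A sketch of the "original text above and below" test in Python: a
">From " line is only unescaped when neither neighbouring line looks
quoted. It assumes quoted lines start with ">" and knows nothing about
the parent mail; `unescape_from` is a hypothetical name.

```python
def unescape_from(lines):
    """Strip the '>' from '>From ' lines whose neighbours are not quoted."""

    def quoted(i):
        return 0 <= i < len(lines) and lines[i].startswith(">")

    out = []
    for i, line in enumerate(lines):
        if line.startswith(">From ") and not quoted(i - 1) and not quoted(i + 1):
            out.append(line[1:])
        else:
            out.append(line)
    return out
```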

Corporate legal banner: Angstklauseln ("fear clauses") telling you to
burn your hard drive after reading the mail, or something else along
those lines, often attached to postings on mailing lists in violation of
list policies. The characteristics probably include keywords, logical
position after the original text, and repeated use of the same phrase
across mails.

Signature delimiter: Ideally a single line consisting of "-- " but also
sometimes without the trailing space or in a different format altogether
like a long line of dashes. Characteristics are the placement before
some signature and after the original text, and the literal characters
in it. There are various problems, though: messages may have many such
lines due to quoting bugs, actual multiple signatures, and so on.
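
A sketch in Python that reports every candidate line rather than just
the last one, since a mail may contain several. The ten-dash threshold
for "a long line of dashes" is an arbitrary assumption, as is the name
`find_sig_delimiters`.

```python
import re

STRICT = "-- "  # the ideal form, including the trailing space
# Assumption: a bare "--" or ten or more dashes counts as a loose form.
LOOSE = re.compile(r"^(?:--|-{10,})\s*$")


def find_sig_delimiters(lines):
    """Return (index, kind) pairs; kind is 'strict' or 'loose'."""
    hits = []
    for i, line in enumerate(lines):
        if line == STRICT:
            hits.append((i, "strict"))
        elif LOOSE.match(line):
            hits.append((i, "loose"))
    return hits
```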

Short signature or valediction: I sometimes use a single line "regards,"
right before the signature delimiter, could also come in forms like
"Thanks" or the initials or full name of the author. Characteristics are
line length, sometimes indentation, keywords, repeat use, lack of a sig-
nature delimiter, possibly a preceding empty line, after the original
text (with exceptions), in some cases references to the author's name.

Attached constant signature: The classic. Characteristics are hopefully
the appearance right after a signature delimiter at the bottom of mails,
but more likely just somewhere after the original text, though I've also
seen signatures at the top; repeat use of the same text, often contains
resource identifiers, telephone numbers, meatspace addresses, with some
strong correlation between signature and author. Should be at most four
lines, but can be much longer in practice.

Attached randomized signature: Similar to the constant one, but with
less repeat use, for instance, because it contains varying quotations
taken from a quote collection, or the contents may be context-specific,
like some might use newsgroup-specific signatures, or manually select a
special signature, for instance to point new users towards the group
FAQ, and so on.

Random advertisement: As seen on SourceForge, some lines of marketing
blurb with a resource identifier, there typically separated by dashes,
after original text but before the mailing list footer. Might be used
several times, or might have unique text.

Machine-generated footer: "This message is automatically generated by
..." style notices. Generally has the same characteristics as other
kinds of signatures, the main reason to identify them might be to avoid
confusing them with other kinds of signatures.

Mailing list footer: Another workaround for failures in the area of meta
data encoding and rendering, typically contains some form of a reminder
that the mail is coming from a mailing list the recipient is subscribed
to, with redundant instructions on how to unsubscribe or perhaps links
to related resources like information for new subscribers.

Reference list: Also includes footnotes in various forms, generally con-
sisting of one or more `[<identifier>] <link or text>` lines at the end
of the original text. Sometimes different syntax like "1. ..." is used.
One reason to identify such segments is that it might be wise to retain
the expansions when quoting text that refers to them. Characteristics
also include that original text refers to the references.
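
A sketch in Python for the bracketed form only, assuming the lines
passed in are just the original text (signatures and footers already
removed). `split_references` and `cited` are hypothetical helpers; the
second one finds the expansions worth retaining for a piece of text.

```python
import re

REF_LINE = re.compile(r"^\[(?P<id>[^\]\s]+)\]\s+(?P<target>\S.*)$")
REF_USE = re.compile(r"\[([^\]\s]+)\]")


def split_references(lines):
    """Return (body_lines, refs); refs maps identifier -> link or text."""
    end = len(lines)
    while end > 0 and lines[end - 1].strip() == "":
        end -= 1  # ignore trailing blank lines
    start = end
    while start > 0 and REF_LINE.match(lines[start - 1]):
        start -= 1
    refs = {}
    for line in lines[start:end]:
        m = REF_LINE.match(line)
        refs[m.group("id")] = m.group("target")
    return lines[:start], refs


def cited(text_lines, refs):
    """The subset of refs that the given text actually refers to."""
    used = set(REF_USE.findall("\n".join(text_lines)))
    return {k: v for k, v in refs.items() if k in used}
```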

Kammquoting: Result of `wrap(quote($wrapped_text))`, long lines with a
quote indicator interspersed with short lines without quote indicator.
Probably correlates with mail client software. Characteristic is that
the text with and without quote indicator is actually from the parent
message, in addition to the length pattern. Can be confused with other
forms of broken quoting, like writing original text right between two
lines of text with proper quote indicator but no empty lines around it.
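
A sketch of the length pattern alone in Python. It does not check that
the fragments really come from the parent message, so it will confuse
Kammquoting with the other forms of broken quoting mentioned above. The
widths and function names are assumptions.

```python
def kamm_candidates(lines, width=72, short=20):
    """Indexes of short unquoted lines directly after a long quoted line."""
    hits = []
    for i in range(1, len(lines)):
        prev, cur = lines[i - 1], lines[i]
        if (prev.startswith(">") and len(prev) >= width
                and cur.strip() and not cur.startswith(">")
                and len(cur) <= short):
            hits.append(i)
    return hits


def repair_kamm(lines, width=72, short=20):
    """Join each candidate fragment back onto the quoted line before it."""
    hits = set(kamm_candidates(lines, width, short))
    out = []
    for i, line in enumerate(lines):
        if i in hits:
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line)
    return out
```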

Non-breaking space cheat: My mail client does not allow me to have some
overlong line be indented using ordinary white space; consider an overly
long indented address like "  http://example.org/..." which would wrap at
the space. To ensure proper formatting, I prefix such lines as necessary
with a U+00A0 NO-BREAK SPACE which in turn is configured as a quote mark
character, and since my client does not wrap quoted lines, to avoid the
Kammquoting problem, this keeps the formatting intact. Interestingly, I
have seen others do the same.

Pre-formatted original text: Computer code, like source code listings,
perhaps some poetry, ASCII art, and other formatting that's best kept
intact, such as indented lists, say

  * Item 1
  * Item 2

Characteristics include unusually high number of non-alphanumerics, at
times leading white space, lines may be considerably shorter than other
original text, might be contained within a specially marked section,
"--- cut here 8< ---" style perhaps, but generally difficult to make out
without understanding any of the text in it or surrounding it.
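
The first two characteristics can at least be measured per line. A
sketch in Python; the 0.3 threshold is a guess, and a stylistic
text-indent at the start of a paragraph would be flagged as well.

```python
def symbol_ratio(line):
    """Share of non-blank characters that are not letters or digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def looks_preformatted(line, threshold=0.3):
    """Leading white space or many non-alphanumerics (threshold assumed)."""
    return line.startswith((" ", "\t")) or symbol_ratio(line) >= threshold
```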

Reflowable original text: Text manually produced by the poster where
white space only separates words and paragraphs, roughly speaking.

Quote indicators: Short strings at the beginning of a line that mark a
segment as being a quote rather than original text, classical "> " but
sadly some people prefer a bare ">" or "|" or " >" and other variations.
Syntax may also change at deeper levels of nesting, like using "> " for
the top level, and only ">" for deeper levels. A stranger quote
indicator is "±", where I have been wondering whether it is the result
of character encoding problems.
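
A sketch in Python that accepts the ">" and "|" variations with an
optional space on either side and counts the nesting depth. It does not
cover "±" or any other indicator, and `split_quote` is a hypothetical
name.

```python
import re

# Each level: optional blank, then ">" or "|", then optional blank.
QUOTE_PREFIX = re.compile(r"^(?:[ \t]?[>|][ \t]?)+")


def split_quote(line):
    """Return (depth, text) with the quote indicators removed."""
    m = QUOTE_PREFIX.match(line)
    if not m:
        return 0, line
    prefix = m.group(0)
    depth = prefix.count(">") + prefix.count("|")
    return depth, line[m.end():]
```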

Initials-based quote indicators: John Doe quoted as "JD> " or a variant
thereof. Quite rare, but annoying enough to be worth detecting properly.
I'm not sure what the rules are for more complex names, or if there are
consistent rules to begin with.
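
A sketch in Python under the simplest possible guess, namely that the
indicator is the first letter of each white-space separated part of the
name, in upper case. That rule is an assumption and surely wrong for
more complex names.

```python
import re

# Assumption: one to four upper-case letters directly before ">".
INITIALS = re.compile(r"^(?P<initials>[A-Z]{1,4})>\s?")


def initials_of(name):
    """Guessed rule: first letter of every part of the name."""
    return "".join(part[0].upper() for part in name.split())


def split_initials_quote(line, author=None):
    """Return (initials, text), or None if the line does not match.
    If an author name is given, the initials must match that name."""
    m = INITIALS.match(line)
    if not m:
        return None
    if author is not None and m.group("initials") != initials_of(author):
        return None
    return m.group("initials"), line[m.end():]
```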

White space as quote indicators: Used to be popular with TOFU-HTML pro-
ducing clients that also emit attribution blocks.

Stylistic text-indent: The two spaces at the beginning of this mail are
an example. It may be necessary to identify these to avoid confusion
with other segments, like pre-formatted lines. I for instance might try
to make the first paragraph more than a single line long to avoid the
possible ambiguity with quoted text, where I usually use two spaces as
indentation.

Text quoted from ancestors: Text taken from the parent mail, recursively
at times.

Text quoted from unavailable ancestors: Mails can be delivered out of
order, and sometimes mails are filtered out, or replies may be cross-
posted to new recipients that have not received, and won't receive, the
prior parts of the conversation. That would result in text quoted from
ancestors that cannot be found in a dataset. Worse if the mail does not
properly indicate its ancestors through References and In-Reply-To, as
it's then even harder to determine that the parent is missing.
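
When the headers are present, the check is straightforward. A sketch
using Python's standard `email` package, assuming the dataset is
represented as a set of known Message-Id strings and that both header
fields contain only white-space separated message identifiers (old
mails sometimes put other text into In-Reply-To).

```python
from email import message_from_string


def referenced_ids(msg):
    """Message identifiers named in References and In-Reply-To."""
    ids = []
    for field in ("References", "In-Reply-To"):
        ids.extend(msg.get(field, "").split())
    return ids


def missing_ancestors(msg, known_ids):
    """Referenced identifiers that are not in the dataset, in order."""
    unique = dict.fromkeys(referenced_ids(msg))  # drops duplicates
    return [i for i in unique if i not in known_ids]


raw = (
    "Message-Id: <c@example.org>\n"
    "References: <a@example.org> <b@example.org>\n"
    "In-Reply-To: <b@example.org>\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
```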

Text quoted from other sources: When quoting text from outside some mail
thread, I tend to indent it with two spaces, sometimes more when text is
already pre-formatted and indented with spaces. Since I use "> " when I
quote from the thread, it's easy to tell these cases apart. Others sadly
use the same quote indicator for both kinds of quotation.

Flowed parts: Apparently some mis-implementation of format=flowed where
the author broke quoted text into parts, where the remainder of a broken
quoted line starts " >" while subsequent quoted lines start again with
">" without the leading space.

Ellipsis: Sometimes it is useful to explicitly mark omissions from some
text that has only been partially quoted, say "> quoted text [...]". It
may be necessary to pay special attention to this when trying to decide
if some text is being quoted from the parent mail, for instance. Comes
in forms like "[...]", "(...)", bare "...", possibly in non-ASCII form.
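
A sketch of such special attention in Python: the quoted text is split
at ellipsis markers and the remaining fragments must occur in the
parent in the same order, ignoring differences in white space. It
assumes quote indicators were already stripped, and a bare "..." that
is part of the parent's own text would be misread as an omission.

```python
import re

# "[...]", "(...)", bare "...", and U+2026 as one non-ASCII form.
ELLIPSIS = re.compile(r"\[\.\.\.\]|\(\.\.\.\)|\.\.\.|\u2026")


def normalise(text):
    """Collapse all runs of white space to single spaces."""
    return " ".join(text.split())


def quoted_from(quote, parent):
    """True if the fragments between ellipses occur in parent, in order."""
    haystack = normalise(parent)
    pos = 0
    for fragment in ELLIPSIS.split(quote):
        fragment = normalise(fragment)
        if not fragment:
            continue
        found = haystack.find(fragment, pos)
        if found < 0:
            return False
        pos = found + len(fragment)
    return True
```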

Frankentext: Text smushed together with other text, for instance in some
signature that's appended somewhere without newlines. Sometimes comes in
a form like "ThanksFrom: ..." or "Joe-- \nsignature" or something like
that.

Frankenmails: Comments on various other mails stuffed into one. Bizarre
practice that breaks threading and encourages people to write things not
worth writing a separate mail for. It's useful to detect these so they
can be filtered out as spam or trolling, but also necessary to avoid the
possible confusion that may result from detecting mails with many parent
mails.

False quotes: I also have some mails here with lines starting with ">"
that are not quotes at all. Some are from revision control systems where
">" is part of the diff syntax or of the computer code; another case is
a banner like ">>> This is an automated email, please do not reply.
<<<".

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Monday, 21 January 2013 18:07:29 UTC