- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 21 Jan 2013 19:07:00 +0100
- To: www-archive@w3.org
Hi,

  E-Mail is an interesting communication medium. A rather curious aspect is that humans seem to adapt themselves to the ever more notable shortcomings of the medium instead of improving the medium. Consider people noting at the beginning of a reply that their "responses are inline", or people manually prefixing their original text with indicators that it is their original text rather than text they are quoting. Ordinarily one would expect the client software to do this automatically, just as it would be obvious in a suitable client when original text is "inline".

Back in my day people would be rather strongly encouraged to get their mail clients in order, with supplemental software like "OE QuoteFix" if necessary, rather than letting everybody else deal with it, but that is not an option anymore. So I've been thinking about how to write a tool that can repair e-mails with broken quoting, missing threading information, and similar issues.

As a first step, I will list typical segments of e-mails as I encounter them in practice. A proper tool would have to identify them in some way. I will list them as properties of individual characters. Membership in a "segment" is not exclusive: quotes can be nested, signatures may be quoted, and so on. That's the mental image anyway...

Salutations and greetings: An opening line like "Dear Joe," or my "Hi," at the beginning of this e-mail. Such segments should not be quoted in replies, and they tend to be annoying in list discussions because they confuse the discussion mode or give mails an inappropriate tone, but they are a very minor problem. Characteristics are position, length, a limited set of words, citing the name or part of the name of the author of the mail being replied to, the last character on the line, and a blank line after it.
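
A minimal sketch in Python of how the salutation characteristics listed above might be checked. The greeting word list, the 40-character length limit, the accepted end characters, and the function name are all assumptions made for illustration; they do not come from any existing tool.

```python
import re

# Hypothetical word list; the text only says "limited set of words".
GREETING_WORDS = {"hi", "hello", "hey", "dear"}


def looks_like_salutation(lines, parent_author=""):
    """Guess whether the first non-blank line of a mail body is a salutation.

    `lines` is the body split into lines; `parent_author` is the display
    name of the author of the mail being replied to, if known.
    """
    # Position: only the first non-blank line is considered.
    idx = next((i for i, l in enumerate(lines) if l.strip()), None)
    if idx is None:
        return False
    line = lines[idx].strip()

    # Length: salutations are short (the limit of 40 is an assumption).
    if len(line) > 40:
        return False

    # Last character on the line, e.g. "Hi," or "Hello!".
    if line[-1] not in ",!:.":
        return False

    # A blank line (or the end of the mail) should follow.
    if idx + 1 < len(lines) and lines[idx + 1].strip():
        return False

    # Limited set of words, or (part of) the parent author's name.
    words = re.findall(r"[^\W\d_]+", line.lower())
    if not words:
        return False
    name_parts = {p.lower() for p in re.findall(r"[^\W\d_]+", parent_author)}
    return words[0] in GREETING_WORDS or any(w in name_parts for w in words)
```

For instance, `looks_like_salutation(["Joe,", "", "thanks for the patch."], "Joe Bloggs")` relies only on the name match. A real tool would presumably weigh these signals rather than require all of them.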

Quote attribution: On Usenet I use "* John Doe wrote in comp.example:", while in mails it is just "* John Doe wrote:" due to lack of group data; I mainly cite the group name to clarify where I read something that has been cross-posted. Characteristics are position, a structured format, the character at the end, and that it is usually followed by quoted text. It might cite the name of the author of the parent mail, sometimes contains keywords like "You" and "wrote", can include other properties of the parent message like the Message-Id, and often includes a date. In my case the first character is of interest.

Attribution block: "-----Original Message-----" followed by headers from the parent message, mangled in various ways. Sometimes the field names or the "Original Message" line are localised. Can appear in replies, but also in forwards. Typically the referenced mail is quoted afterwards, but usually without any quote indicators. There can be multiple such blocks in a mail. Original text tends to appear above. Characteristics are the format and contents of the block and its position relative to quoted text. I imagine mail client identifiers are correlated. May also appear with quote indicators as if it came from the parent when it actually didn't.

False From-quoting: A witness to the sad history of the "mbox" file formats, which combine many e-mail messages in a single file, separated by "From-lines", meaning lines that start with "From " in some form. That requires escaping such lines when they appear in the body of mails, which some of the format designers and implementers kinda forgot to do, and they never quite agreed on the exact format of the From-line and hence on what needs escaping, and so on and so forth. The moral of the story is that we sometimes get to see mails with, say, ">From " in them, even though the author meant to send "From " without the ">". Characteristic is the literal text plus that what follows it is not an actual quote.
In the uppercase form it would usually come at the beginning of a "sentence", and there might be original text above and below it, rather than more quoted text.

Corporate legal banner: Angstklauseln ("fear clauses") telling you to burn your hard drive after reading the mail, or something else along those lines, often attached to postings on mailing lists in violation of list policies. The characteristics probably include keywords, logical position after the original text, and repeat use of the same phrase across mails.

Signature delimiter: Ideally a single line consisting of "-- ", but sometimes without the trailing space, or in a different format altogether like a long line of dashes. Characteristics are the placement before some signature and after the original text, and the literal characters in it. There are various problems: messages may have many such lines due to quoting bugs, actual multiple signatures, and so on.

Short signature or valediction: I sometimes use a single line "regards," right before the signature delimiter; it could also come in forms like "Thanks" or the initials or full name of the author. Characteristics are line length, sometimes indentation, keywords, repeat use, lack of a signature delimiter, possibly a preceding empty line, position after the original text (with exceptions), and in some cases references to the author's name.

Attached constant signature: The classic. Characteristics are hopefully the appearance right after a signature delimiter at the bottom of mails, but more likely just somewhere after the original text, though I've also seen signatures at the top. Further characteristics are repeat use of the same text and a strong correlation between signature and author; it often contains resource identifiers, telephone numbers, and meatspace addresses. Should be at most four lines, but can be much longer in practice.
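
The signature delimiter variants described above could be matched roughly as follows. This is only a sketch in Python: the function name, the threshold of ten dashes for a "long line of dashes", and the choice to split at the last unquoted delimiter are assumptions, and a mailing list footer separated by dashes would be mistaken for a signature here.

```python
import re

# The conventional delimiter is exactly "-- " (two dashes and a space).
STRICT_DELIMITER = "-- "

# Looser variants mentioned in the text: no trailing space, or a long line
# of dashes. The minimum of ten dashes is an assumption.
LOOSE_DELIMITER = re.compile(r"^(--|-{10,})\s*$")


def split_signature(body):
    """Split a mail body at the last unquoted signature delimiter.

    Returns (text, signature, strict), where `strict` tells whether the
    delimiter was the conventional "-- " line. If no delimiter is found,
    the signature is empty.
    """
    lines = body.split("\n")
    # Search from the bottom, since the signature follows the original text.
    for i in range(len(lines) - 1, -1, -1):
        line = lines[i]
        if line.startswith(">"):
            continue  # a quoted delimiter belongs to a quoted signature
        strict = line == STRICT_DELIMITER
        if strict or LOOSE_DELIMITER.match(line):
            text = "\n".join(lines[:i])
            signature = "\n".join(lines[i + 1:])
            return text, signature, strict
    return body, "", False


def is_short_signature(signature, max_lines=4):
    """Check the 'at most four lines' convention for a signature."""
    return 0 < len(signature.splitlines()) <= max_lines
```

Returning the `strict` flag separately keeps the information about which delimiter form was used, which might itself be a useful feature for correlating with mail clients.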

Attached randomized signature: Similar to the constant one, but with less repeat use, for instance because it contains varying quotations taken from a quote collection, or because the contents are context-specific: some might use newsgroup-specific signatures, or manually select a special signature, for instance to point new users towards the group FAQ, and so on.

Random advertisement: As seen on SourceForge, some lines of marketing blurb with a resource identifier, there typically separated by dashes, after the original text but before the mailing list footer. Might be used several times, or might have unique text.

Machine-generated footer: "This message is automatically generated by ..." style notices. Generally has the same characteristics as other kinds of signatures; the main reason to identify them might be to avoid confusing them with other kinds of signatures.

Mailing list footer: Another workaround for failures in the area of meta data encoding and rendering. Typically contains some form of reminder that the mail is coming from a mailing list the recipient is subscribed to, with redundant instructions on how to unsubscribe, or perhaps links to related resources like information for new subscribers.

Reference list: Also includes footnotes in various forms, generally consisting of one or more `[<identifier>] <link or text>` lines at the end of the original text. Sometimes a different syntax like "1. ..." is used. One reason to identify such segments is that it might be wise to retain the expansions when quoting text that refers to them. Characteristics also include that the original text refers to the references.

Kammquoting: Result of `wrap(quote($wrapped_text))`: long lines with a quote indicator interspersed with short lines without a quote indicator. Probably correlates with mail client software. Characteristic is that the text both with and without quote indicator is actually from the parent message, in addition to the length pattern.
Can be confused with other forms of broken quoting, like writing original text right between two lines of text with proper quote indicators but no empty lines around it.

Non-breaking space cheat: My mail client does not allow me to indent an overlong line using ordinary white space; consider an overly long indented address like " http://examle.org/..." which would wrap at the space. To ensure proper formatting, I prefix such lines as necessary with a U+00A0 NO-BREAK SPACE, which in turn is configured as a quote mark character. Since my client does not wrap quoted lines, to avoid the Kammquoting problem, this keeps the formatting intact. Interestingly, I have seen others do the same.

Pre-formatted original text: Computer code, like source code listings, perhaps some poetry, ASCII art, and other formatting that's best kept intact, like indented lists such as, say

  * Item 1
  * Item 2

Characteristics include an unusually high number of non-alphanumerics and at times leading white space. Lines may be considerably shorter than other original text, and the text might be contained within a specially marked section, "--- cut here 8< ---" style perhaps. It is generally difficult to make out without understanding any of the text in it or surrounding it.

Reflowable original text: Text manually produced by the poster where white space only separates words and paragraphs, roughly speaking.

Quote indicators: Short strings at the beginning of a line that mark a segment as being a quote rather than original text, classically "> ", but sadly some people prefer a bare ">" or "|" or " >" and other variations. Syntax may also change at deeper levels of nesting, like using "> " for the top level and only ">" for deeper levels. "±" is a stranger quote indicator, where I have been wondering if it is the result of character encoding problems.

Initials-based quote indicators: John Doe quoted as "JD> " or a variant thereof. Quite rare, but annoying enough to be worth detecting properly.
I'm not sure what the rules are for more complex names, or if there are consistent rules to begin with.

White space as quote indicators: Used to be popular with TOFU-HTML producing clients that also emit attribution blocks.

Stylistic text-indent: The two spaces at the beginning of this mail are an example. It may be necessary to identify these to avoid confusion with other segments, like pre-formatted lines. I for instance might try to make the first paragraph more than a single line long to avoid the possible ambiguity with quoted text, where I usually use two spaces as indentation.

Text quoted from ancestors: Text taken from the parent mail, recursively at times.

Text quoted from unavailable ancestors: Mails can be delivered out of order, sometimes mails are filtered out, and replies may be cross-posted to new recipients that have not received, and won't receive, the prior parts of the conversation. That results in text quoted from ancestors that cannot be found in a dataset. It is worse if the mail does not properly indicate its ancestors through References and In-Reply-To, as it is then even harder to determine that the parent is missing.

Text quoted from other sources: When quoting text from outside some mail thread, I tend to indent it with two spaces, sometimes more when the text is already pre-formatted and indented with spaces. Since I use "> " when I quote from the thread, it's easy to tell these cases apart. Others sadly use the same quote indicator for both kinds of quotation.

Flowed parts: Apparently some mis-implementation of format=flowed where the author broke quoted text into parts: the remainder of a broken quoted line starts with " >", while subsequent quoted lines start again with ">" without the leading space.

Ellipsis: Sometimes it is useful to explicitly mark omissions from some text that has only been partially quoted, say "> quoted text [...]".
It may be necessary to pay special attention to this when trying to decide if some text is being quoted from the parent mail, for instance. Comes in forms like "[...]", "(...)", bare "...", possibly in non-ASCII form.

Frankentext: Text smushed together with other text, for instance in some signature that's appended somewhere without newlines. Sometimes comes in a form like "ThanksFrom: ..." or "Joe-- \nsignature" or something like that.

Frankenmails: Comments on various other mails stuffed into one. A bizarre practice that breaks threading and encourages people to write things not worth writing a separate mail for. It's useful to detect these so they can be filtered out as spam or trolling, but also necessary to avoid the confusion that may result from detecting mails with many parent mails.

False quotes: I also have some mails here with lines starting with ">" for no reason. Some are from revision control systems where ">" is part of the diff syntax or part of the computer code, and one is a case of ">>> This is an automated email, please do not reply. <<<"

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Monday, 21 January 2013 18:07:29 UTC